talk-data.com

Topic

Iceberg

Apache Iceberg

table_format data_lake schema_evolution file_format storage open_table_format

78 tagged

Activity Trend: 39 peak/qtr, 2020-Q1 to 2026-Q1

Activities

78 activities · Newest first

Summary In this episode of the Data Engineering Podcast Hannes Mühleisen and Mark Raasveldt, the creators of DuckDB, share their work on DuckLake, a new entrant in the open lakehouse ecosystem. They discuss how DuckLake focuses on simplicity and flexibility and offers a unified catalog and table format, in contrast to other lakehouse formats like Iceberg and Delta. Hannes and Mark share insights into how DuckLake changes data architecture by enabling local-first data processing, simplifying deployment of lakehouse solutions, and offering benefits such as encryption features, data inlining, and integration with existing ecosystems.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Hannes Mühleisen and Mark Raasveldt about DuckLake, the latest entrant into the open lakehouse ecosystem.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what DuckLake is and the story behind it?
What are the particular problems that DuckLake is solving for?
How does this compare to the capabilities of MotherDuck?
Iceberg and Delta already have a well established ecosystem, but so does DuckDB. Who are the primary personas that you are trying to focus on in these early days of DuckLake?
One of the major factors driving the adoption of formats like Iceberg is cost efficiency for large volumes of data. That brings with it challenges of large batch processing of data. How does DuckLake account for these axes of scale?
There is also a substantial investment in the ecosystem of technologies that support Iceberg. The most notable ecosystem challenge for DuckDB and DuckLake is in the query layer. How are you thinking about the evolution and growth of that capability beyond DuckDB (e.g. support in Trino/Spark/Flink)?
What are your opinions on the viability of a future where DuckLake and Iceberg become a unified standard and implementation? (why can't Iceberg REST catalog implementations just use DuckLake under the hood?)
Digging into the specifics of the specification and implementation, what are some of the capabilities that it offers above and beyond Iceberg?
Is it now possible to enforce PK/FK constraints, indexing on underlying data?
Given that DuckDB has a vector type, how do you think about the support for vector storage/indexing?
How do the capabilities of DuckLake and the integration with DuckDB change the ways that data teams design their data architecture and access patterns?
What are your thoughts on the impact of "data gravity" in today's data ecosystem, with engines like DuckDB, KuzuDB, LanceDB, etc. available for embedded and edge use cases?
What are the most interesting, innovative, or unexpected ways that you have seen DuckLake used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckLake?
When is DuckLake the wrong choice?
What do you have planned for the future of DuckLake?
Contact Info
Hannes: Website
Mark: Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
DuckDB (Podcast Episode); DuckLake; DuckDB Labs; MySQL; CWI; MonetDB; Iceberg; Iceberg REST Catalog; Delta; Hudi; Lance; DuckDB Iceberg Connector; ACID == Atomicity, Consistency, Isolation, Durability; MotherDuck; MotherDuck Managed DuckLake; Trino; Spark; Presto; Spark DuckLake Demo; Delta Kernel; Arrow; dlt; S3 Tables; Attribute Based Access Control (ABAC); Parquet; Arrow Flight; Hadoop; HDFS; DuckLake Roadmap
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tristan digs deep into the world of Apache Iceberg. There's a lot happening beneath the surface: multiple catalog interfaces, evolving REST specs, and competing implementations across open source, proprietary, and academic contexts. Christian Thiel, co-founder of Lakekeeper, one of the most widely used Iceberg catalogs, joins to walk through the state of the Iceberg ecosystem. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

Summary In this episode of the Data Engineering Podcast Prashanth Rao, an AI engineer at KuzuDB, talks about their embeddable graph database. Prashanth explains how KuzuDB addresses performance shortcomings in existing solutions through columnar storage and novel join algorithms. He discusses the usability and scalability of KuzuDB, emphasizing its open-source nature and potential for various graph applications. The conversation explores the growing interest in graph databases due to their AI and data engineering applications, and Prashanth highlights KuzuDB's potential in edge computing, ephemeral workloads, and integration with other formats like Iceberg and Parquet.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Prashanth Rao about KuzuDB, an embeddable graph database.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what KuzuDB is and the story behind it?
What are the core use cases that Kuzu is focused on addressing?
What is explicitly out of scope?
Graph engines have been available and in use for a long time, but generally for more niche use cases. How would you characterize the current state of the graph data ecosystem?
You note scalability as a feature of Kuzu, which is a phrase with many potential interpretations. Typically horizontal scaling of graphs has been complicated. In what sense does Kuzu make that claim?
Can you describe some of the typical architecture and integration patterns of Kuzu?
What are some of the more interesting or esoteric means of architecting with Kuzu?
For cases where Kuzu is rendering a graph across an external data repository (e.g. Iceberg, etc.), what are the patterns for balancing data freshness with network/compute efficiency? (e.g. read and create every time or persist the Kuzu state)
Can you describe the internal architecture of Kuzu and key design factors?
What are the benefits and tradeoffs of using a columnar store with adjacency lists vs. a more graph-native storage format?
What are the most interesting, innovative, or unexpected ways that you have seen Kuzu used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kuzu?
When is Kuzu the wrong choice?
What do you have planned for the future of Kuzu?
Contact Info
Website; LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
KuzuDB; BERT; Transformer Architecture; DuckDB (Podcast Episode); MonetDB; Umbra DB; sqlite; Cypher Query Language; Property Graph; Neo4J; GraphRAG; Context Engineering; Write-Ahead Log; Bauplan; Iceberg; DuckLake; Lance; LanceDB; Arrow; Polars; Arrow DataFusion; GQL; ClickHouse; Adjacency List; Why Graph Databases Need New Join Algorithms; KuzuDB WASM; RAG == Retrieval Augmented Generation; NetworkX
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary In this episode of the Data Engineering Podcast Andy Warfield talks about the innovative functionalities of S3 Tables and Vectors and their integration into modern data stacks. Andy shares his journey through the tech industry and his role at Amazon, where he collaborates to enhance storage capabilities, discussing the evolution of S3 from a simple storage solution to a sophisticated system supporting advanced data types like tables and vectors crucial for analytics and AI-driven applications. He explains the motivations behind introducing S3 Tables and Vectors, highlighting their role in simplifying data management and enhancing performance for complex workloads, and shares insights into the technical challenges and design considerations involved in developing these features. The conversation explores potential applications of S3 Tables and Vectors in fields like AI, genomics, and media, and discusses future directions for S3's development to further support data-driven innovation.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to 6x while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code; it validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they're turning months-long migration nightmares into week-long success stories.
Your host is Tobias Macey and today I'm interviewing Andy Warfield about S3 Tables and Vectors.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your goals are with the Tables and Vector features of S3?
How did the experience of building S3 Tables inform your work on S3 Vectors?
There are numerous implementations of vector storage and search. How do you view the role of S3 in the context of that ecosystem?
The most directly analogous implementation that I'm aware of is the Lance table format. How would you compare the implementation and capabilities of Lance with what you are building with S3 Vectors?
What opportunity do you see for being able to offer a protocol compatible implementation similar to the Iceberg compatibility that you provide with S3 Tables?
Can you describe the technical implementation of the Vectors functionality in S3?
What are the sources of inspiration that you looked to in designing the service?
Can you describe some of the ways that S3 Vectors might be integrated into a typical AI application?
What are the most interesting, innovative, or unexpected ways that you have seen S3 Tables/Vectors used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on S3 Tables/Vectors?
When is S3 the wrong choice for Iceberg or Vector implementations?
What do you have planned for the future of S3 Tables and Vectors?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
S3 Tables; S3 Vectors; S3 Express; Parquet; Iceberg; Vector Index; Vector Database; pgvector; Embedding Model; Retrieval Augmented Generation; TwelveLabs; Amazon Bedrock; Iceberg REST Catalog; Log-Structured Merge Tree; S3 Metadata; Sentence Transformer; Spark; Trino; Daft
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In this season of the Analytics Engineering podcast, Tristan is deep into the world of developer tools and databases. If you're following us here, you've almost definitely used Amazon S3 and its blob storage siblings. They form the foundation for nearly all data work in the cloud. In many ways, it was the innovations that happened inside of S3 that have unlocked all of the progress in cloud data over the last decade. In this episode, Tristan talks with Andy Warfield, VP and senior principal engineer at AWS, where he focuses primarily on storage. They go deep on S3, how it works, and what it unlocks. They close out talking about Iceberg, S3 table buckets, and what this all suggests about the outlines of the S3 product roadmap moving forward. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

This is a free preview of a paid episode. To hear more, visit dataengineeringcentral.substack.com

Hello! A new episode of the Data Engineering Central Podcast is dropping today, we will be covering a few hot topics!
* Apache Iceberg Catalogs
* new Boring Catalog
* new full Iceberg support from Databricks/Unity Catalog
* Databricks SQL Scripting
* DuckDB coming to a Lake House near you
* Lakebase from Databricks
Going to be a great show, come along for the ride! Thanks …

Summary In this episode of the Data Engineering Podcast Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
This is a pharmaceutical Ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like, so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.
This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
Your host is Tobias Macey and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?
Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
What are the foundational architectural modifications that you had to make to enable those capabilities?
For the vector storage and indexing, what modifications did you have to make to Iceberg?
What was your reasoning for not using a format like Lance?
For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
When is Starburst/lakehouse the wrong choice for a given AI use case?
What do you have planned for the future of AI on Starburst?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
Starburst (Podcast Episode); AWS Athena; MCP == Model Context Protocol; LLM Tool Use; Vector Embeddings; RAG == Retrieval Augmented Generation (AI Engineering Podcast Episode); Starburst Data Products; Lance; LanceDB; Parquet; ORC; pgvector; Starburst Icehouse
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Hello, my fair-weathered friends and readers! I am gone on vacation this week with my family, probably at this moment lying in the sand on a beach (Lord willing the creek don’t rise), not thinking of you all. Anywho, be that as it may, I didn’t want you to miss my pretty face, so here is a video of me ranting about Apache Iceberg, something I’ve had a lot of practice doing and enjoy quite thoroughly. For all you free-loaders out there, you can get 20% off to celebrate Memorial Day. https://dataengineeringcentral.substack.com/Merica

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Summary In this episode of the Data Engineering Podcast Tulika Bhatt, a senior software engineer at Netflix, talks about her experiences with large-scale data processing and the future of data engineering technologies. Tulika shares her journey into the data engineering field, discussing her work at BlackRock and Verizon before joining Netflix, and explains the challenges and innovations involved in managing Netflix's impression data for personalization and user experience. She highlights the importance of balancing off-the-shelf solutions with custom-built systems using technologies like Spark, Flink, and Iceberg, and delves into the complexities of ensuring data quality and observability in high-speed environments, including robust alerting strategies and semantic data auditing.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Tulika Bhatt about her experiences working on large scale data processing and her insights on the future trajectory of the supporting technologies.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the ways that operating at large scale changes the ways that you need to think about the design of data systems?
When dealing with small-scale data systems it can be feasible to have manual processes. What are the elements of large scale data systems that demand automation?
How can those large-scale automation principles be down-scaled to the systems that the rest of the world are operating?
A perennial problem in data engineering is that of data quality. The past 4 years have seen a significant growth in the number of tools and practices available for automating the validation and verification of data. In your experience working with high volume data flows, what are the elements of data validation that are still unsolved?
Generative AI has taken the world by storm over the past couple years. How has that changed the ways that you approach your daily work?
What do you see as the future realities of working with data across various axes of large scale, real-time, etc.?
What are the most interesting, innovative, or unexpected ways that you have seen solutions to large-scale data management designed?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data management across axes of scale?
What are the ways that you are thinking about the future trajectory of your work?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
BlackRock; Spark; Flink; Kafka; Cassandra; RocksDB; Netflix Maestro workflow orchestrator; Pagerduty; Iceberg
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary In this episode of the Data Engineering Podcast Sida Shen, product manager at CelerData, talks about StarRocks, a high-performance analytical database. Sida discusses the inception of StarRocks, which was forked from Apache Doris in 2020 and evolved into a high-performance Lakehouse query engine. He explains the architectural design of StarRocks, highlighting its capabilities in handling high concurrency and low latency queries, and its integration with open table formats like Apache Iceberg, Delta Lake, and Apache Hudi. Sida also discusses how StarRocks differentiates itself from other query engines by supporting on-the-fly joins and eliminating the need for denormalization pipelines, and shares insights into its use cases, such as customer-facing analytics and real-time data processing, as well as future directions for the platform.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Sida Shen about StarRocks, a high performance analytical database supporting shared nothing and shared data patterns.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what StarRocks is and the story behind it?
There are numerous analytical databases on the market. What are the attributes of StarRocks that differentiate it from other options?
Can you describe the architecture of StarRocks?
What are the "-ilities" that are foundational to the design of the system?
How have the design and focus of the project evolved since it was first created?
What are the tradeoffs involved in separating the communication layer from the data layers?
The tiered architecture enables the shared nothing and shared data behaviors, which allows for the implementation of lakehouse patterns. What are some of the patterns that are possible due to the single interface/dual pattern nature of StarRocks?
The shared data implementation has caching built in to accelerate interaction with datasets. What are some of the limitations/edge cases that operators and consumers should be aware of?
StarRocks supports management of lakehouse tables (Iceberg, Delta, Hudi, etc.), which overlaps with use cases for Trino/Presto/Dremio/etc. What are the cases where StarRocks acts as a replacement for those systems vs. a supplement to them?
The other major category of engines that StarRocks overlaps with is OLAP databases (e.g. Clickhouse, Firebolt, etc.). Why might someone use StarRocks in addition to or in place of those technologies?
We would be remiss if we ignored the dominating trend of AI and the systems that support it. What is the role of StarRocks in the context of an AI application?
What are the most interesting, innovative, or unexpected ways that you have seen StarRocks used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on StarRocks?
When is StarRocks the wrong choice?
What do you have planned for the future of StarRocks?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
StarRocks; CelerData; Apache Doris; SIMD == Single Instruction Multiple Data; Apache Iceberg; ClickHouse (Podcast Episode); Druid; Firebolt (Podcast Episode); Snowflake; BigQuery; Trino; Databricks; Dremio; Data Lakehouse; Delta Lake; Apache Hive; C++; Cost-Based Optimizer; Iceberg Summit Tencent Games Presentation; Apache Paimon; Lance (Podcast Episode); Delta Uniform; Apache Arrow; StarRocks Python UDF; Debezium (Podcast Episode)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary In this episode of the Data Engineering Podcast Viktor Kessler, co-founder of Vakmo, talks about the architectural patterns in the lake house enabled by a fast and feature-rich Iceberg catalog. Viktor shares his journey from data warehouses to developing the open-source project, Lakekeeper, an Apache Iceberg REST catalog written in Rust that facilitates building lake houses with essential components like storage, compute, and catalog management. He discusses the importance of metadata in making data actionable, the evolution of data catalogs, and the challenges and innovations in the space, including integration with OpenFGA for fine-grained access control and managing data across formats and compute engines.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Viktor Kessler about architectural patterns in the lakehouse that are unlocked by a fast and feature-rich Iceberg catalog.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what LakeKeeper is and the story behind it? What is the core of the problem that you are addressing?
There has been a lot of activity in the catalog space recently. What are the driving forces that have highlighted the need for a better metadata catalog in the data lake/distributed data ecosystem?
How would you characterize the feature sets/problem spaces that different entrants are focused on addressing?
Iceberg as a table format has gained a lot of attention and adoption across the data ecosystem. The REST catalog format has opened the door for numerous implementations. What are the opportunities for innovation and improving user experience in that space?
What is the role of the catalog in managing security and governance? (AuthZ, auditing, etc.)
What are the channels for propagating identity and permissions to compute engines? (How do you avoid head-scratching about permission denied situations?)
Can you describe how LakeKeeper is implemented?
How have the design and goals of the project changed since you first started working on it?
For someone who has an existing set of Iceberg tables and catalog, what does the migration process look like?
What new workflows or capabilities does LakeKeeper enable for data teams using Iceberg tables across one or more compute frameworks?
What are the most interesting, innovative, or unexpected ways that you have seen LakeKeeper used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on LakeKeeper?
When is LakeKeeper the wrong choice?
What do you have planned for the future of LakeKeeper?
Contact Info: LinkedIn
Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links: LakeKeeper, SAP, Microsoft Access, Microsoft Excel, Apache Iceberg (Podcast Episode), Iceberg REST Catalog, PyIceberg, Spark, Trino, Dremio, Hive Metastore, Hadoop, NATS, Polars, DuckDB (Podcast Episode), DataFusion, Atlan (Podcast Episode), Open Metadata (Podcast Episode), Apache Atlas, OpenFGA, Hudi (Podcast Episode), Delta Lake (Podcast Episode), Lance Table Format (Podcast Episode), Unity Catalog, Polaris Catalog, Apache Gravitino (Podcast Episode), Keycloak, Open Policy Agent (OPA), Apache Ranger, Apache NiFi
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary In this episode of the Data Engineering Podcast Rajan Goyal, CEO and co-founder of Datapelago, talks about improving efficiencies in data processing by reimagining system architecture. Rajan explains the shift from hyperconverged to disaggregated and composable infrastructure, highlighting the importance of accelerated computing in modern data centers. He discusses the evolution from proprietary to open, composable stacks, emphasizing the role of open table formats and the need for a universal data processing engine, and outlines Datapelago's strategy to leverage existing frameworks like Spark and Trino while providing accelerated computing benefits.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Rajan Goyal about how to drastically improve efficiencies in data processing by re-imagining the system architecture.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the main factors that contribute to performance challenges in data lake environments?
The different components of open data processing systems have evolved from different starting points with different objectives. In your experience, how has that unplanned and unsynchronized evolution of the ecosystem hindered the capabilities and adoption of open technologies?
The introduction of a new cross-cutting capability (e.g. Iceberg) has typically taken a substantial amount of time to gain support across different engines and ecosystems. What do you see as the point of highest leverage to improve the capabilities of the entire stack with the least amount of coordination?
What was the motivating insight that led you to invest in the technology that powers Datapelago?
Can you describe the system design of Datapelago and how it integrates with existing data engines?
The growth in the generation and application of unstructured data is a notable shift in the work being done by data teams. What are the areas of overlap in the fundamental nature of data (whether structured, semi-structured, or unstructured) that you are able to exploit to bridge the processing gap?
What are the most interesting, innovative, or unexpected ways that you have seen Datapelago used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datapelago?
When is Datapelago the wrong choice?
What do you have planned for the future of Datapelago?
Contact Info: LinkedIn
Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links: Datapelago, MIPS Architecture, ARM Architecture, AWS Nitro, Mellanox, Nvidia, Von Neumann Architecture, TPU == Tensor Processing Unit, FPGA == Field-Programmable Gate Array, Spark, Trino, Iceberg (Podcast Episode), Delta Lake (Podcast Episode), Hudi (Podcast Episode), Apache Gluten, Intermediate Representation, Turing Completeness, LLVM, Amdahl's Law, LSTM == Long Short-Term Memory
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In this podcast episode, we talked with Adrian Brudaru about the past, present and future of data engineering.

About the speaker: Adrian Brudaru studied economics in Romania but soon got bored with how creative the industry was, and chose to go instead for the more factual side. He ended up in Berlin at the age of 25 and started a role as a business analyst. At the age of 30, he had enough of startups and decided to join a corporation, but quickly found out that it did not provide the challenge he wanted. As going back to startups was not a desirable option either, he decided to postpone his decision by taking freelance work and has never looked back since. Five years later, he co-founded a company in the data space to try new things. This company is also looking to release open source tools to help democratize data engineering.

0:00 Introduction to DataTalks.Club
1:05 Discussing trends in data engineering with Adrian
2:03 Adrian's background and journey into data engineering
5:04 Growth and updates on Adrian's company, DLT Hub
9:05 Challenges and specialization in data engineering today
13:00 Opportunities for data engineers entering the field
15:00 The "Modern Data Stack" and its evolution
17:25 Emerging trends: AI integration and Iceberg technology
27:40 DuckDB and the emergence of portable, cost-effective data stacks
32:14 The rise and impact of dbt in data engineering
34:08 Alternatives to dbt: SQLMesh and others
35:25 Workflow orchestration tools: Airflow, Dagster, Prefect, and GitHub Actions
37:20 Audience questions: Career focus in data roles and AI engineering overlaps
39:00 The role of semantics in data and AI workflows
41:11 Focusing on learning concepts over tools when entering the field
45:15 Transitioning from backend to data engineering: challenges and opportunities
47:48 Current state of the data engineering job market in Europe and beyond
49:05 Introduction to Apache Iceberg, Delta, and Hudi file formats
50:40 Suitability of these formats for batch and streaming workloads
52:29 Tools for streaming: Kafka, SQS, and related trends
58:07 Building AI agents and enabling intelligent data applications
59:09 Closing discussion on the place of tools like dbt in the ecosystem

🔗 CONNECT WITH ADRIAN BRUDARU
LinkedIn - /data-team
Website - https://adrian.brudaru.com/
🔗 CONNECT WITH DataTalksClub
Join the community - https://datatalks.club/slack.html
Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/...
Check other upcoming events - https://lu.ma/dtc-events
LinkedIn - /datatalks-club
Twitter - /datatalksclub
Website - https://datatalks.club/

It’s time for another episode of the Data Engineering Central Podcast. In this episode, we cover …
* AWS Lambda + DuckDB and Delta Lake (Polars, Daft, etc).
* IaC - Long Live Terraform.
* Databricks Data Quality with DQX.
* Unity Catalog releases for DuckDB and Polars.
* Bespoke vs Managed Data Platforms.
* Delta Lake vs. Iceberg and UniForm for a single table.
Thanks for b…

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

A look inside the data work happening at a company making some of the most advanced technologies in the industry. Rahul Jain, data engineering manager at Snowflake, joins Tristan to discuss Iceberg, streaming, and all things Snowflake. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

This morning, a great article came across my feed that gave me PTSD. It asks whether Iceberg is the Hadoop of the Modern Data Stack.

In this rant, I bring the discussion back to a central question you should ask with any hot technology - do you need it at all? Do you need a tool built for the top 1% of companies at a sufficient data scale? Or is a spreadsheet good enough?

Link: https://blog.det.life/apache-iceberg-the-hadoop-of-the-modern-data-stack-c83f63a4ebb9

❤️ If you like my podcasts, please like and rate them on your favorite podcast platform.

🤓 My works:

📕Fundamentals of Data Engineering: https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/

🎥 Deeplearning.ai Data Engineering Certificate: https://www.coursera.org/professional-certificates/data-engineering

🔥Practical Data Modeling: https://practicaldatamodeling.substack.com/

🤓 My SubStack: https://joereis.substack.com/

In this episode, I had the pleasure of speaking with Ken Pickering, VP of Engineering at Going, about the intricacies of streaming data into a Trino and Iceberg lakehouse. Ken shared his journey from product engineering to becoming deeply involved in data-centric roles, highlighting his experiences in ecommerce and InsurTech. At Going, Ken leads the data platform team, focusing on finding travel deals for consumers, a task that involves handling massive volumes of flight data and event stream information.

Ken explained the dual approach of passive and active search strategies used by Going to manage the vast data landscape. Passive search involves aggregating data from global distribution systems, while active search is more transactional, querying specific flight prices. This approach helps Going sift through approximately 50 petabytes of data annually to identify the best travel deals.

We delved into the technical architecture supporting these operations, including the use of Confluent for data streaming, Starburst Galaxy for transformation, and Databricks for modeling. Ken emphasized the importance of an open lakehouse architecture, which allows for flexibility and scalability as the business grows.

Ken also discussed the composition of Going's engineering and data teams, highlighting the collaborative nature of their work and the reliance on vendor tooling to streamline operations. He shared insights into the challenges and strategies of managing data life cycles, ensuring data quality, and maintaining uptime for consumer-facing applications.

Throughout our conversation, Ken provided a glimpse into the future of Going's data architecture, including potential expansions into other travel modes and the integration of large language models for enhanced customer interaction. This episode offers a comprehensive look at the complexities and innovations in building a data-driven travel advisory service.

Summary In this episode of the Data Engineering Podcast, the creators of Feldera talk about their incremental compute engine designed for continuous computation of data, machine learning, and AI workloads. The discussion covers the concept of incremental computation, the origins of Feldera, and its unique ability to handle both streaming and batch data seamlessly. The guests explore Feldera's architecture, applications in real-time machine learning and AI, and challenges in educating users about incremental computation. They also discuss the balance between open-source and enterprise offerings, and the broader implications of incremental computation for the future of data management, predicting a shift towards unified systems that handle both batch and streaming data efficiently.
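For readers new to the idea of incremental computation mentioned in the summary, here is a minimal sketch of the general concept. It is not Feldera's API or the DBSP algorithm; the names `recompute_totals` and `IncrementalTotals` are hypothetical, and the weighted-change representation (+1 for an insert, -1 for a retraction) is an assumption chosen for illustration. The batch function rescans every row, while the incremental class touches only the rows that changed and should arrive at the same per-key totals.

```python
from collections import defaultdict


def recompute_totals(rows):
    """Batch approach: rescan every (key, amount) row to rebuild the totals."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


class IncrementalTotals:
    """Hypothetical helper: keeps per-key totals current by applying only changes."""

    def __init__(self):
        self.sums = defaultdict(int)
        self.counts = defaultdict(int)

    def apply(self, changes):
        # Each change is (key, amount, weight): weight +1 inserts a row,
        # weight -1 retracts a previously inserted row.
        for key, amount, weight in changes:
            self.sums[key] += amount * weight
            self.counts[key] += weight
            if self.counts[key] == 0:
                # No rows remain for this key, so drop it from the result.
                del self.sums[key]
                del self.counts[key]
        return dict(self.sums)


if __name__ == "__main__":
    rows = [("a", 10), ("b", 5), ("a", 3)]
    view = IncrementalTotals()
    print(view.apply([(k, v, +1) for k, v in rows]))
    # Retract one row and insert another without rescanning the rest.
    print(view.apply([("b", 5, -1), ("a", 2, +1)]))
```

The work done by `apply` is proportional to the number of changes, not to the size of the full dataset, which is the property that lets one mechanism serve both batch loads (one large set of inserts) and streaming updates (many small sets).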

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us you should listen to Data Citizens® Dialogues, the forward-thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
Your host is Tobias Macey and today I'm interviewing Leonid Ryzhyk, Lalith Suresh, and Mihai Budiu about Feldera, an incremental compute engine for continuous computation of data, ML, and AI workloads.
Interview
Introduction
Can you describe what Feldera is and the story behind it?
DBSP (the theory behind Feldera) has won multiple awards from the database research community. Can you explain what it is and how it solves the incremental computation problem?
Depending on which angle you look at it, Feldera has attributes of data warehouses, federated query engines, and stream processors. What are the unique use cases that Feldera is designed to address?
In what situations would you replace another technology with Feldera?
When is it an additive technology?
Can you describe the architecture of Feldera?
How have the design and scope evolved since you first started working on it?
What are the state storage interfaces available in Feldera?
What are the opportunities for integrating with or building on top of open table formats like Iceberg, Lance, Hudi, etc.?
Can you describe a typical workflow for an engineer building with Feldera?
You advertise Feldera's utility in ML and AI use cases in addition to data management. What are the features that make it conducive to those applications?
What is your philosophy toward the community growth and engagement with the open source aspects of Feldera and how you're balancing that with sustainability of the project and business?
What are the most interesting, innovative, or unexpected ways that you have seen Feldera used?
What are the most interesting, unexpected, or challenging lessons that

Summary The rapid growth of generative AI applications has prompted a surge of investment in vector databases. While there are numerous engines available now, Lance is designed to integrate with data lake and lakehouse architectures. In this episode Weston Pace explains the inner workings of the Lance format for table definitions and file storage, and the optimizations that they have made to allow for fast random access and efficient schema evolution. In addition to integrating well with data lakes, Lance is also a first-class participant in the Arrow ecosystem, making it easy to use with your existing ML and AI toolchains. This is a fascinating conversation about a technology that is focused on expanding the range of options for working with vector data.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
Your host is Tobias Macey and today I'm interviewing Weston Pace about the Lance file and table format for column-oriented vector storage.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Lance is and the story behind it?
What are the core problems that Lance is designed to solve?
What is explicitly out of scope?
The README mentions that it is straightforward to convert to Lance from Parquet. What is the motivation for this compatibility/conversion support?
What formats does Lance replace or obviate?
In terms of data modeling, Lance obviously adds a vector type. What are the features and constraints that engineers should be aware of when modeling their embeddings or arbitrary vectors?
Are there any practical or hard limitations on vector dimensionality?
When generating Lance files/datasets, what are some considerations to be aware of for balancing file/chunk sizes for I/O efficiency and random access in cloud storage?
I noticed that the file specification has space for feature flags. How has that aided in enabling experimentation in new capabilities and optimizations?
What are some of the engineering and design decisions that were most challenging and/or had the biggest impact on the performance and utility of Lance?
The most obvious interface for reading and writing Lance files is through LanceDB. Can you describe the use cases that it focuses on and its notable features?
What are the other main integrations for Lance?
What are the opportunities or roadblocks in adding support for Lance and vector storage/indexes in e.g. Iceberg or Delta to enable its use in data lake environments?
What are the most interesting, innovative, or unexpected ways that you have seen Lance used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Lance format?
When is Lance the wrong choice?
What do you have planned for the future of Lance?
Contact Info: LinkedIn, GitHub
Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links: Lance Format, LanceDB, Substrait, PyArrow, FAISS, Pinecone (Podcast Episode), Parquet, Iceberg (Podcast Episode), Delta Lake (Podcast Episode), PyLance, Hilbert Curves, SIFT Vectors, S3 Express, Weka, DataFusion, Ray Data, Torch Data Loader, HNSW == Hierarchical Navigable Small Worlds vector index, IVFPQ vector index, GeoJSON, Polars
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Bergh, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges faced by data engineers, such as constant system failures, the need for rapid changes, and high customer demands. Chris delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability. He emphasizes the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open-source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineers.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what DataKitchen is and the story behind it?
You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?
Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?
The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?
What are the challenges that never went away?
You recently open sourced the dataops-testgen and dataops-observability tools. What are the outcomes that you are trying to produce with those projects?
What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?
Can you talk through the technical implementation of your new observability and quality testing platform?
What does the onboarding and integration process look like?
Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?
What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?
What do you have planned for the future of your work at DataKitchen?
Contact Info: LinkedIn
Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links: DataKitchen (Podcast Episode), NASA, DataOps Manifesto, Data Reliability Engineering, Data Observability, dbt, DevOps Enterprise Summit, Building The Data Warehouse by Bill Inmon (affiliate link), dataops-testgen, dataops-observability, Free Data Quality and Data Observability Certification, Databricks, DORA Metrics, DORA for data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA