talk-data.com

Topic: Apache Iceberg

Tags: table_format, data_lake, schema_evolution, file_format, storage, open_table_format

206 tagged activities

Activity Trend: peak of 39 activities/quarter, 2020-Q1 to 2026-Q1

Activities

206 activities · Newest first

Tristan digs deep into the world of Apache Iceberg. There's a lot happening beneath the surface: multiple catalog interfaces, evolving REST specs, and competing implementations across open source, proprietary, and academic contexts. Christian Thiel, co-founder of Lakekeeper, one of the most widely used Iceberg catalogs, joins to walk through the state of the Iceberg ecosystem. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.
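For a flavor of what the REST catalog standardization discussed in this episode buys you, here is a minimal sketch of connecting to a REST catalog such as Lakekeeper from PyIceberg. The endpoint URI and warehouse name are placeholders for illustration, not Lakekeeper's actual defaults.

```python
# Minimal sketch: PyIceberg against an Iceberg REST catalog such as Lakekeeper.
# The URI and warehouse are illustrative placeholders; any catalog that
# implements the REST spec accepts the same client calls.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakekeeper",
    **{
        "type": "rest",
        "uri": "http://localhost:8181/catalog",  # placeholder endpoint
        "warehouse": "my_warehouse",             # placeholder warehouse name
    },
)

# The REST spec standardizes operations like listing namespaces and loading
# tables, which is what lets competing catalog implementations interoperate.
print(catalog.list_namespaces())
```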

Summary: In this episode of the Data Engineering Podcast, Prashanth Rao, an AI engineer at KuzuDB, talks about their embeddable graph database. Prashanth explains how KuzuDB addresses performance shortcomings in existing solutions through columnar storage and novel join algorithms. He discusses the usability and scalability of KuzuDB, emphasizing its open-source nature and potential for various graph applications. The conversation explores the growing interest in graph databases due to their AI and data engineering applications, and Prashanth highlights KuzuDB's potential in edge computing, ephemeral workloads, and integration with other formats like Iceberg and Parquet.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Prashanth Rao about KuzuDB, an embeddable graph database.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what KuzuDB is and the story behind it?
What are the core use cases that Kuzu is focused on addressing? What is explicitly out of scope?
Graph engines have been available and in use for a long time, but generally for more niche use cases. How would you characterize the current state of the graph data ecosystem?
You note scalability as a feature of Kuzu, which is a phrase with many potential interpretations. Typically horizontal scaling of graphs has been complicated; in what sense does Kuzu make that claim?
Can you describe some of the typical architecture and integration patterns of Kuzu?
What are some of the more interesting or esoteric means of architecting with Kuzu?
For cases where Kuzu is rendering a graph across an external data repository (e.g. Iceberg, etc.), what are the patterns for balancing data freshness with network/compute efficiency? (e.g. read and create every time, or persist the Kuzu state)
Can you describe the internal architecture of Kuzu and key design factors?
What are the benefits and tradeoffs of using a columnar store with adjacency lists vs. a more graph-native storage format?
What are the most interesting, innovative, or unexpected ways that you have seen Kuzu used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kuzu?
When is Kuzu the wrong choice?
What do you have planned for the future of Kuzu?

Contact Info
Website
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
KuzuDB
BERT
Transformer Architecture
DuckDB
Podcast Episode
MonetDB
Umbra DB
sqlite
Cypher Query Language
Property Graph
Neo4J
GraphRAG
Context Engineering
Write-Ahead Log
Bauplan
Iceberg
DuckLake
Lance
LanceDB
Arrow
Polars
Arrow DataFusion
GQL
ClickHouse
Adjacency List
Why Graph Databases Need New Join Algorithms
KuzuDB WASM
RAG == Retrieval Augmented Generation
NetworkX

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
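A minimal sketch of KuzuDB's embeddable model as described in the episode: like DuckDB, it is a library plus a local directory rather than a server. The schema and data here are illustrative.

```python
# Minimal sketch: an embedded KuzuDB graph, created and queried in-process.
import kuzu

db = kuzu.Database("./demo_graph")   # on-disk database directory
conn = kuzu.Connection(db)

# Property-graph schema: node and relationship tables.
conn.execute("CREATE NODE TABLE Person(name STRING, PRIMARY KEY(name))")
conn.execute("CREATE REL TABLE Follows(FROM Person TO Person)")
conn.execute("CREATE (:Person {name: 'alice'})")
conn.execute("CREATE (:Person {name: 'bob'})")
conn.execute(
    "MATCH (a:Person), (b:Person) WHERE a.name = 'alice' AND b.name = 'bob' "
    "CREATE (a)-[:Follows]->(b)"
)

# Cypher pattern matching; multi-hop patterns are where the novel join
# algorithms discussed in the episode come into play.
result = conn.execute("MATCH (a:Person)-[:Follows]->(b:Person) RETURN a.name, b.name")
while result.has_next():
    print(result.get_next())
```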

Summary: In this episode of the Data Engineering Podcast, Andy Warfield talks about the innovative functionalities of S3 Tables and Vectors and their integration into modern data stacks. Andy shares his journey through the tech industry and his role at Amazon, where he collaborates to enhance storage capabilities, discussing the evolution of S3 from a simple storage solution to a sophisticated system supporting advanced data types like tables and vectors crucial for analytics and AI-driven applications. He explains the motivations behind introducing S3 Tables and Vectors, highlighting their role in simplifying data management and enhancing performance for complex workloads, and shares insights into the technical challenges and design considerations involved in developing these features. The conversation explores potential applications of S3 Tables and Vectors in fields like AI, genomics, and media, and discusses future directions for S3's development to further support data-driven innovation.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to 6x while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code; it validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they're turning months-long migration nightmares into week-long success stories.
Your host is Tobias Macey and today I'm interviewing Andy Warfield about S3 Tables and Vectors.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your goals are with the Tables and Vector features of S3?
How did the experience of building S3 Tables inform your work on S3 Vectors?
There are numerous implementations of vector storage and search. How do you view the role of S3 in the context of that ecosystem?
The most directly analogous implementation that I'm aware of is the Lance table format. How would you compare the implementation and capabilities of Lance with what you are building with S3 Vectors?
What opportunity do you see for being able to offer a protocol-compatible implementation similar to the Iceberg compatibility that you provide with S3 Tables?
Can you describe the technical implementation of the Vectors functionality in S3?
What are the sources of inspiration that you looked to in designing the service?
Can you describe some of the ways that S3 Vectors might be integrated into a typical AI application?
What are the most interesting, innovative, or unexpected ways that you have seen S3 Tables/Vectors used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on S3 Tables/Vectors?
When is S3 the wrong choice for Iceberg or Vector implementations?
What do you have planned for the future of S3 Tables and Vectors?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
S3 Tables
S3 Vectors
S3 Express
Parquet
Iceberg
Vector Index
Vector Database
pgvector
Embedding Model
Retrieval Augmented Generation
TwelveLabs
Amazon Bedrock
Iceberg REST Catalog
Log-Structured Merge Tree
S3 Metadata
Sentence Transformer
Spark
Trino
Daft

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
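A hedged sketch of the S3 Vectors usage pattern discussed in the episode, via boto3. This assumes the preview "s3vectors" client and its put_vectors/query_vectors operations as documented at the time of writing; the bucket and index names are placeholders, and parameter shapes may change as the preview evolves.

```python
# Hedged sketch: writing and querying embeddings with S3 Vectors via boto3.
# Bucket/index names are hypothetical; the API surface is the preview one.
import boto3

s3v = boto3.client("s3vectors", region_name="us-east-1")

# Write embeddings into a vector index inside a vector bucket.
s3v.put_vectors(
    vectorBucketName="my-vector-bucket",   # placeholder bucket
    indexName="docs",                      # placeholder index
    vectors=[
        {"key": "doc-1", "data": {"float32": [0.1, 0.2, 0.3]}},
        {"key": "doc-2", "data": {"float32": [0.9, 0.8, 0.7]}},
    ],
)

# Nearest-neighbor lookup, e.g. as the retrieval step of a RAG application.
resp = s3v.query_vectors(
    vectorBucketName="my-vector-bucket",
    indexName="docs",
    queryVector={"float32": [0.1, 0.2, 0.25]},
    topK=1,
)
print(resp["vectors"])
```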

Jumpstart Snowflake: A Step-by-Step Guide to Modern Cloud Analytics

This book is your guide to the modern market of data analytics platforms and the benefits of using Snowflake, the data warehouse built for the cloud. As organizations increasingly rely on modern cloud data platforms, the core of any analytics framework—the data warehouse—is more important than ever. This updated 2nd edition ensures you are ready to make the most of the industry’s leading data warehouse. This book will onboard you to Snowflake and present best practices for deploying and using the Snowflake data warehouse. The book also covers modern analytics architecture, integration with leading analytics software such as Matillion ETL, Tableau, and Databricks, and migration scenarios for on-premises legacy data warehouses. This new edition includes expanded coverage of SnowPark for developing complex data applications, an introduction to managing large datasets with Apache Iceberg tables, and instructions for creating interactive data applications using Streamlit, ensuring readers are equipped with the latest advancements in Snowflake's capabilities.

What You Will Learn
Master key functionalities of Snowflake
Set up security and access with clusters
Bulk load data into Snowflake using the COPY command
Migrate from a legacy data warehouse to Snowflake
Integrate the Snowflake data platform with modern business intelligence (BI) and data integration tools
Manage large datasets with Apache Iceberg Tables
Implement continuous data loading with Snowpipe and Dynamic Tables

Who This Book Is For
Data professionals, business analysts, IT administrators, and existing or potential Snowflake users
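As a taste of the bulk-loading topic the book covers, here is a minimal sketch using the snowflake-connector-python package; the connection parameters, stage, and table names are placeholders, and the exact COPY options depend on your file format.

```python
# Minimal sketch: bulk load staged files into a Snowflake table with COPY INTO.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",    # placeholder credentials
    user="my_user",
    password="...",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()

# COPY loads staged files into the target table in parallel; here from a
# named internal stage containing CSV files with a header row.
cur.execute("""
    COPY INTO raw_events
    FROM @my_stage/events/
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")
print(cur.fetchall())  # per-file load results
conn.close()
```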

In this season of the Analytics Engineering podcast, Tristan is deep into the world of developer tools and databases. If you're following us here, you've almost definitely used Amazon S3 and its Blob Storage siblings. They form the foundation for nearly all data work in the cloud. In many ways, it was the innovations that happened inside of S3 that have unlocked all of the progress in cloud data over the last decade. In this episode, Tristan talks with Andy Warfield, VP and senior principal engineer at AWS, where he focuses primarily on storage. They go deep on S3, how it works, and what it unlocks. They close out talking about Iceberg, S3 table buckets, and what this all suggests about the outlines of the S3 product roadmap moving forward. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

This is a free preview of a paid episode. To hear more, visit dataengineeringcentral.substack.com

Hello! A new episode of the Data Engineering Central Podcast is dropping today, and we will be covering a few hot topics:
* Apache Iceberg Catalogs
* the new Boring Catalog
* new full Iceberg support from Databricks/Unity Catalog
* Databricks SQL Scripting
* DuckDB coming to a Lake House near you
* Lakebase from Databricks
It's going to be a great show, so come along for the ride! Thanks …

Everyone makes streaming sound simple – until you try bolting it onto your batch pipeline and it blows up. This talk skips the marketing gloss and gets into the real work: how to make batch and streaming actually play nice. I’ll walk through the essentials, then get into the messy parts – compaction, primary key updates, exactly-once delivery, and keeping your compute bill from spiraling. You’ll learn how to plug RisingWave into your existing stack and get real-time results without rewriting everything. It’s based on what we’ve seen in production – real problems, real fixes, no buzzwords.
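A minimal sketch of the "plug RisingWave into your existing stack" idea: RisingWave speaks the Postgres wire protocol, so a stock driver works against it. The host/port, topic, and broker addresses below are placeholders.

```python
# Minimal sketch: declare a Kafka topic as a RisingWave source and keep an
# aggregate incrementally up to date, instead of re-running batch jobs.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True
cur = conn.cursor()

# Declare the Kafka topic as a streaming source...
cur.execute("""
    CREATE SOURCE IF NOT EXISTS clicks (user_id INT, url VARCHAR, ts TIMESTAMP)
    WITH (
        connector = 'kafka',
        topic = 'clicks',
        properties.bootstrap.server = 'localhost:9092'
    ) FORMAT PLAIN ENCODE JSON
""")

# ...and maintain the aggregate continuously as new events arrive.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS clicks_per_user AS
    SELECT user_id, COUNT(*) AS n FROM clicks GROUP BY user_id
""")

cur.execute("SELECT * FROM clicks_per_user ORDER BY n DESC LIMIT 10")
print(cur.fetchall())
```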

Open table formats empower companies to bring compute to their data while maintaining a single copy of data. At Capital One, we use Iceberg Tables to optimize ingestion and storage costs in Snowflake without compromising data integrity or performance. This enables us to expand how we leverage Snowflake's capabilities for diverse workloads. In this session, we'll share how Iceberg helps us democratize compute choices along with key learnings from migrating datasets from native Snowflake tables.

Databricks + Apache Iceberg™: Managed and Foreign Tables in Unity Catalog

Unity Catalog support for Apache Iceberg™ brings open, interoperable table formats to the heart of the Databricks Lakehouse. In this session, we’ll introduce new capabilities that allow you to write Iceberg tables from any REST-compatible engine, apply fine-grained governance across all data, and unify access to external Iceberg catalogs like AWS Glue, Hive Metastore, and Snowflake Horizon. Learn how Databricks is eliminating data silos, simplifying performance with Predictive Optimization, and advancing a truly open lakehouse architecture with Delta and Iceberg side by side.
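For a sense of what "write Iceberg tables from any REST-compatible engine" looks like in practice, here is a hedged sketch using PyIceberg as the external engine. The /api/2.1/unity-catalog/iceberg path and token auth reflect Databricks documentation at the time of writing, but treat them, and the workspace URL and table name, as assumptions.

```python
# Hedged sketch: appending to a Unity Catalog-managed Iceberg table from an
# external engine via the Iceberg REST API. URL, token, and names are placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "unity",
    **{
        "type": "rest",
        # Assumed Unity Catalog Iceberg REST endpoint; verify against current docs.
        "uri": "https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg",
        "token": "dapi...",  # placeholder personal access token
    },
)

table = catalog.load_table("my_catalog.my_schema.events")
table.append(pa.table({"id": [1, 2], "msg": ["a", "b"]}))
```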

Iceberg Geo Type: Transforming Geospatial Data Management at Scale

The Apache Iceberg™ community is introducing native geospatial type support, addressing key challenges in managing geospatial data at scale, including fragmented formats and inefficiencies in storing large spatial datasets. This talk will delve into the origins of the Iceberg geo type, its specification design, and its future goals. We will examine the impact on both the geospatial and Iceberg communities: introducing a standard data warehouse storage layer to the geospatial community, and enabling optimized geospatial analytics for Iceberg users. We will also present a live demonstration of the Iceberg geo data type with Apache Sedona™ and Apache Spark™, showcasing how it simplifies and accelerates geospatial analytics workflows and queries. Finally, we will provide an in-depth look at its current capabilities, outline the roadmap for future developments, and offer a perspective on its role in advancing geospatial data management in the industry.
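A hedged sketch of the kind of workflow the geo type enables, using Apache Sedona's SQL functions on Spark. Querying a native geometry column stored in an Iceberg table assumes an engine build with Iceberg v3 geo support; the catalog, table, and column names are illustrative.

```python
# Hedged sketch: a spatial predicate over a geometry column with Apache Sedona.
from sedona.spark import SedonaContext

config = SedonaContext.builder().appName("geo-demo").getOrCreate()
sedona = SedonaContext.create(config)

# With a native geo type, spatial predicates run directly over table columns
# instead of round-tripping through WKT strings or ad hoc Parquet conventions.
sedona.sql("""
    SELECT id, geom
    FROM my_catalog.db.places
    WHERE ST_Contains(
        ST_GeomFromWKT('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'),
        geom
    )
""").show()
```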

lightning_talk
by Robert Pack (Databricks), Denny Lee (Databricks), Tyler Croy (Scribd, Inc.)

Join us for an in-depth Ask Me Anything (AMA) on how Rust is revolutionizing Lakehouse formats like Delta Lake and Apache Iceberg through projects like delta-rs and iceberg-rs! Discover how Rust’s memory safety, zero-cost abstractions and fearless concurrency unlock faster development and higher-performance data operations. Whether you’re a data engineer, Rustacean or Lakehouse enthusiast, bring your questions on how Rust is shaping the future of open table formats!
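To make the delta-rs side concrete, here is a minimal sketch through its Python bindings (the deltalake package): the Rust engine handles the transaction log, with no JVM or Spark required. The path is a placeholder.

```python
# Minimal sketch: write and read a Delta table with delta-rs (Python bindings).
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

data = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
write_deltalake("./events_delta", data, mode="append")

dt = DeltaTable("./events_delta")
print(dt.version())            # current transaction-log version
print(dt.to_pyarrow_table())   # read back via Arrow
```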

Sponsored by: Redpanda | IoT for Fun & Prophet: Scaling IoT and predicting the future with Redpanda, Iceberg & Prophet

In this talk, we’ll walk through a complete real-time IoT architecture—from an economical, high-powered ESP32 microcontroller publishing environmental sensor data to AWS IoT, through Redpanda Connect into a Redpanda BYOC cluster, and finally into Apache Iceberg for long-term analytical storage. Once the data lands, we’ll query it using Python and perform linear regression with Prophet to forecast future trends. Along the way, we’ll explore the design of a scalable, cloud-native pipeline for streaming IoT data. Whether you're tracking the weather or building the future, this session will help you architect with confidence—and maybe even predict it.
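A minimal sketch of the forecasting step described above: once sensor readings land in Iceberg and are pulled into a DataFrame, Prophet fits a trend model and projects it forward. The data here is a fabricated stand-in; ds/y are Prophet's required column names.

```python
# Minimal sketch: forecasting sensor readings with Prophet.
import pandas as pd
from prophet import Prophet

# Illustrative stand-in for data queried out of the Iceberg table.
df = pd.DataFrame({
    "ds": pd.date_range("2025-01-01", periods=96, freq="h"),
    "y": [20 + (i % 24) * 0.3 for i in range(96)],  # fake temperature readings
})

m = Prophet()
m.fit(df)

# Project 24 hours ahead and inspect the forecast with uncertainty bounds.
future = m.make_future_dataframe(periods=24, freq="h")
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```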

The Future of Open Table Formats: Delta Lake, Iceberg, and More

Open table formats are evolving quickly. In this session, we’ll explore the latest features of Delta Lake and Apache Iceberg™, including a look at the emerging Iceberg v3 specification. Join us to learn about what’s driving format innovation, how interoperability is becoming real, and what it means for the future of data architecture.

Incremental Iceberg Table Replication at Scale

Apache Iceberg is a popular table format for managing large analytical datasets. But replicating Iceberg tables at scale can be a daunting task — especially when dealing with its hierarchical metadata. In this talk, we present an end-to-end workflow for replicating Apache Iceberg tables, leveraging Apache Spark to ensure that backup tables remain identical to their source counterparts. More excitingly, we have contributed these libraries back to the open-source community. Attendees will gain a comprehensive understanding of how to set up replication workflows for Iceberg tables, as well as practical guidance on how to manage and maintain replicated datasets at scale. This talk is ideal for data engineers, platform architects and practitioners looking to apply replication and disaster recovery for Apache Iceberg in complex data ecosystems.
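A hedged sketch of the incremental pattern the talk describes (not the exact open-source library): read only the slice between two source snapshots and append it to the replica. It assumes an existing SparkSession `spark` with both catalogs configured, an append-only table, and a stored bookmark of the last snapshot already replicated; all names are placeholders.

```python
# Hedged sketch: incremental Iceberg replication between two snapshots in Spark.

# Bookmark of the last source snapshot applied to the replica (persisted by
# the replication job, e.g. in table properties or a control table).
last_applied = 4358109269898191391  # placeholder snapshot id

# Latest snapshot on the source, from Iceberg's snapshots metadata table.
head = spark.sql(
    "SELECT snapshot_id FROM source.db.events.snapshots "
    "ORDER BY committed_at DESC LIMIT 1"
).first()[0]

# Incremental read between the two snapshots via Iceberg's Spark read options.
delta = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", last_applied)   # exclusive
    .option("end-snapshot-id", head)             # inclusive
    .load("source.db.events")
)
delta.writeTo("replica.db.events").append()
```

Real deployments also have to handle deletes, schema evolution, and failure recovery, which is where the hierarchical-metadata complexity mentioned above comes in.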

Sponsored by: Google Cloud | Powering AI & Analytics: Innovations in Google Cloud Storage for Data Lakes

Enterprise customers need a powerful and adaptable data foundation to navigate the demands of AI and multi-cloud environments. This session dives into how Google Cloud Storage serves as a unified platform for modern analytics data lakes, together with Databricks. Discover how Google Cloud Storage provides key innovations like performance optimizations for Apache Iceberg, Anywhere Cache as the easiest way to colocate storage and compute, Rapid Storage for ultra low latency object reads and appends, and Storage Intelligence for vital data insights and recommendations. Learn how you can optimize your infrastructure to unlock the full value of your data for AI-driven success.