AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS

2025-12-06 · AWS re:Invent 2024 Watch

video

Agile/Scrum Athena AWS Amazon EMR AWS Glue Cloud Computing Data Lakehouse ETL/ELT Redshift S3 Amazon SageMaker Spark

Discover advanced strategies for implementing Apache Iceberg on AWS, focusing on Amazon S3 Tables and integration of the Iceberg REST Catalog with the lakehouse in Amazon SageMaker. We'll cover performance optimization techniques for Amazon Athena and Amazon Redshift queries, real-time processing using Apache Spark, and integration with Amazon EMR, AWS Glue, and Trino. Explore practical implementations of zero-ETL, change data capture (CDC) patterns, and medallion architecture. Gain hands-on expertise in implementing enterprise-grade lakehouse solutions with Iceberg on AWS.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Accelerate analytics and AI w/ an open and secure lakehouse architecture-ANT309

2025-12-05 · AWS re:Invent 2024 Watch

video

Agile/Scrum AI/ML Analytics AWS Cloud Computing Data Lakehouse LLM Amazon SageMaker Cyber Security

Data lakes, data warehouses, or both? Join this session to explore how to build a unified, open, and secure data lakehouse architecture, fully compatible with Apache Iceberg, in Amazon SageMaker. Learn how the lakehouse breaks down data silos and opens your data estate, offering the flexibility to use your preferred query engines and tools that accelerate time to insights. Learn about recent launches that improve data interoperability and performance, and enable large language models (LLMs) and AI agents to interact with your data. Discover robust security features, including consistent fine-grained access controls, attribute-based access control, and tag-based access control that help democratize data without compromises.

AWS re:Invent 2025 - Using graphs over your data lake to power generative AI applications (DAT447)

2025-12-03 · AWS re:Invent 2024 Watch

video

Agile/Scrum AI/ML Analytics AWS Cloud Computing Data Lake GenAI Parquet S3

In this session, learn about new Amazon Neptune capabilities for high-performance graph analytics and queries over data lakes to unlock the implicit and explicit relationships in your data, driving more accurate, trustworthy generative AI responses. We'll demonstrate building knowledge graphs from structured and unstructured data, combining graph algorithms (PageRank, Louvain clustering, path optimization) with semantic search, and executing Cypher queries on Parquet and Iceberg formats in Amazon S3. Through code samples and benchmarks, learn advanced architectures to use Neptune for multi-hop reasoning, entity linking, and context enrichment at scale. This session assumes familiarity with graph concepts and data lake architectures.

Best practice for leveraging Amazon Analytic Services + dbt

2025-10-15 · dbt Coalesce 2025 Watch

talk

by Noritaka Sekiyama (Amazon Web Services (AWS)), Venkatesh Aravamudan (Amazon Web Services (AWS)), Neela Kulkarni (AWS)

Analytics Analytics Engineering Athena AWS AWS Glue Data Lakehouse dbt Redshift S3

As organizations increasingly adopt modern data stacks, the combination of dbt and AWS Analytics services has emerged as a powerful pairing for analytics engineering at scale. This session will explore proven strategies and hard-learned lessons for optimizing this technology stack, using dbt-athena, dbt-redshift, and dbt-glue to deliver reliable, performant data transformations. We will also cover case studies, best practices, and modern lakehouse scenarios with Apache Iceberg and Amazon S3 Tables.

Mamma mia! My data’s in the Iceberg

2025-10-15 · dbt Coalesce 2025 Watch

talk

by Jeremy Cohen (dbt Labs)

dbt

Iceberg is an open storage format for large analytical datasets that is now interoperable with most modern data platforms. But the setup is complicated, and caveats abound. Jeremy Cohen will tour the archipelago of Iceberg integrations — across data warehouses, catalogs, and dbt — and demonstrate the promise of cross platform dbt Mesh to provide flexibility and collaboration for data teams. The more the merrier.

Below the tip of the Iceberg: How Wikimedia reduced reporting latency 10x using dbt and Iceberg

2025-10-15 · dbt Coalesce 2025 Watch

talk

by Avishua Stein (Wikimedia Foundation (Wikipedia)), Joseph Mando (Wikimedia Foundation)

Data Lake dbt

Learn how the Wikimedia Foundation implemented an on-prem, open source data lake to fund Wikipedia and the future of open knowledge. We'll discuss data architecture including challenges integrating open source tools, learnings from our implementation, how we achieved a 10x decrease in query run times, and more.

What’s new in the dbt language across Core and Fusion

2025-10-15 · dbt Coalesce 2025 Watch

talk

by Grace Goheen (dbt Labs)

dbt

The dbt language is growing to support new workflows across both dbt Core and the dbt Fusion engine. In this session, we’ll walk through the latest updates to dbt, from sample mode to Iceberg catalogs to UDFs, showing how they work across different engines. You’ll also learn how to track the roadmap, contribute to development, and stay connected to the future of dbt.

Unleash the power of dbt on Google Cloud: BigQuery, Iceberg, DataFrames and beyond

2025-10-15 · dbt Coalesce 2025 Watch

talk

by Jobin George (Google Cloud), Sandeep Karmarkar (Google Cloud)

AI/ML BigQuery Cloud Computing Data Science dbt GCP Python

The data world has long been divided, with data engineers and data scientists working in silos. This fragmentation creates a long, difficult journey from raw data to machine learning models. We've unified these worlds through the Google Cloud and dbt partnership. In this session, we'll show you an end-to-end workflow that simplifies the data-to-AI journey. The availability of dbt Cloud on Google Cloud Marketplace streamlines getting started, and its integration with BigQuery's new Apache Iceberg tables creates an open foundation. We'll also highlight how BigQuery DataFrames' integration with dbt Python models lets you perform complex data science at scale, all within a single, streamlined process. Join us to learn how to build a unified data and AI platform with dbt on Google Cloud.

Quiet on Set: Building an On-Air Sign with Open Source Technologies

2025-09-25 · PyData Amsterdam 2025 Watch

talk

by Danica Fine (Snowflake)

Flink Kafka

Using a Raspberry Pi and a powerful trio of open-source technologies—Apache Kafka, Apache Flink, and Apache Iceberg—learn how to build a custom on-air sign to signal when you're on a call and discover how this same scaffolding can be scaled for millions of users.

Databricks + Apache Iceberg™: Managed and Foreign Tables in Unity Catalog

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Jonathan Brito (Databricks)

AWS AWS Glue Data Lakehouse Databricks Delta Hive Snowflake

Unity Catalog support for Apache Iceberg™ brings open, interoperable table formats to the heart of the Databricks Lakehouse. In this session, we’ll introduce new capabilities that allow you to write Iceberg tables from any REST-compatible engine, apply fine-grained governance across all data, and unify access to external Iceberg catalogs like AWS Glue, Hive Metastore, and Snowflake Horizon. Learn how Databricks is eliminating data silos, simplifying performance with Predictive Optimization, and advancing a truly open lakehouse architecture with Delta and Iceberg side by side.

Iceberg Geo Type: Transforming Geospatial Data Management at Scale

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Szehon Ho (Databricks), Jia Yu (Wherobots Inc.)

Analytics Data Management DWH Spark

The Apache Iceberg™ community is introducing native geospatial type support, addressing key challenges in managing geospatial data at scale, including fragmented formats and inefficiencies in storing large spatial datasets. This talk will delve into the origins of the Iceberg geo type, its specification design, and its future goals. We will examine its impact on both communities: it introduces a standard data warehouse storage layer to the geospatial community and enables optimized geospatial analytics for Iceberg users. We will also present a live demonstration of the Iceberg geo data type with Apache Sedona™ and Apache Spark™, showcasing how it simplifies and accelerates geospatial analytics workflows and queries. Finally, we will provide an in-depth look at its current capabilities, outline the roadmap for future developments, and offer a perspective on its role in advancing geospatial data management in the industry.

Sponsored by: Redpanda | IoT for Fun & Prophet: Scaling IoT and predicting the future with Redpanda, Iceberg & Prophet

The Future of Open Table Formats: Delta Lake, Iceberg, and More

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Daniel Weeks (Databricks), Ryan Blue (Tabular)

Delta

Open table formats are evolving quickly. In this session, we’ll explore the latest features of Delta Lake and Apache Iceberg™, including a look at the emerging Iceberg v3 specification. Join us to learn about what’s driving format innovation, how interoperability is becoming real, and what it means for the future of data architecture.

Incremental Iceberg Table Replication at Scale

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Hongyue Hongyue (Self-Employed), Szehon Ho (Databricks)

Spark

Apache Iceberg is a popular table format for managing large analytical datasets. But replicating Iceberg tables at scale can be a daunting task, especially when dealing with its hierarchical metadata. In this talk, we present an end-to-end workflow for replicating Apache Iceberg tables, leveraging Apache Spark to ensure that backup tables remain identical to their source counterparts. More excitingly, we have contributed these libraries back to the open-source community. Attendees will gain a comprehensive understanding of how to set up replication workflows for Iceberg tables, as well as practical guidance on how to manage and maintain replicated datasets at scale. This talk is ideal for data engineers, platform architects, and practitioners looking to apply replication and disaster recovery for Apache Iceberg in complex data ecosystems.

Sponsored by: Google Cloud | Powering AI & Analytics: Innovations in Google Cloud Storage for Data Lakes

Extending the Lakehouse: Power Interoperable Compute With Unity Catalog Open APIs

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Tathagata Das (Databricks), Michelle Leon (Databricks)

Flink API Data Lakehouse DuckDB Cyber Security Spark Trino

The lakehouse is built for storage flexibility, but what about compute? In this session, we’ll explore how Unity Catalog enables you to connect and govern multiple compute engines across your data ecosystem. With open APIs and support for the Iceberg REST Catalog, UC lets you extend access to engines like Trino, DuckDB, and Flink while maintaining centralized security, lineage, and interoperability. We will show how you can get started today working with engines like Apache Spark and Starburst to read and write to UC managed tables with some exciting demos. Learn how to bring flexibility to your compute layer—without compromising control.

Iceberg Table Format Adoption and Unified Metadata Catalog Implementation in Lakehouse Platform

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Ruotian Wang (DoorDash), Sergey Zavgorodni (DoorDash)

Amazon EMR Data Lake Data Lakehouse Databricks DWH Snowflake Trino

The DoorDash Data organization is actively adopting the lakehouse paradigm. This presentation describes a methodology for migrating classic data warehouse and data lake platforms to a unified lakehouse solution. The objectives of this effort include: elimination of excessive data movement; seamless integration and consolidation of the query engine layers, including Snowflake, Databricks, EMR, and Trino; query performance optimization; abstracting away the complexity of underlying storage layers and table formats; and a strategic, justified decision on the unified metadata catalog used across various compute platforms.

No More Fragile Pipelines: Kafka and Iceberg the Declarative Way

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Adi Polak (Treeverse)

Analytics Kafka Parquet Data Streaming

Moving data between operational systems and analytics platforms is often painful. Traditional pipelines become complex, brittle, and expensive to maintain. Take Kafka and Iceberg: batching on Kafka causes ingestion bottlenecks, while streaming-style writes to Iceberg create too many small Parquet files, cluttering metadata, degrading queries, and increasing maintenance overhead. Frequent updates further strain background table operations, causing retries, even before dealing with schema evolution. But much of this complexity is avoidable. What if Kafka Topics and Iceberg Tables were treated as two sides of the same coin? By establishing a transparent equivalence, we can rethink pipeline design entirely. This session introduces Tableflow, a new approach to bridging streaming and table-based systems. It shifts complexity away from pipelines and into a unified layer, enabling simpler, declarative workflows. We’ll cover schema evolution, compaction, topic-to-table mapping, and how to continuously materialize and optimize thousands of topics as Iceberg tables. Whether modernizing or starting fresh, you’ll leave with practical insights for building resilient, scalable, and future-proof data architectures.
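The small-file problem described in this abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the talk: it assumes every commit writes one data file per touched partition, and it uses a toy greedy first-fit-decreasing planner to group small files into rewrite groups. The function names, the 128 MB target, and the file sizes are all hypothetical, and the planner is not Iceberg's or Tableflow's actual compaction logic.

```python
import math


def files_per_day(commit_interval_s: float, partitions_per_commit: int) -> int:
    """Data files written per day, assuming (hypothetically) that each
    commit writes exactly one file per touched partition."""
    commits = math.floor(86_400 / commit_interval_s)
    return commits * partitions_per_commit


def plan_binpack(file_sizes_mb: list[float], target_mb: float) -> list[list[float]]:
    """Toy compaction planner: greedy first-fit-decreasing grouping of files
    so that each group's total size stays at or below target_mb.
    A file larger than target_mb ends up alone in its own group."""
    groups: list[list[float]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
    return groups


if __name__ == "__main__":
    # Committing every 10 seconds vs. every 10 minutes, 4 partitions per commit.
    print(files_per_day(10, 4), files_per_day(600, 4))
    print(plan_binpack([100, 60, 40, 30, 20], target_mb=128))
```

Under these assumptions, shortening the commit interval from ten minutes to ten seconds multiplies the daily file count by 60, which is the pressure that background compaction has to absorb.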

Master Schema Translations in the Era of Open Data Lake

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by Eric Sun (Coinbase)

Data Lake Databricks Delta DynamoDB ETL/ELT Kafka MongoDB Snowflake postgresql

Unity Catalog puts a variety of schemas into a centralized repository. Now the developer community wants more productivity and automation for schema inference, translation, evolution, and optimization, especially for ingestion and reverse-ETL scenarios with more code generation. Coinbase Data Platform attempts to pave a path with "Schemaster", which interacts with the data catalog through a (proposed) metadata model to make schema translation and evolution more manageable across some of the popular systems, such as Delta, Iceberg, Snowflake, Kafka, MongoDB, DynamoDB, Postgres, and more. This Lightning Talk covers four areas: the complexity and caveats of schema differences among these systems; the proposed field-level metadata model and two translation patterns, point-to-point vs. hub-and-spoke; why data profiling should be augmented to enhance schema understanding and translation; and how to integrate it with ingestion and reverse-ETL in a Databricks-oriented ecosystem. Takeaway: standardize schema lineage and translation.
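The difference between the two translation patterns named in this abstract can be sketched with a small count. The example below is an illustration under stated assumptions, not Schemaster's design: it assumes one translator per ordered pair of systems for point-to-point, and one translator into and one out of a shared intermediate model for hub-and-spoke. The hub type names (`int64`, `string`) and the tiny type maps are hypothetical; only the native type names (`bigint` and `text` in Postgres, `long` and `string` in Iceberg) come from those systems.

```python
# The seven systems named in the abstract.
SYSTEMS = ["Delta", "Iceberg", "Snowflake", "Kafka", "MongoDB", "DynamoDB", "Postgres"]


def point_to_point_translators(n: int) -> int:
    """One translator per ordered (source, target) pair of distinct systems."""
    return n * (n - 1)


def hub_and_spoke_translators(n: int) -> int:
    """One translator into the hub model and one out of it, per system."""
    return 2 * n


# Hypothetical hub model: native type name -> hub type name, per system.
TO_HUB: dict[str, dict[str, str]] = {
    "postgres": {"bigint": "int64", "text": "string"},
    "iceberg": {"long": "int64", "string": "string"},
}

# Inverse maps: hub type name -> native type name, per system.
FROM_HUB: dict[str, dict[str, str]] = {
    system: {hub: native for native, hub in mapping.items()}
    for system, mapping in TO_HUB.items()
}


def translate(source: str, target: str, type_name: str) -> str:
    """Translate a type name from source to target by going through the hub."""
    hub_type = TO_HUB[source][type_name]
    return FROM_HUB[target][hub_type]


if __name__ == "__main__":
    n = len(SYSTEMS)
    print(point_to_point_translators(n), hub_and_spoke_translators(n))
    print(translate("postgres", "iceberg", "bigint"))
```

With seven systems, the pairwise approach needs 42 directed translators under this counting, while the hub approach needs 14, and adding an eighth system adds only two more.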

Cross-Cloud Data Mesh with Delta Sharing and UniForm in Mercedes-Benz

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Aleksandar Dragojevic (Databricks), Alexander Summa (Mercedes-Benz Group AG)

AWS AWS Glue Azure Azure DevOps Cloud Computing Delta DevOps

In this presentation, we'll show how we achieved a unified development experience for teams working on Mercedes-Benz Data Platforms in AWS and Azure. We will demonstrate how we implemented Azure to AWS and AWS to Azure data product sharing (using Delta Sharing and Cloud Tokens), integration with AWS Glue Iceberg tables through UniForm and automation to drive everything using Azure DevOps Pipelines and DABs. We will also show how to monitor and track cloud egress costs and how we present a consolidated view of all the data products and relevant cost information. The end goal is to show how customers can offer the same user experience to their engineers and not have to worry about which cloud or region the Data Product lives in. Instead, they can enroll in the data product through self-service and have it available to them in minutes, regardless of where it originates.

talk-data.com

Iceberg
