talk-data.com talk-data.com

Topic

Iceberg

Apache Iceberg

table_format data_lake schema_evolution file_format storage open_table_format

48

tagged

Activity Trend

39 peak/qtr
2020-Q1 2026-Q1

Activities

48 activities · Newest first

AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS

Discover advanced strategies for implementing Apache Iceberg on AWS, focusing on Amazon S3 Tables and integration of Iceberg Rest Catalog with the lakehouse in Amazon SageMaker. We'll cover performance optimization techniques for Amazon Athena and Amazon Redshift queries, real-time processing using Apache Spark, and integration with Amazon EMR, AWS Glue, and Trino. Explore practical implementations of zero-ETL, change data capture (CDC) patterns, and medallion architecture. Gain hands-on expertise in implementing enterprise-grade lakehouse solutions with Iceberg on AWS.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Accelerate analytics and AI w/ an open and secure lakehouse architecture-ANT309

Data lakes, data warehouses, or both? Join this session to explore how to build a unified, open, and secure data lakehouse architecture, fully compatible with Apache Iceberg, in Amazon SageMaker. Learn how the lakehouse breaks down data silos and opens your data estate offering flexibility to use your preferred query engines and tools that accelerate time to insights. Learn about recent launches that improve data interoperability and performance, and enable large language models (LLMs) and AI agents to interact with your data. Discover robust security features, including consistent fine-grained access controls, attribute-based access control, and tag-based access control that help democratize data without compromises.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Using graphs over your data lake to power generative AI applications (DAT447)

In this session, learn about new Amazon Neptune capabilities for high-performance graph analytics and queries over data lakes to unlock the implicit and explicit relationships in your data, driving more accurate, trustworthy generative AI responses. We'll demonstrate building knowledge graphs from structured and unstructured data, combining graph algorithms (PageRank, Louvain clustering, path optimization) with semantic search, and executing Cypher queries on Parquet and Iceberg formats in Amazon S3. Through code samples and benchmarks, learn advanced architectures to use Neptune for multi-hop reasoning, entity linking, and context enrichment at scale. This session assumes familiarity with graph concepts and data lake architectures.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

Best practice for leveraging Amazon Analytic Services + dbt

As organizations increasingly adopt modern data stacks, the combination of dbt and AWS Analytics services emerged as a powerful pairing for analytics engineering at scale. This session will explore proven strategies and hard-learned lessons for optimizing this technology stack to use dbt-athena, dbt-redshift, and dbt-glue to deliver reliable, performant data transformations. We will also cover case studies, best practices, and modern lakehouse scenarios with Apache Iceberg and Amazon S3 Tables.

Mamma mia! My data’s in the Iceberg

Iceberg is an open storage format for large analytical datasets that is now interoperable with most modern data platforms. But the setup is complicated, and caveats abound. Jeremy Cohen will tour the archipelago of Iceberg integrations — across data warehouses, catalogs, and dbt — and demonstrate the promise of cross platform dbt Mesh to provide flexibility and collaboration for data teams. The more the merrier.

Below the tip of the Iceberg: How Wikimedia reduced reporting latency 10x using dbt and Iceberg

Learn how the Wikimedia Foundation implemented an on-prem, open source data lake to fund Wikipedia and the future of open knowledge. We'll discuss data architecture including challenges integrating open source tools, learnings from our implementation, how we achieved a 10x decrease in query run times, and more.

What’s new in the dbt language across Core and Fusion

The dbt language is growing to support new workflows across both dbt Core and the dbt Fusion engine. In this session, we’ll walk through the latest updates to dbt—from sample mode to iceberg catalogs to UDFs—showing how they work across different engines. You’ll also learn how to track the roadmap, contribute to development, and stay connected to the future of dbt.

Unleash the power of dbt on Google Cloud: BigQuery, Iceberg, DataFrames and beyond

The data world has long been divided, with data engineers and data scientists working in silos. This fragmentation creates a long, difficult journey from raw data to machine learning models. We've unified these worlds through the Google Cloud and dbt partnership. In this session, we'll show you an end-to-end workflow that simplifies data to AI journey. The availability of dbt Cloud on Google Cloud Marketplace streamlines getting started, and its integration with BigQuery's new Apache Iceberg tables creates an open foundation. We'll also highlight how BigQuery DataFrames' integration with dbt Python models lets you perform complex data science at scale, all within a single, streamlined process. Join us to learn how to build a unified data and AI platform with dbt on Google Cloud.

Databricks + Apache Iceberg™: Managed and Foreign Tables in Unity Catalog

Unity Catalog support for Apache Iceberg™ brings open, interoperable table formats to the heart of the Databricks Lakehouse. In this session, we’ll introduce new capabilities that allow you to write Iceberg tables from any REST-compatible engine, apply fine-grained governance across all data, and unify access to external Iceberg catalogs like AWS Glue, Hive Metastore, and Snowflake Horizon. Learn how Databricks is eliminating data silos, simplifying performance with Predictive Optimization, and advancing a truly open lakehouse architecture with Delta and Iceberg side by side.

Iceberg Geo Type: Transforming Geospatial Data Management at Scale

The Apache Iceberg™ community is introducing native geospatial type support, addressing key challenges in managing geospatial data at scale, including fragmented formats and inefficiencies in storing large spatial datasets. This talk will delve into the origins of the Iceberg geo type, its specification design and future goals. We will examine the impact on both the geospatial and Iceberg communities, in introducing a standard data warehouse storage layer to the geospatial community, and enabling optimized geospatial analytics for Iceberg users. We will also present a live demonstration of the Iceberg geo data type with Apache Sedona™ and Apache Spark™, showcasing how it simplifies and accelerates geospatial analytics workflows and queries. Finally, we will also provide an in-depth look at its current capabilities and outline the roadmap for future developments, and offer a perspective on its role in advancing geospatial data management in the industry.

Sponsored by: Redpanda | IoT for Fun & Prophet: Scaling IoT and predicting the future with Redpanda, Iceberg & Prophet

In this talk, we’ll walk through a complete real-time IoT architecture—from an economical, high-powered ESP32 microcontroller publishing environmental sensor data to AWS IoT, through Redpanda Connect into a Redpanda BYOC cluster, and finally into Apache Iceberg for long-term analytical storage. Once the data lands, we’ll query it using Python and perform linear regression with Prophet to forecast future trends. Along the way, we’ll explore the design of a scalable, cloud-native pipeline for streaming IoT data. Whether you're tracking the weather or building the future, this session will help you architect with confidence—and maybe even predict it.

The Future of Open Table Formats: Delta Lake, Iceberg, and More

Open table formats are evolving quickly. In this session, we’ll explore the latest features of Delta Lake and Apache Iceberg™ , including a look at the emerging Iceberg v3 specification. Join us to learn about what’s driving format innovation, how interoperability is becoming real, and what it means for the future of data architecture.

Incremental Iceberg Table Replication at Scale

Apache Iceberg is a popular table format for managing large analytical datasets. But replicating iceberg tables at scale can be a daunting task — especially when dealing with its hierarchical metadata. In this talk, we present an end-to-end workflow for replicating Apache Iceberg tables, leveraging Apache Spark to ensure that backup tables remain identical to their source counterparts. More excitingly, we have contributed these libraries back to the open-source community. Attendees will gain a comprehensive understanding of how to set up replication workflows for Iceberg tables, as well as practical guidance on how to manage and maintain replicated datasets at scale. This talk is ideal for data engineers, platform architects and practitioners looking to apply replication and disaster recovery for Apache Iceberg in complex data ecosystems.

Sponsored by: Google Cloud | Powering AI & Analytics: Innovations in Google Cloud Storage for Data Lakes

Enterprise customers need a powerful and adaptable data foundation to navigate demands of AI and multi-cloud environments. This session dives into how Google Cloud Storage serves as a unified platform for modern analytics data lakes, together with Databricks. Discover how Google Cloud Storage provides key innovations like performance optimizations for Apache Iceberg, Anywhere Cache as the easiest way to colocate storage and compute, Rapid Storage for ultra low latency object reads and appends, and Storage Intelligence for vital data insights and recommendations. Learn how you can optimize your infrastructure to unlock the full value of your data for AI-driven success.

Extending the Lakehouse: Power Interoperable Compute With Unity Catalog Open APIs

The lakehouse is built for storage flexibility, but what about compute? In this session, we’ll explore how Unity Catalog enables you to connect and govern multiple compute engines across your data ecosystem. With open APIs and support for the Iceberg REST Catalog, UC lets you extend access to engines like Trino, DuckDB, and Flink while maintaining centralized security, lineage, and interoperability. We will show how you can get started today working with engines like Apache Spark and Starburst to read and write to UC managed tables with some exciting demos. Learn how to bring flexibility to your compute layer—without compromising control.

Iceberg Table Format Adoption and Unified Metadata Catalog Implementation in Lakehouse Platform

DoorDash Data organization actively adopts LakeHouse paradigm. This presentation describes the methodology which allows to migrate the classic Data Warehouse and Data Lake platforms to unified LakeHouse solution.The objective of this effort include Elimination of excessive data movement.Seamless integration and consolidation of the query engine layers, including Snowflake, Databricks, EMR and Trino.Query performance optimization.Abstracting away complexity of underlying storage layers and table formatsStrategic and justified decision on the Unified Metadata catalog used across varios compute platforms

No More Fragile Pipelines: Kafka and Iceberg the Declarative Way

Moving data between operational systems and analytics platforms is often painful. Traditional pipelines become complex, brittle, and expensive to maintain.Take Kafka and Iceberg: batching on Kafka causes ingestion bottlenecks, while streaming-style writes to Iceberg create too many small Parquet files—cluttering metadata, degrading queries, and increasing maintenance overhead. Frequent updates further strain background table operations, causing retries—even before dealing with schema evolution. But much of this complexity is avoidable. What if Kafka Topics and Iceberg Tables were treated as two sides of the same coin? By establishing a transparent equivalence, we can rethink pipeline design entirely. This session introduces Tableflow—a new approach to bridging streaming and table-based systems. It shifts complexity away from pipelines and into a unified layer, enabling simpler, declarative workflows. We’ll cover schema evolution, compaction, topic-to-table mapping, and how to continuously materialize and optimize thousands of topics as Iceberg tables. Whether modernizing or starting fresh, you’ll leave with practical insights for building resilient, scalable, and future-proof data architectures.

Master Schema Translations in the Era of Open Data Lake

Unity Catalog puts variety of schemas into a centralized repository, now the developer community wants more productivity and automation for schema inference, translation, evolution and optimization especially for the scenarios of ingestion and reverse-ETL with more code generations.Coinbase Data Platform attempts to pave a path with "Schemaster" to interact with data catalog with the (proposed) metadata model to make schema translation and evolution more manageable across some of the popular systems, such as Delta, Iceberg, Snowflake, Kafka, MongoDB, DynamoDB, Postgres...This Lighting Talk covers 4 areas: The complexity and caveats of schema differences among The proposed field-level metadata model, and 2 translation patterns: point-to-point vs hub-and-spoke Why Data Profiling be augmented to enhance schema understanding and translation Integrate it with Ingestion & Reverse-ETL in a Databricks-oriented eco system Takeaway: standardize schema lineage & translation

Cross-Cloud Data Mesh with Delta Sharing and UniForm in Mercedes-Benz

In this presentation, we'll show how we achieved a unified development experience for teams working on Mercedes-Benz Data Platforms in AWS and Azure. We will demonstrate how we implemented Azure to AWS and AWS to Azure data product sharing (using Delta Sharing and Cloud Tokens), integration with AWS Glue Iceberg tables through UniForm and automation to drive everything using Azure DevOps Pipelines and DABs. We will also show how to monitor and track cloud egress costs and how we present a consolidated view of all the data products and relevant cost information. The end goal is to show how customers can offer the same user experience to their engineers and not have to worry about which cloud or region the Data Product lives in. Instead, they can enroll in the data product through self-service and have it available to them in minutes, regardless of where it originates.