What’s the big deal about Apache Iceberg anyway? "Might Iceberg solve problems for my team?" "I’m using Iceberg already, but I find it lacking in key areas!" If you have any of the above thoughts, this peer exchange is for you! Last year’s peer exchange on Apache Iceberg was standing room only given all the hype surrounding the open table format. However, when participants were asked when they might start testing Iceberg capabilities, most said: “wait at least a few months for the dust to settle.” A year later, the dust has settled and adoption of Iceberg among analytics engineers continues to grow. But there are still open questions and product integrations to be built. Join your peers in socially constructing knowledge that’ll inform you for the year to come and beyond!
Topic: Open Table Format (OTF)
Data is the backbone of modern decision-making, but centralizing it is only the tip of the iceberg. Entitlements, secure sharing, and just-in-time availability are critical challenges for any large-scale platform. Join Goldman Sachs as we reveal how our Legend Lakehouse, coupled with Databricks, overcomes these hurdles to deliver high-quality, governed data at scale. By leveraging an open table format (Apache Iceberg) and an open catalog format (Unity Catalog), we ensure platform interoperability and vendor neutrality. Databricks Unity Catalog then provides a robust entitlement system that aligns with our data contracts, ensuring consistent access control across producer and consumer workspaces. Finally, Legend functions, integrating with Databricks user-defined functions (UDFs), offer real-time data enrichment and secure transformations without exposing raw datasets. Discover how these components unite to streamline analytics, bolster governance, and power innovation.
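To make the UDF-based enrichment pattern concrete, here is a minimal PySpark sketch of masking a column through a user-defined function before consumers query it. This is not the Legend/Databricks integration described in the abstract; the table name (`main.sales.orders`), the `customer_email` column, and the masking rule are assumptions chosen only for illustration.

```python
# Minimal PySpark sketch of UDF-based enrichment/masking; the table name
# (main.sales.orders), column name, and masking rule are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-enrichment-sketch").getOrCreate()


def mask_email(email):
    """Return a masked form of an email address so raw values are never exposed."""
    if email is None or "@" not in email:
        return None
    local, domain = email.split("@", 1)
    return local[0] + "***@" + domain


mask_email_udf = udf(mask_email, StringType())

# Consumers query the enriched view instead of the raw table.
orders = spark.table("main.sales.orders")
enriched = orders.withColumn("customer_email", mask_email_udf(col("customer_email")))
enriched.createOrReplaceTempView("orders_enriched")
```

The point of the pattern is that access control and transformation travel together: the governed view exposes enriched values while the raw dataset stays behind the catalog's entitlements.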
Open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi have dramatically transformed the data management landscape by enabling high-speed operations on massive datasets stored in object stores while maintaining ACID guarantees.
In this talk, we will explore the evolution and future of dataset versioning in the context of open table formats. Open table formats introduced the concept of table-level versioning and have become widely adopted standards. More recently, data versioning systems have emerged that bring best practices from software engineering into the data ecosystem, enabling multiple datasets within a large-scale data repository to be managed with Git-like semantics. These systems operate at the file level and are compatible with any open table format. On top of this, new catalogs that support these table formats and add a layer of access control are becoming the standard way to manage tabular datasets.
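To make "table-level versioning" concrete, here is a small PySpark sketch of Iceberg-style snapshots, time travel, and single-table rollback. It assumes a Spark session already configured with the Iceberg runtime and a catalog named `demo`; the table name, snapshot id, and timestamp are placeholders, and the exact SQL available depends on your Spark and Iceberg versions.

```python
# PySpark sketch of Iceberg table-level versioning; the catalog/table names,
# snapshot id, and timestamp below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-versioning-sketch").getOrCreate()

# Every commit to an Iceberg table produces a new snapshot.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, action STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'click')")

# Inspect the table's snapshot history through its metadata table.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()

# Time travel: read the table as of an earlier snapshot or timestamp.
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 1234567890123456789").show()
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

# Roll a single table back to a previous snapshot.
spark.sql("CALL demo.system.rollback_to_snapshot('db.events', 1234567890123456789)")
```

Note that each of these operations is scoped to one table at a time, which is exactly where the gap described next appears.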
Despite these advancements, there remains a significant gap between current data versioning practices and the requirements for effective tabular dataset versioning.
The session will introduce the concept of a versioned catalog as a solution, demonstrating how it provides comprehensive data and metadata versioning for tables.
We’ll cover key requirements of tabular dataset management (see the sketch after this list), including:
- Capturing multi-table changes as single logical operations
- Enabling seamless rollbacks without identifying each affected table
- Implementing table format-aware versioning operations such as diff and merge
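As a purely hypothetical sketch of how these requirements might surface in a versioned-catalog API (the `VersionedCatalog` class and all of its methods are invented for illustration and do not correspond to any existing library), consider:

```python
# Hypothetical versioned-catalog API: every class and method here is
# invented for illustration and does not correspond to an existing library.
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One logical change that may span multiple tables."""
    message: str
    tables: dict = field(default_factory=dict)  # table name -> snapshot id


class VersionedCatalog:
    """Toy in-memory model of multi-table, Git-like catalog versioning."""

    def __init__(self):
        self.branches = {"main": []}  # branch name -> list of commits

    def commit(self, branch, message, table_snapshots):
        # Requirement 1: capture a multi-table change as one logical operation.
        c = Commit(message, dict(table_snapshots))
        self.branches[branch].append(c)
        return c

    def rollback(self, branch):
        # Requirement 2: undo the whole commit, without listing affected tables.
        return self.branches[branch].pop()

    def branch(self, name, source="main"):
        # Branching copies the commit history, like a lightweight Git branch.
        self.branches[name] = list(self.branches[source])

    def diff(self, branch_a, branch_b):
        # Requirement 3: format-aware diff, reduced here to the set of tables
        # whose latest snapshot differs between the two branches.
        def latest(branch):
            state = {}
            for c in self.branches[branch]:
                state.update(c.tables)
            return state
        a, b = latest(branch_a), latest(branch_b)
        return {t for t in a.keys() | b.keys() if a.get(t) != b.get(t)}


# Example: one commit spanning two tables, then a single rollback.
catalog = VersionedCatalog()
catalog.commit("main", "load June orders", {"orders": "snap-101", "order_items": "snap-202"})
catalog.branch("dev")
catalog.commit("dev", "backfill order_items", {"order_items": "snap-203"})
print(catalog.diff("main", "dev"))  # {'order_items'}
catalog.rollback("dev")             # dev returns to the state it was branched from
```

A real implementation would track table-format metadata (snapshots, schemas, partition specs) rather than opaque snapshot ids, which is what makes format-aware diff and merge possible.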
Join us to explore the future of dataset versioning in the era of open table formats and evolving data management practices!