Tal Sofer

Activities

1

talks

Product Manager - Treeverse

Filter by Event / Source

Big Data LDN 2024 1

Talks & appearances

1 activities · Newest first

Search activities →

Dataset Versioning in the Age of Open Table Formats

2024-09-18 · Big Data LDN 2024

Face To Face

Data Management Delta Git Hudi Iceberg OTF

Open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi have dramatically transformed the data management landscape by enabling high-speed operations on massive datasets stored in object stores while maintaining ACID guarantees.

In this talk, we will explore the evolution and future of dataset versioning in the context of open table formats. Open table formats introduced the concept of table-level versioning and have become widely adopted standards. Data versioning systems that have emerged more recently, bringing best practices from software engineering into the data ecosystem, enable the management of multiple datasets within a large-scale data repository using Git-like semantics. Data versioning systems operate at the file level and are compatible with any open table format. On top of this, new catalogs that support these table formats and add a layer of access control are becoming the standard way to manage tabular datasets.

Despite these advancements, there remains a significant gap between current data versioning practices and the requirements for effective tabular dataset versioning.

The session will introduce the concept of a versioned catalog as a solution, demonstrating how it provides comprehensive data and metadata versioning for tables.

We’ll cover key requirements of tabular dataset management, including:

Capturing multi-table changes as single logical operations
Enabling seamless rollbacks without identifying each affected table
Implementing table format-aware versioning operations such as diff and merge

Join us to explore the future of dataset versioning in the era of open table formats and evolving data management practices!