Speaker

Tal Sofer

Product Manager - Treeverse

Talks & appearances

Open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi have dramatically transformed the data management landscape by enabling high-speed operations on massive datasets stored in object stores while maintaining ACID guarantees.

In this talk, we will explore the evolution and future of dataset versioning in the context of open table formats. Open table formats introduced table-level versioning and have become widely adopted standards. More recently, data versioning systems have emerged that bring best practices from software engineering into the data ecosystem: they manage multiple datasets within a large-scale data repository using Git-like semantics, operate at the file level, and are compatible with any open table format. On top of this, new catalogs that support these table formats and add a layer of access control are becoming the standard way to manage tabular datasets.

Despite these advancements, there remains a significant gap between current data versioning practices and the requirements for effective tabular dataset versioning.

The session will introduce the concept of a versioned catalog as a solution, demonstrating how it provides comprehensive data and metadata versioning for tables.

We’ll cover key requirements of tabular dataset management, including:

  • Capturing multi-table changes as single logical operations
  • Enabling seamless rollbacks without identifying each affected table
  • Implementing table format-aware versioning operations such as diff and merge
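
To make these requirements concrete, here is a minimal in-memory sketch of what a versioned catalog could look like. It is an illustration only, not the design presented in the talk or the API of any existing catalog: the names VersionedCatalog, Commit, commit, rollback, and diff are hypothetical, and snapshot ids are plain strings standing in for table-format snapshots. Each commit records the full mapping from table name to snapshot id, so a multi-table change is one logical operation and a rollback never has to name the affected tables. The diff here works only at the catalog level (which tables point at different snapshots); a format-aware diff or merge would additionally need to inspect table metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """One catalog version: a full mapping of table name -> snapshot id."""
    id: int
    parent: Optional[int]
    message: str
    tables: Dict[str, str]


class VersionedCatalog:
    """Toy in-memory catalog; every name here is hypothetical."""

    def __init__(self) -> None:
        # Commit 0 is an empty catalog so that head always points somewhere.
        self._commits: List[Commit] = [Commit(0, None, "init", {})]
        self.head = 0

    def commit(self, changes: Dict[str, Optional[str]], message: str) -> int:
        """Record changes to several tables as a single logical operation.

        A snapshot value of None drops the table from the catalog.
        """
        tables = dict(self._commits[self.head].tables)
        for name, snapshot in changes.items():
            if snapshot is None:
                tables.pop(name, None)
            else:
                tables[name] = snapshot
        new = Commit(len(self._commits), self.head, message, tables)
        self._commits.append(new)
        self.head = new.id
        return new.id

    def rollback(self, commit_id: int) -> None:
        """Move head to an earlier commit; all tables revert together."""
        if not 0 <= commit_id < len(self._commits):
            raise ValueError(f"unknown commit {commit_id}")
        self.head = commit_id

    def tables(self, commit_id: Optional[int] = None) -> Dict[str, str]:
        """Return the table -> snapshot mapping at a commit (default: head)."""
        cid = self.head if commit_id is None else commit_id
        return dict(self._commits[cid].tables)

    def diff(self, a: int, b: int) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
        """List tables whose snapshot differs between commits a and b."""
        old, new = self.tables(a), self.tables(b)
        return {
            name: (old.get(name), new.get(name))
            for name in sorted(set(old) | set(new))
            if old.get(name) != new.get(name)
        }


catalog = VersionedCatalog()
first = catalog.commit({"orders": "snap-1", "customers": "snap-1"}, "initial load")
second = catalog.commit({"orders": "snap-2", "customers": "snap-2"}, "backfill")
changed = catalog.diff(first, second)  # both tables changed in one commit
catalog.rollback(first)                # reverts both without naming them
```

Storing the whole table mapping in every commit keeps the sketch short; a real system would more likely store deltas or pointers to shared metadata.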

Join us to explore the future of dataset versioning in the era of open table formats and evolving data management practices!