
Topic: Delta Lake

Tags: data_lake, acid_transactions, time_travel, file_format, storage

119 tagged activities

Activity trend: peak of 117 activities per quarter (2020-Q1 to 2026-Q1)

Activities

Showing filtered results (filter: Databricks Data + AI Summit 2023)
The Evolution of Delta Lake from Data + AI Summit 2024

Shant Hovsepian, Chief Technology Officer of Data Warehousing at Databricks, explains why Delta Lake is the most adopted open lakehouse format.

Includes:

  • Delta Lake UniForm GA (support for and compatibility with Hudi, Apache Iceberg, and Delta)
  • Delta Lake Liquid Clustering
  • Delta Lake production-ready catalog (Iceberg REST API)
  • The growth and strength of the Delta ecosystem
  • Delta Kernel
  • DuckDB integration with Delta
  • Delta 4.0

Announcing DuckDB Support for Delta Lake and a DuckDB Extension to Unity Catalog - Hannes Mühleisen

Hannes Mühleisen of DuckDB Labs addressed an audience of thousands during his keynote address at Data + AI Summit 2024 in San Francisco. Mühleisen announced DuckDB support for Delta Lake, a new DuckDB Extension to Unity Catalog, and Community Extensions.
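
As a rough illustration of what the announced support enables, here is a minimal sketch of querying a Delta table from DuckDB's Python API via the delta extension; the table path and column names are hypothetical.

```python
import duckdb

con = duckdb.connect()
# The delta extension provides delta_scan() for reading Delta Lake tables.
con.sql("INSTALL delta; LOAD delta;")

# Hypothetical table location; local paths and cloud object stores both work.
df = con.sql("""
    SELECT status, count(*) AS n
    FROM delta_scan('s3://my-bucket/events')
    GROUP BY status
""").df()
print(df)
```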

Speaker: Hannes Mühleisen, Creator of DuckDB, DuckDB Labs

Data Sharing and Cross-Organization Collaboration. Presented by Matei Zaharia at Data + AI Summit

Speaker: Matei Zaharia, Original Creator of Apache Spark™ and MLflow; Chief Technologist, Databricks

Summary: Data sharing and collaboration are important aspects of the data space. Matei Zaharia explains the evolution of the Databricks data platform to facilitate data sharing and collaboration for customers and their partners.

Delta Sharing lets you share selected parts of your tables with third parties that are authorized to view them. Over 16,000 data recipients use Delta Sharing, and 40% of them are not on Databricks, a testament to the protocol's open nature.
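
For context, here is a minimal sketch of how a recipient outside Databricks might read a shared table with the open-source delta-sharing Python connector; the profile file name and the share/schema/table names are hypothetical.

```python
import delta_sharing

# Credential file ("profile") issued by the data provider; name is hypothetical.
profile = "retail.share"

client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())   # discover what has been shared with you

# Load one shared table directly into pandas, no Databricks account required.
df = delta_sharing.load_as_pandas(f"{profile}#retail_share.sales.daily_orders")
```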

Databricks Marketplace has been growing rapidly and now has over 2,000 data listings, making it one of the largest data marketplaces available. New Marketplace partners include T-Mobile, Tableau, Atlassian, Epsilon, Shutterstock and more.

To learn more about Delta Sharing features and the expansion of the partner sharing ecosystem, see the recent blog: https://www.databricks.com/blog/whats-new-data-sharing-and-collaboration

Lakehouse Format Interoperability With UniForm. Shant Hovsepian presents at Data + AI Summit 2024

Shant Hovsepian, Chief Technology Officer of Data Warehousing at Databricks, discusses the UniForm data format and its interoperability with other data formats. Shant explains that Delta Lake is the most adopted open lakehouse format.
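
A minimal sketch of what enabling UniForm looks like on a Delta table, using the documented table properties; the catalog, schema, and table names are hypothetical.

```python
# Create a Delta table that also maintains Iceberg metadata via UniForm,
# so Iceberg-compatible engines can read it. Names are hypothetical.
spark.sql("""
  CREATE TABLE main.analytics.events (
    id BIGINT,
    ts TIMESTAMP
  )
  USING DELTA
  TBLPROPERTIES (
    'delta.enableIcebergCompatV2' = 'true',
    'delta.universalFormat.enabledFormats' = 'iceberg'
  )
""")
```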

Speaker: Shant Hovsepian, Chief Technology Officer of Data Warehousing, Databricks

The Best Data Warehouse is a Lakehouse

Reynold Xin, Co-founder and Chief Architect at Databricks, presented at Data + AI Summit 2024 on Databricks SQL, its advancements, and how to drive performance improvements with the Databricks Data Intelligence Platform.

Speakers: Reynold Xin, Co-founder and Chief Architect, Databricks; Pearl Ubaru, Technical Product Engineer, Databricks

Main Points and Key Takeaways (AI-generated summary)

Introduction of Databricks SQL:

  • Databricks SQL was announced four years ago and has become the fastest-growing product in Databricks history.
  • Over 7,000 customers, including Shell, AT&T, and Adobe, use Databricks SQL for data warehousing.

Evolution from Data Warehouses to Lakehouses:

  • Traditional data architectures involved separate data warehouses (for business intelligence) and data lakes (for machine learning and AI).
  • The lakehouse concept combines the best aspects of data warehouses and data lakes into a single package, addressing issues of governance, storage formats, and data silos.

Technological Foundations:

  • To support the lakehouse, Databricks developed Delta Lake (storage layer) and Unity Catalog (governance layer).
  • Over time, lakehouses have been recognized as the future of data architecture.

Core Data Warehousing Capabilities:

  • Databricks SQL has evolved to support essential data warehousing functionality such as full SQL support, materialized views, and role-based access control.
  • Integration with major BI tools like Tableau, Power BI, and Looker is available out of the box, reducing migration costs.

Price Performance:

  • Databricks SQL offers significant improvements in price performance, which is crucial given the high costs associated with data warehouses.
  • Databricks SQL scales more efficiently than traditional data warehouses, which struggle with larger data sets.

Incorporation of AI Systems:

  • Databricks has integrated AI systems at every layer of the engine, improving performance significantly.
  • AI systems automate data clustering, query optimization, and predictive indexing, enhancing efficiency and speed.

Benchmarks and Performance Improvements:

  • Databricks SQL has seen dramatic improvements, with some benchmarks showing a 60% increase in speed compared to 2022.
  • Real-world benchmarks indicate that Databricks SQL can handle high-concurrency loads with consistently low latency.

User Experience Enhancements:

  • Significant efforts have been made to improve the user experience, making Databricks SQL more accessible to analysts and business users, not just data scientists and engineers.
  • New features include visual data lineage, simplified error messages, and AI-driven recommendations for error fixes.

AI and SQL Integration:

  • Databricks SQL now supports AI functions and vector search, allowing users to perform advanced analysis and query optimization with ease (a sketch follows this list).
  • The platform enables seamless integration with AI models, which can be published and accessed through Unity Catalog.
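
As a rough sketch of the AI-functions idea, the snippet below calls ai_query from SQL against a model serving endpoint; the table name and endpoint name are hypothetical.

```python
# Run an LLM over a column straight from SQL; endpoint and table are hypothetical.
sentiment = spark.sql("""
  SELECT review_id,
         ai_query(
           'my-llm-endpoint',
           CONCAT('Classify the sentiment of this review as positive, negative, or neutral: ',
                  review_text)
         ) AS sentiment
  FROM main.reviews.product_reviews
""")
sentiment.show()
```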

Conclusion:

  • Databricks SQL has transformed into a comprehensive data warehousing solution that is powerful, cost-effective, and user-friendly.
  • The lakehouse approach is presented as a superior alternative to traditional data warehouses, offering better performance and lower costs.

Data + AI Summit 2024 - Keynote Day 2 - Full video

by Bilal Aslam (Databricks), Yejin Choi (University of Washington; AI2), Darshana Sivakumar (Databricks), Ryan Blue (Tabular), Zeashan Pappa (Databricks), Ali Ghodsi (Databricks), Reynold Xin (Databricks), Matei Zaharia (Databricks), Hannes Mühleisen (DuckDB Labs), Alexander Booth (Texas Rangers Baseball Club), Tareef Kawaf (Posit Software, PBC)

Speakers:

  • Alexander Booth, Asst Director of Research & Development, Texas Rangers
  • Ali Ghodsi, Co-Founder and CEO, Databricks
  • Bilal Aslam, Sr. Director of Product Management, Databricks
  • Darshana Sivakumar, Staff Product Manager, Databricks
  • Hannes Mühleisen, Creator of DuckDB, DuckDB Labs
  • Matei Zaharia, Chief Technology Officer and Co-Founder, Databricks
  • Reynold Xin, Chief Architect and Co-Founder, Databricks
  • Ryan Blue, CEO, Tabular
  • Tareef Kawaf, President, Posit Software, PBC
  • Yejin Choi, Sr Research Director Commonsense AI, AI2, University of Washington
  • Zeashan Pappa, Staff Product Manager, Databricks

About Databricks Databricks is the Data and AI company. More than 10,000 organizations worldwide — including Block, Comcast, Conde Nast, Rivian, and Shell, and over 60% of the Fortune 500 — rely on the Databricks Data Intelligence Platform to take control of their data and put it to work with AI. Databricks is headquartered in San Francisco, with offices around the globe, and was founded by the original creators of Lakehouse, Apache Spark™, Delta Lake and MLflow.

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data… Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Delta-rs, Apache Arrow, Polars, WASM: Is Rust the Future of Analytics?

Rust is a unique language whose traits make it very appealing for data engineering. In this session, we'll walk through the aspects of the language that make it such a good fit for big data processing: how it improves performance, how it provides stronger safety guarantees, and how its compatibility with a wide range of existing tools positions it to become a major building block for the future of analytics.

We will also take a hands-on look through real code examples at a few emerging technologies built on top of Rust that utilize these capabilities, and learn how to apply them to our modern lakehouse architecture.
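
A minimal sketch of the kind of Rust-powered stack the session covers, used here through the Python bindings of delta-rs (the deltalake package) and Polars; the table path and column names are hypothetical.

```python
import polars as pl
from deltalake import DeltaTable

# delta-rs reads the Delta table natively, no Spark or JVM required.
dt = DeltaTable("./data/events")              # hypothetical local table
arrow_table = dt.to_pyarrow_table()           # hand off via Apache Arrow

# Polars (Rust core) consumes the Arrow data for fast single-node analytics.
df = pl.from_arrow(arrow_table)
errors = df.filter(pl.col("status") == "error")
print(errors.height)
```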

Talk by: Oz Katz

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Five Things You Didn't Know You Could Do with Databricks Workflows

Databricks Workflows has come a long way since the initial days of orchestrating simple notebooks and JAR/wheel files. Now we can orchestrate multi-task jobs, chain tasks with lineage into DAGs with fan-in, fan-out, and many other patterns, and even run one Databricks job directly inside another.

Databricks Workflows takes its tagline, "orchestrate anything anywhere," seriously: it is a truly fully managed, cloud-native orchestrator for diverse workloads such as Delta Live Tables, SQL, notebooks, JARs, Python wheels, dbt, Apache Spark™, and ML pipelines, with excellent monitoring, alerting, and observability capabilities. In short, it is a one-stop product for all orchestration needs of an efficient lakehouse. Even better, it gives you full flexibility to run jobs in a cloud-agnostic, cloud-independent way and is available across AWS, Azure, and GCP.

In this session, we will take a deep dive into some of the most interesting features and showcase end-to-end demos that will allow you to take full advantage of Databricks Workflows for orchestrating the lakehouse.
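
As one hedged illustration of multi-task orchestration, the sketch below uses the Databricks SDK for Python to define a two-task job where the second task depends on the first; the job and notebook names are hypothetical, and serverless compute is assumed rather than an explicit cluster spec.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Two chained notebook tasks; fan-in/fan-out would add more tasks and edges.
created = w.jobs.create(
    name="nightly-lakehouse-refresh",                      # hypothetical job name
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest"),
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="ingest")],
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/transform"),
        ),
    ],
)
print(created.job_id)
```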

Talk by: Prashanth Babu

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Multicloud Data Governance on the Databricks Lakehouse

Across industries, a multicloud setup has quickly become the reality for large organizations. Multicloud introduces new governance challenges: permission models often do not translate from one cloud to the other, and when they do, they are insufficiently granular to accommodate privacy requirements and the principle of least privilege. This problem can be especially acute for data and AI workloads that rely on sharing and aggregating large and diverse data sources across business-unit boundaries, and where governance models need to cover assets such as table rows and columns as well as ML features and models.

In this session, we will provide guidelines on how to overcome these challenges for companies that have adopted the Databricks Lakehouse as the collaborative space for data teams across the organization, by exploiting some of the unique product features of the Databricks platform. We will focus on a common scenario: a data platform team providing data assets to two different ML teams, one on the same cloud and the other on a different cloud.

We will explain the step-by-step setup of a unified governance model by leveraging the following components and conventions:

  • Unity Catalog for implementing fine-grained access control across all data assets: files in cloud storage, rows and columns in tables, and ML features and models (a minimal grant sketch follows this list)
  • The Databricks Terraform provider to automatically enforce guardrails and permissions across clouds
  • Account-level SSO integration and identity federation to centrally administer access across workspaces
  • Delta Sharing to seamlessly propagate changes in provider data sets to consumers in near real time
  • Centralized audit logging for a unified view of which asset was accessed by whom
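
As a minimal sketch of the Unity Catalog piece, the grants below give a remote-cloud ML team read access to shared global data and write access to its own schema; all catalog, schema, table, and group names are hypothetical.

```python
# Read-only access to governed global data for the ML team on the other cloud.
spark.sql("GRANT USE CATALOG ON CATALOG global_data TO `ml_team_cloud_b`")
spark.sql("GRANT USE SCHEMA ON SCHEMA global_data.features TO `ml_team_cloud_b`")
spark.sql("GRANT SELECT ON TABLE global_data.features.customer_features TO `ml_team_cloud_b`")

# Write access only inside the team's own local schema.
spark.sql("GRANT CREATE TABLE, MODIFY ON SCHEMA local_data.ml_team_cloud_b TO `ml_team_cloud_b`")
```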

Talk by: Ioannis Papadopoulos and Volker Tjaden

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Nebula: The Journey of Scaling Instacart’s Data Pipelines with Apache Spark™ and Lakehouse

Instacart has gone through immense growth during the pandemic, and the trend continues. Instacart Ads is no exception in this growth story. We have launched many new product lines, including display and video ads covering the full advertising funnel, to address the increasing demand from our retail partners. We have built advanced models to auto-suggest optimal bidding and increase ROI for our CPG partners. Advertisers' trust is the utmost priority, hence the quest to build a top-class ads measurement platform.

Ads data processing requires complex data verifications to update ads serving stats. In our ETL pipelines these were implemented through files containing thousands of lines of raw SQL, which were hard to scale, test, and iterate upon. Our data engineers used to spend hours testing small changes due to the lack of a local testing mechanism. These pain points underscored our need for better tools. After some research, we chose Apache Spark™ as our preferred tool to rebuild the ETLs, and the Databricks platform made this move easier. In this session, we'll share our journey moving our pipelines to Spark and Delta Lake on Databricks. With Spark, Scala, and Delta we solved many problems that were slowing the team's productivity. Some key areas that will be covered include:

  • Modular and composable code
  • Unit testing framework
  • Incremental event processing with Spark Structured Streaming (see the sketch after this list)
  • Granular resource tuning for better performance and cost efficiency
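
A minimal sketch of the incremental-processing pattern referenced above, ingesting raw ad events into a Delta table with Spark Structured Streaming; the paths, schema, and table names are hypothetical.

```python
from pyspark.sql import functions as F

# Incrementally pick up new event files and append them to a Delta table.
events = (
    spark.readStream.format("json")
    .schema("event_id STRING, campaign_id STRING, event_ts TIMESTAMP")
    .load("s3://ads-raw/events/")                        # hypothetical landing path
)

(
    events.withColumn("event_date", F.to_date("event_ts"))
    .writeStream.format("delta")
    .option("checkpointLocation", "s3://ads-lake/_checkpoints/events")
    .partitionBy("event_date")
    .trigger(availableNow=True)                          # process backlog, then stop
    .toTable("ads.events_bronze")                        # hypothetical Delta table
)
```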

Other than the domain business logic, the problems discussed here are quite common for performing data processing at scale. We hope that sharing our learnings will benefit others who are going through similar growth challenges or migrating to Lakehouse.

Talk by: Devlina Das and Arthur Li

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Scaling Deep Learning Using Delta Lake Storage Format on Databricks

Delta Lake is an open-source storage format ideally suited to storing large-scale datasets for both single-node and distributed training of deep learning models. The Delta Lake storage format gives deep learning practitioners unique data management capabilities for working with their datasets. The challenge is that, as of now, it's not possible to train PyTorch models directly from Delta Lake.

The PyTorch community recently introduced the TorchData library for efficient data loading. This library supports many formats out of the box, but not Delta Lake. This talk will demonstrate using the Delta Lake storage format for single-node and distributed PyTorch training, using the torchdata framework and the standalone delta-rs Delta Lake implementation.
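
A minimal, hedged sketch of the approach described, loading a small Delta table with delta-rs and wrapping it in a torchdata datapipe for a PyTorch DataLoader; the table path and column names are hypothetical, and a large table would be streamed file by file rather than materialized at once.

```python
from deltalake import DeltaTable
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# Read the Delta table with delta-rs (no Spark needed) into Arrow, then Python rows.
table = DeltaTable("/dbfs/ml/training_set")          # hypothetical table path
rows = table.to_pyarrow_table().to_pylist()          # list of {"features": ..., "label": ...}

# Wrap the rows as a datapipe and feed a standard DataLoader.
pipe = IterableWrapper(rows).map(lambda r: (r["features"], r["label"]))
loader = DataLoader(pipe, batch_size=64)

for features, labels in loader:
    pass  # training loop goes here
```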

Talk by: Michael Shtelma

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Unlocking Near Real Time Data Replication with CDC, Apache Spark™ Streaming, and Delta Lake

Tune into DoorDash's journey from a flaky ETL system with 24-hour data delays to a standardized CDC streaming pattern across more than 150 databases, producing near real-time data in a scalable, configurable, and reliable manner.

During this journey, understand how we use Delta Lake to build a self-serve, read-optimized data lake with data latencies of 15 minutes, while reducing operational overhead. Furthermore, understand how certain tradeoffs, such as conceding to a non-real-time system, allow for multiple optimizations while still permitting OLTP query use cases, and the benefits this provides.
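
For illustration, a common way to implement this kind of CDC pattern is a Structured Streaming job that merges each micro-batch into a Delta table; the sketch below assumes Debezium-style events on Kafka, and all topic, table, and column names are hypothetical rather than DoorDash's actual implementation.

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

# Hypothetical flattened CDC payload: primary key, row fields, and an op flag.
cdc_schema = "order_id BIGINT, status STRING, updated_at TIMESTAMP, op STRING"

def upsert_batch(batch_df, batch_id):
    target = DeltaTable.forName(spark, "lake.orders")     # target keeps the same columns
    (
        target.alias("t")
        .merge(batch_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedDelete(condition="s.op = 'd'")         # rows deleted in the source DB
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")      # hypothetical broker
    .option("subscribe", "postgres.public.orders")         # hypothetical CDC topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), cdc_schema).alias("r"))
    .select("r.*")
    .writeStream.foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3://lake/_checkpoints/orders")
    .start()
)
```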

Talk by: Ivan Peng and Phani Nalluri

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Data Sharing and Beyond with Delta Sharing

Stepping into this brave new digital world, we are certain that data will be a central product for many organizations: the way they convey their knowledge and assets will be through data and analytics. Delta Sharing is the world's first open protocol for secure and scalable real-time data sharing. Through our customer conversations, there is a lot of anticipation about how Delta Sharing can be extended to non-tabular assets, such as machine learning experiments and models.

In this session, we will cover how we extended the Delta Sharing protocol to other sharing workflows, enabling the sharing of ML models, arbitrary files, and more. The development resulted in Arcuate, a Databricks Labs project with a data sharing flavor. The session will start with the high-level approach and how it can be extended to cover other similar use cases. It will then move to our implementation and how it integrates seamlessly with the Databricks-managed Delta Sharing server and notebooks. We conclude with lessons learned and our vision for the future of data sharing and beyond.

Talk by: Vuong Nguyen and Milos Colic

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

How Coinbase Built and Optimized SOON, a Streaming Ingestion Framework

Data with low latency is important for real-time incident analysis and metrics. Though we have up-to-date data in OLTP databases, they cannot support those scenarios: data needs to be replicated to a data warehouse to serve queries using GROUP BY and JOIN across multiple tables from different systems. At Coinbase, we designed SOON (Spark cOntinuOus iNgestion), based on Kafka, Kafka Connect, and Apache Spark™, as an incremental table replication solution that replicates tables of any size from any database to Delta Lake in a timely manner. It also naturally supports ingestion of Kafka events.

SOON incrementally ingests Kafka events as appends, updates, and deletes to an existing table on Delta Lake. The events fall into two categories: CDC (change data capture) events generated by Kafka Connect source connectors, and non-CDC events emitted by frontend or backend services. Both types can be appended or merged into Delta Lake. Non-CDC events can be in any format, but CDC events must follow the standard SOON CDC schema; we implemented Kafka Connect SMTs to transform raw CDC events into this standardized format. SOON unifies all streaming ingestion scenarios, so users only need to learn one onboarding experience and the team only needs to maintain one framework.

We care about ingestion performance. The biggest append-only table onboarded has ingress traffic of hundreds of thousands of events per second; the biggest CDC-merge table onboarded has a snapshot size of a few TBs and CDC update traffic of hundreds of thousands of events per second. Many innovative ideas are incorporated in SOON to improve its performance, such as min-max range merge optimization, KMeans merge optimization, no-update merge for deduplication, and generated columns as partitions.
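
As one concrete example of the last idea, Delta Lake generated columns let a table derive its partition column from the event payload automatically; the table and column names below are hypothetical.

```python
# A date partition column generated from the event timestamp, so writers never
# have to compute it and readers get partition pruning on date filters.
spark.sql("""
  CREATE TABLE IF NOT EXISTS lake.tx_events (
    event_id   STRING,
    event_time TIMESTAMP,
    event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
  )
  USING DELTA
  PARTITIONED BY (event_date)
""")
```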

Talk by: Chen Guo

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Rapidly Implementing Major Retailer API at the Hershey Company

Accurate, reliable, and timely data is critical for CPG companies to stay ahead in highly competitive retailer relationships, and for a company like the Hershey Company, the commercial relationship with Walmart is one of the most important. The team at Hershey found themselves with a looming deadline for their legacy analytics services and targeted a migration to the brand new Walmart Luminate API. Working in partnership with Advancing Analytics, the Hershey Company leveraged a metadata-driven Lakehouse Architecture to rapidly onboard the new Luminate API, helping the category management teams to overhaul how they measure, predict, and plan their business operations.

In this session, we will discuss the impact Luminate has had on Hershey's business, covering key areas such as sales, supply chain, and retail field execution, and the technical building blocks that can be used to rapidly provision business users with the data they need, when they need it. We will discuss how key technologies enable this rapid approach, with Databricks Auto Loader ingesting and shaping the data, Delta streaming processing the data through the lakehouse, and Databricks SQL providing a responsive serving layer. The session will include commentary on the business impact as well as the technical journey.
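
A minimal sketch of the ingestion piece described above, using Databricks Auto Loader (cloudFiles) to incrementally land vendor files in a bronze Delta table; the paths and table name are hypothetical, not Hershey's actual pipeline.

```python
(
    spark.readStream.format("cloudFiles")                    # Databricks Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/luminate/_schemas")
    .load("/mnt/luminate/raw/")                              # hypothetical landing zone
    .writeStream
    .option("checkpointLocation", "/mnt/luminate/_checkpoints/bronze")
    .trigger(availableNow=True)
    .toTable("cpg.luminate_bronze")                          # hypothetical bronze table
)
```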

Talk by: Simon Whiteley and Jordan Donmoyer

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Self-Service Data Analytics and Governance at Enterprise Scale with Unity Catalog

This session focuses on one of the first Unity Catalog implementations for a large-scale enterprise. In this scenario, a cloud-scale analytics platform based on the lakehouse approach serves 7,500 active users, with a potential 1,500 further users who are subject to special governance rules. They consume more than 600 TB of data stored in Delta Lake, continuously growing at more than 1 TB per day, and growth may accelerate as local country data is added. The existing data platform therefore had to be extended so that users can combine global data with local data from their countries. A new data management approach was required that reflects strict, need-to-know information security rules. The core requirements are: read-only access to global data, the ability to write into local data, and the ability to share the results.

Because of a very pronounced information security awareness and a lack of technological possibilities, it had previously been difficult or impossible to analyze and exchange data across disciplines, so a lot of business potential and gains could not be identified and realized.

With recent developments in the technology and the lakehouse approach as a basis, Unity Catalog allowed us to develop a solution that meets high requirements for security and process, and enables globally secured, interdisciplinary data exchange and analysis at scale. This solution enables the democratization of data. The result is not only the ability to gain better insights for business management, but also to generate entirely new business cases and products that require a higher degree of data integration, and to encourage cultural change. We highlight technical challenges and solutions, present best practices, and point out the benefits of implementing Unity Catalog for enterprises.

Talk by: Artem Meshcheryakov and Pascal van Bellen

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Streaming Schema Drift Discovery and Controlled Mitigation

When creating streaming workloads with Databricks, it can sometimes be difficult to capture and understand the current structure of your source data. For example, what happens if you are ingesting JSON events from a vendor, and the keys are very sparsely populated, or contain dynamic content? Ideally, data engineers want to "lock in" a target schema in order to minimize complexity and maximize performance for known access patterns. What do you do when your data sources just don't cooperate with that vision? The first step is to quantify how far your current source data is drifting from your established Delta table. But how?

This session will demonstrate a way to capture and visualize drift across all your streaming tables. The next question is: "Now that I see all of the data I'm missing, how do I selectively promote some of these keys into DataFrame columns?" The second half of this session will demonstrate precisely how to do a schema migration with minimal job downtime.
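
One simple way to quantify drift, sketched below, is to compare the schema Spark infers for a sample of the incoming JSON against the established Delta table's schema; the paths and table name are hypothetical, and this is not necessarily the method demonstrated in the session.

```python
# Infer the schema of a recent slice of the raw vendor feed (hypothetical path).
incoming = spark.read.json("s3://vendor-feed/events/latest/")

# Compare top-level field names against the locked-in Delta table schema.
target_fields = {f.name for f in spark.table("lake.vendor_events").schema.fields}
incoming_fields = {f.name for f in incoming.schema.fields}

missing_from_target = sorted(incoming_fields - target_fields)
if missing_from_target:
    print(f"Source keys not captured by the Delta table: {missing_from_target}")
```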

Talk by: Alexander Vanadio

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Using Cisco Spaces Firehose API as a Stream of Data for Real-Time Occupancy Modeling

Honeywell manages the control of equipment for hundreds of thousands of buildings worldwide. Many of our outcomes relating to energy and comfort rely on knowing where people are in a building at any one time, so that we can target health and comfort conditions to the areas that are more densely populated. Many of these buildings have Cisco IT infrastructure. Using their Wi-Fi access points and the RSSI signal strength from people's laptops and phones, Cisco can calculate the number of people in each area of the building. Cisco Spaces offers this data as a real-time streaming source, and Honeywell HBT consumes it with Delta Live Tables pipelines.

Honeywell buildings can now receive this firehose data from hundreds of concurrent customers and provide occupancy data as a service to our vertical offerings in commercial, health, real estate, and education. We will discuss the benefits of using DLT to handle this kind of incoming stream data, illustrate the pain points we hit and the resolutions we undertook in successfully receiving the stream of Cisco data, and show how our DLT pipeline was designed and how it scaled to deal with huge quantities of real-time streaming data.
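
A minimal Delta Live Tables sketch of this pattern, reading the firehose from a streaming source and parsing occupancy fields; the broker, topic, and field names are hypothetical, not Honeywell's actual pipeline.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw occupancy events from the Cisco Spaces firehose")
def spaces_raw():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")    # hypothetical broker
        .option("subscribe", "cisco-spaces-firehose")        # hypothetical topic
        .load()
    )

@dlt.table(comment="Parsed occupancy readings per zone")
def spaces_parsed():
    raw = dlt.read_stream("spaces_raw")
    value = F.col("value").cast("string")
    return raw.select(
        F.get_json_object(value, "$.zoneId").alias("zone_id"),
        F.get_json_object(value, "$.deviceCount").cast("int").alias("device_count"),
        F.col("timestamp").alias("ingest_ts"),
    )
```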

Talk by: Paul Mracek and Chris Inkpen

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Databricks and Delta Lake: Lessons Learned from Building Akamai's Web Security Analytics Product

Akamai is a leading content delivery network (CDN) and cybersecurity company operating hundreds of thousands of servers in more than 135 countries worldwide. In this session, we will share our experiences and lessons learned from building and maintaining the Web Security Analytics (WSA) product, an interactive analytics platform powered by Databricks and Delta Lake that enables customers to efficiently analyze and take informed action on a high volume of streaming security events.

The WSA platform must be able to serve hundreds of queries per minute, scanning hundreds of terabytes of data from a six-petabyte data lake, with most queries, both aggregations and needle-in-a-haystack lookups, returning results within ten seconds. This session will cover how to use Databricks SQL warehouses and job clusters cost-effectively, and how to improve query performance using tools and techniques such as Delta Lake, Databricks Photon, and partitioning. This talk will be valuable for anyone looking to build and operate a high-performance analytics platform.
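
As a small illustration of the partitioning technique mentioned, compacting a date-partitioned Delta table and issuing partition-pruned queries keeps both aggregation and lookup queries scanning only the relevant files; the table, partition, and column names are hypothetical.

```python
# Compact small files in one partition of a date-partitioned Delta table.
spark.sql("OPTIMIZE security.waf_events WHERE event_date = '2024-06-01'")

# The date predicate prunes to a single partition before the engine scans the data.
top_ips = spark.sql("""
  SELECT client_ip, count(*) AS hits
  FROM security.waf_events
  WHERE event_date = '2024-06-01' AND host = 'www.example.com'
  GROUP BY client_ip
  ORDER BY hits DESC
  LIMIT 20
""")
top_ips.show()
```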

Talk by: Tomer Patel and Itai Yaffe

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc