Topic: Delta Lake
Tags: data_lake, acid_transactions, time_travel, file_format, storage
Activity trend: peak of 117 activities per quarter, 2020-Q1 to 2026-Q1

Activities

347 activities · Newest first

High Volume Intelligent Streaming with Sub-Minute SLA for Near Real-Time Data Replication

Attend this session and learn about an innovative solution built around Databricks Structured Streaming and Delta Live Tables (DLT) to replicate thousands of tables from on-premises to cloud-based relational databases. Replicating on-premises data to cloud-based data lakes and data stores in near real time for consumption is a highly desirable pattern for enterprises across industries.

This powerful architecture can offload legacy platform workloads and accelerate the cloud journey. The intelligent, cost-efficient solution leverages thread pools, multi-task jobs, Kafka, Apache Spark™ Structured Streaming, and DLT. This session will go into detail about problems, solutions, lessons learned, and best practices.
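
The abstract stops short of code, but the core of the pattern it describes, consuming a CDC topic from Kafka with Structured Streaming and landing it in a Delta table on a sub-minute trigger, might look like the following PySpark sketch (broker, topic, and path names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cdc-replication").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders_cdc")          # hypothetical: one topic per source table
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast before parsing downstream.
decoded = raw.select(col("key").cast("string"), col("value").cast("string"))

(
    decoded.writeStream.format("delta")
    .option("checkpointLocation", "/chk/orders_cdc")  # required for exactly-once progress tracking
    .trigger(processingTime="30 seconds")             # sub-minute micro-batches
    .start("/delta/bronze/orders")
)
```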

Talk by: Suneel Konidala and Murali Madireddi

Introducing Universal Format: Iceberg and Hudi Support in Delta Lake

In this session, we will talk about how Delta Lake plans to integrate with Iceberg and Hudi. Customers are being forced to choose storage formats based on the tools that support them rather than choosing the most performant and functional format for their lakehouse architecture. With Universal Format (“UniForm”), Delta removes the need to make this compromise and makes Delta tables compatible with Iceberg and Hudi query engines. We will do a technical deep dive into the technology, demo it, and discuss the roadmap.
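
For orientation, enabling UniForm is a table-property change; a minimal sketch, assuming the property names from the Delta Lake 3.x documentation and a hypothetical table name:

```python
# Create a Delta table that also maintains Iceberg metadata for Iceberg clients.
# Property names are from the Delta Lake 3.x UniForm docs; table name is hypothetical.
spark.sql("""
  CREATE TABLE main.sales.orders (id BIGINT, amount DOUBLE)
  TBLPROPERTIES (
    'delta.universalFormat.enabledFormats' = 'iceberg',
    'delta.enableIcebergCompatV2' = 'true'
  )
""")
```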

Talk by: Himanshu Raja and Ryan Johnson

Journey Towards Uniting Metastores

This talk will provide a brief overview of Nationwide’s journey towards implementing Unity Catalog at an enterprise level. We will cover the following topics:

  • Identity management structure
  • Compute framework
  • Naming standards and usage best practices
  • A little bit about how Delta Sharing will help us ingest third-party data

Unity Catalog has been a core feature in strengthening the Lakehouse architecture for multiple business units.
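
As one concrete flavor of the naming standards and governance topics the talk covers, a minimal Unity Catalog sketch (the catalog, schema, and group names are hypothetical, not Nationwide's actual standards):

```python
# Three-level namespace (catalog.schema.table) plus a typical grant pattern.
spark.sql("CREATE CATALOG IF NOT EXISTS claims_dev")
spark.sql("CREATE SCHEMA IF NOT EXISTS claims_dev.bronze")

# Group-based grants: engineers get broad access, analysts read-only on one schema.
spark.sql("GRANT USE CATALOG ON CATALOG claims_dev TO `data-engineers`")
spark.sql("GRANT USE CATALOG ON CATALOG claims_dev TO `analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA claims_dev.bronze TO `analysts`")
```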

Speaker: Ananya Ghosh

Monetizing Data Assets: Sharing Data, Models and Features

Data is an asset. Selling and sharing data has largely been solved, and hosted models exist (for example, ChatGPT), but moving sensitive data across the public internet or across clouds is problematic. Sharing features (the results of feature engineering) can be monetized for new potential revenue streams. Sharing models can also be monetized while avoiding the transfer of sensitive data.

This session will walk through a few examples of how to share models and features to generate new revenue streams using Delta Sharing, MLflow, and Databricks.
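
On the consumer side, reading a shared feature table takes only a few lines with the open source delta-sharing Python connector; a minimal sketch, with a hypothetical profile file and share coordinates:

```python
import delta_sharing

# The provider issues a credentials file; share#schema.table addresses one asset.
profile = "/path/to/provider.share"                       # hypothetical path
url = profile + "#sales_share.features.customer_features" # hypothetical coordinates

# Load the shared feature table directly into pandas, without copying data
# into our own cloud account first.
features = delta_sharing.load_as_pandas(url)
print(features.head())
```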

Talk by: Keith Anderson and Avinash Sooriyarachchi

Sponsored: Ascend.io | Publish a Data Mesh Product in Under 10 Minutes w/ Delta Sharing & Ascend

Learn how to quickly ingest, transform and share data in Delta Lake with intelligent data pipelines on Ascend. Using live data, we'll cover everything you need to know to get your first data products up and running fast. We'll talk about first principles for building a scalable mesh and tips for reducing maintenance work as you grow. And you'll see how Ascend applies patented fingerprinting technology to manage change across your interconnected pipelines as you build out the mesh.

Talk by: Jon Osborn

Delta Kernel: Simplifying Building Connectors for Delta

Since the release of Delta 2.0, the project has been growing at breakneck speed. In this session, we will cover all the latest capabilities that make Delta Lake the best format for the lakehouse. Based on lessons learned from this past year, we will introduce Project Aqueduct and show how it will simplify building Delta Lake connectors, from Rust and Go to Trino, Flink, and PySpark.

Talk by: Tathagata Das and Denny Lee

Writing Data-Sharing Apps Using Node.js and Delta Sharing

JavaScript remains the top programming language today, with more code repositories on GitHub written in JavaScript than in any other language. However, JavaScript is evolving beyond web application development into a language built for tomorrow. Everyday tasks like data wrangling, data analysis, and predictive analytics are possible today directly from a web browser. For example, many popular data analytics frameworks now ship JavaScript SDKs, such as TensorFlow.js.

Another popular library, Danfo.js, makes it possible to wrangle data using familiar pandas-like operations, shortening the learning curve and arming the typical data engineer or data scientist with another data tool in their toolbox. In this presentation, we’ll explore using the Node.js connector for Delta Sharing to build a data analytics app that summarizes a Twitter dataset.

Talk by: Will Girten

Apache Spark™ Streaming and Delta Live Tables Accelerate KPMG Clients for Real-Time IoT Insights

Unplanned downtime in manufacturing costs firms up to a trillion dollars annually. Time that materials spend sitting on a production line is lost revenue, and even just 15 hours of downtime a week adds up to nearly 800 hours of downtime yearly. Internet of Things (IoT) devices can cut this time down by providing detailed machine metrics. However, IoT predictive maintenance is challenged by the lack of effective, scalable infrastructure and machine learning solutions. IoT data can amount to multiple terabytes per day and can come in a variety of formats, and without insights and analysis, this data becomes just another table.

The KPMG Databricks IoT Accelerator is a comprehensive solution that gives manufacturing plant operators a bird’s-eye view of their machines’ health and empowers proactive machine maintenance across their portfolio of IoT devices. The accelerator ingests IoT streaming data at scale and implements the Databricks medallion architecture, leveraging Delta Live Tables to clean and process data. Real-time machine learning models are developed from IoT machine measurements and managed in MLflow. The AI predictions and IoT device readings are compiled in the gold table, powering downstream dashboards in tools like Tableau. Dashboards inform machine operators not only of machines’ ailments but of the actions they can take to mitigate issues before they arise. Operators can see fault history to aid in understanding failure trends, and can filter dashboards by fault type, machine, or specific sensor reading.
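
A minimal Delta Live Tables sketch of the bronze-to-silver hop described above, with hypothetical paths, column names, and quality rules (the `spark` handle is provided by the DLT runtime):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw IoT readings landed from the stream")
def iot_bronze():
    # Auto Loader incrementally picks up new files from the landing zone.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/iot")              # hypothetical landing path
    )

@dlt.table(comment="Cleaned readings ready for feature engineering")
@dlt.expect_or_drop("valid_reading", "sensor_value IS NOT NULL")  # quality rule
def iot_silver():
    return dlt.read_stream("iot_bronze").where(col("machine_id").isNotNull())
```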

Talk by: MacGregor Winegard

Sponsored: KPMG | Multicloud Enterprise Delta Sharing & Governance using Unity Catalog @ S&P Global

Cloud technologies have revolutionized global data access across a number of industries. However, many enterprise organizations face challenges in adopting these technologies effectively, as comprehensive cloud data governance strategies and solutions are complex and evolving – particularly in hybrid or multicloud scenarios involving multiple third parties. KPMG and S&P Global have harnessed the power of Databricks Lakehouse to create a novel approach.

By integrating Unity Catalog, Delta Sharing, and the KPMG Modern Data Platform, S&P Global has enabled scalable, transformative cross-enterprise data sharing and governance. This demonstration highlights a collaboration between S&P Global Sustainable1 (S1) ESG program and the KPMG ESG Analytics Accelerators to enable large-scale SFDR ESG portfolio analytics. Join us to discover our solution that drives transformative change, fosters data-driven decision-making, and bolsters sustainability efforts in a wide range of industries.

Talk by: Niels Hanson and Dennis Tally

Vector Data Lakes

Vector databases such as Elasticsearch and Pinecone offer fast ingestion and querying of vector embeddings with approximate nearest neighbor (ANN) indexes. However, they typically do not decouple compute and storage, making them hard to integrate into production data stacks. Because data storage in these databases is expensive and not easily accessible, data teams typically maintain ETL pipelines to offload historical embedding data to blob stores. When that data needs to be queried, it is loaded back into the vector database in another ETL process. This is reminiscent of loading data from an OLTP database to cloud storage, then loading that data into an OLAP warehouse for offline analytics.

Recently, “lakehouse” offerings have begun to allow direct OLAP querying on cloud storage, removing the need for the second ETL step. The same can be done for embedding data. While embedding storage in blob stores cannot satisfy the high-TPS requirements of online settings, we argue it is sufficient for offline analytics use cases like slicing and dicing data based on embedding clusters. Instead of loading the embedding data back into the vector database for offline analytics, we propose direct processing on embeddings stored in Parquet files in Delta Lake. You will see that offline embedding workloads typically touch a large portion of the stored embeddings without needing random access.

As a result, the workload is bound entirely by network throughput rather than latency, making it well suited to blob storage backends. On a test dataset of one billion vectors, ETL into cloud storage takes around one hour on a dedicated GPU instance, while batched nearest neighbor search can be done in under one minute with four CPU instances. We believe future “lakehouses” will ship with native support for these embedding workloads.
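
A rough illustration of the throughput-bound scan the talk argues for: brute-force batched top-k search over embeddings in a Parquet file, using NumPy and PyArrow (file layout, column names, and dimensions are all hypothetical):

```python
import numpy as np
import pyarrow.parquet as pq

# Normalized query vector; rows are assumed pre-normalized so dot product = cosine.
rng = np.random.default_rng(0)
query = rng.standard_normal(768).astype(np.float32)
query /= np.linalg.norm(query)

scores_all, ids_all = [], []
for batch in pq.ParquetFile("embeddings.parquet").iter_batches(batch_size=65_536):
    df = batch.to_pandas()
    vecs = np.stack(df["embedding"].to_numpy())   # (batch, 768)
    scores = vecs @ query                         # similarity per row
    top = np.argsort(-scores)[:100]               # local top-100 within the batch
    scores_all.append(scores[top])
    ids_all.append(df["id"].to_numpy()[top])

# Merge the per-batch candidates into the global top-100.
scores_all = np.concatenate(scores_all)
ids_all = np.concatenate(ids_all)
print(ids_all[np.argsort(-scores_all)[:100]])
```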

Talk by: Tony Wang and Chang She

Delta Lake AMA

Have some great questions about Delta Lake? Well, come by and ask the experts your questions!

Talk by: Bart Samwel, Tathagata Das, Robert Pack, and Allison Portis

How We Made a Unified Talent Solution Using Databricks Machine Learning, Fine-Tuned LLM & Dolly 2.0

Using Databricks, we built a “Unified Talent Solution” backed by a robust data and AI engine that analyzes the skills of a combined pool of permanent employees, contractors, part-time employees, and vendors, and infers skill gaps, future trends, and recommended priority areas for bridging talent gaps. This ultimately improved the operational efficiency, transparency, commercial model, and talent experience of our client. We leveraged a variety of ML algorithms, such as boosting, neural networks, and NLP transformers, to provide better AI-driven insights.

One inevitable part of developing these models within a typical data science workflow is iteration. MLflow, Databricks’ end-to-end ML workflow service, helped streamline this process by organizing iterations into experiments that tracked the data used for training and testing, model artifacts, lineage, and the corresponding results and metrics. MLflow’s deployment and monitoring services were leveraged extensively to check the health of our models using drift detection, bias, and explainability techniques.

Our solution, built on the Databricks platform, simplified ML by defining a data-centric workflow that unified best practices from DevOps, DataOps, and ModelOps. Databricks Feature Store allowed us to productionize our models and features jointly. Insights were delivered through visually appealing charts and graphs, built with Power BI, Plotly, and Matplotlib, that answer the business questions most relevant to clients. We also built our own advanced custom analytics platform on top of Delta Lake, since Delta’s ACID guarantees allow us to build a real-time reporting app that displays consistent and reliable data: React for the front end, and Structured Streaming for ingesting data from Delta tables, with live query analytics on real-time ML predictions.
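
A minimal sketch of the MLflow experiment-tracking loop described above; the experiment path and toy model below are stand-ins, not the talk's actual models:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_experiment("/Shared/talent-skill-models")   # hypothetical experiment path

# Toy data standing in for the skills dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("model_type", "logistic_regression")
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")   # artifact + lineage for later deployment
```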

Talk by: Nitu Nivedita

Sponsored by: Avanade | Enabling Real-Time Analytics with Structured Streaming and Delta Live Tables

Join the panel to hear how Avanade is helping clients enable real-time analytics and tackle the people and process problems that accompany technology, powered by Azure Databricks.

Talk by: Thomas Kim, Dael Williamson, and Zoé Durand

Unity Catalog, Delta Sharing and Data Mesh on Databricks Lakehouse

In this technical deep dive, we will detail how customers implemented a data mesh on Databricks and how standardizing on the Delta format enabled Delta-to-Delta sharing to non-Databricks consumers. We will cover:

  • Current state of the IT landscape
  • Data silos (problems with organizations not having connected data in the ecosystem)
  • A look back on why we moved away from data warehouses and chose the cloud in the first place
  • What caused the data chaos in the cloud (instrumentation and too much stitching together of a periodic-table-sized list of cloud services)
  • How to strike the balance between autonomy and centralization
  • Why Databricks Unity Catalog puts you on the right path to implementing a data mesh strategy
  • The processes and features that enable an end-to-end implementation of a data strategy
  • How customers successfully implemented a data mesh with out-of-the-box Unity Catalog and Delta Sharing, without overwhelming their IT tool stack
  • Use cases
  • Delta-to-Delta data sharing
  • Delta-to-others data sharing
  • How to navigate when data is available across regions, across clouds, on-prem, and in external systems
  • Change data feed to share only “data that has changed” (see the sketch after this list)
  • Data stewardship
  • Why ABAC is important
  • How file-based access policies and governance play an important role
  • Future state and its pitfalls
  • Egress costs
  • Data compliance
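
The change data feed item above, in sketch form: enable CDF on a table, then read only the rows that changed (the table name and starting version are hypothetical):

```python
# Turn on the change data feed so each commit records row-level changes.
spark.sql("""
  ALTER TABLE main.sales.orders
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only changes since a given table version, instead of re-sharing everything.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 42)          # hypothetical version to read from
    .table("main.sales.orders")
)
changes.filter("_change_type IN ('insert', 'update_postimage')").show()
```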

Talk by: Surya Turaga and Thomas Roach

What’s New with Data Sharing and Collaboration on the Lakehouse: From Delta Sharing to Clean Rooms

Get ready to accelerate your data and AI collaboration game with the Databricks product team. Join us as we build the next generation of secure data collaboration capabilities on the lakehouse. Whether you're just starting your data sharing journey or exploring advanced data collaboration features like data cleanrooms, this session is tailor-made for you.

In this demo-packed session, you'll discover what’s new in Delta Sharing, including dynamic and materialized views for sharing, sharing of other assets such as notebooks and ML models, new Delta Sharing open source connectors for the tools of your choice, and updates to Databricks Clean Rooms. Learn how the lakehouse is the perfect solution for your data and AI collaboration requirements across clouds, regions, and platforms, without any vendor lock-in. Plus, you'll get a peek into our upcoming roadmap. Ask our expert product team any burning questions you have as they build a collaborative lakehouse for data, analytics, and AI.
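
For context, provider-side Delta Sharing setup on Databricks is a handful of SQL statements; a minimal sketch with hypothetical share, table, and recipient names:

```python
# Create a share, add an asset to it, and grant a recipient read access.
spark.sql("CREATE SHARE IF NOT EXISTS quarterly_metrics")
spark.sql("ALTER SHARE quarterly_metrics ADD TABLE main.finance.kpis")

# Open sharing: creating the recipient generates an activation link for the
# credentials file the consumer uses with the open connectors.
spark.sql("CREATE RECIPIENT IF NOT EXISTS partner_co")
spark.sql("GRANT SELECT ON SHARE quarterly_metrics TO RECIPIENT partner_co")
```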

Talk by: Erika Ehrli, Kelly Albano, and Xiaotong Sun

Taking Control of Streaming Healthcare Data

Chesapeake Regional Information System for our Patients (CRISP), a nonprofit healthcare information exchange (HIE), initially partnered with Slalom to build a Databricks data lakehouse architecture in response to the analytics demands of the COVID-19 pandemic; since then, they have expanded the platform to additional use cases. Recently, they have worked together to engineer streaming data pipelines that process healthcare messages, such as HL7, to help CRISP become vendor independent.

This session will focus on the improvements CRISP has made to their data lakehouse platform to support streaming use cases and the impact these changes have had for the organization. We will touch on using Databricks Auto Loader to efficiently ingest incoming files, ensuring data quality with Delta Live Tables, and sharing data internally with a SQL warehouse, as well as some of the work CRISP has done to parse and standardize HL7 messages from hundreds of sources. These efforts have allowed CRISP to stream over 4 million messages daily in near real time, with the scalability needed to onboard new healthcare providers so it can continue to facilitate care and improve health outcomes.
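
A minimal sketch of the Auto Loader ingestion step described above, with hypothetical paths and table names (HL7 parsing itself is out of scope here):

```python
# Auto Loader incrementally discovers new message files in the landing zone.
hl7_raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "text")               # HL7 v2 arrives as delimited text
    .option("cloudFiles.schemaLocation", "/chk/hl7/schema")
    .load("/mnt/landing/hl7/")                         # hypothetical landing path
)

(
    hl7_raw.writeStream
    .option("checkpointLocation", "/chk/hl7/bronze")
    .trigger(availableNow=True)                        # or a short trigger for near real-time
    .toTable("hie.bronze.hl7_messages")                # hypothetical bronze table
)
```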

Talk by: Andy Hanks and Chris Mantz

Databricks as Code: Effectively Automate a Secure Lakehouse Using Terraform for Resource Provisioning

At Rivian, we have automated more than 95% of our Databricks resource provisioning workflows using an in-house Terraform module, affording us a lean admin team to manage over 750 users. In this session, we will cover the following elements of our approach and how others can benefit from improved team efficiency.

  • User and service principal management (see the sketch after this list)
  • Our permission model on Unity Catalog for data governance
  • Workspace and secrets resource management
  • Managing internal package dependencies using init scripts
  • Facilitating dashboards, SQL queries, and their associated permissions
  • Scaling source-of-truth, petabyte-scale Delta Lake table ingestion jobs and workflows
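
The talk's module is written in Terraform HCL; purely as an illustration in the Python used elsewhere on this page, the first list item's provisioning calls look roughly like this with the Databricks Python SDK (resource names are hypothetical):

```python
from databricks.sdk import WorkspaceClient

# Auth is resolved from the environment or ~/.databrickscfg.
w = WorkspaceClient()

# Create a service principal for automated jobs and a group to scope permissions.
sp = w.service_principals.create(display_name="etl-runner")    # hypothetical name
grp = w.groups.create(display_name="data-platform")            # hypothetical name
print(sp.id, grp.id)
```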

Talk by: Jason Shiverick and Vadivel Selvaraj

MLOps at Gucci: From Zero to Hero

Delta Lake is an open-source storage format ideally suited to the large-scale datasets used for single-node and distributed training of deep learning models. The format gives deep learning practitioners unique data management capabilities for working with their datasets. The challenge is that, as of now, it’s not possible to train PyTorch models directly from Delta Lake.

The PyTorch community has recently introduced the TorchData library for efficient data loading. This library supports many formats out of the box, but not Delta Lake. This talk will demonstrate using the Delta Lake storage format for single-node and distributed PyTorch training, using the TorchData framework and the standalone delta-rs Delta Lake implementation.
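
A minimal sketch of the idea: read a Delta table through the delta-rs Python bindings and wrap it as a PyTorch iterable dataset. Path and column names are hypothetical, and the talk itself uses TorchData datapipes rather than this plain IterableDataset:

```python
import torch
from deltalake import DeltaTable
from torch.utils.data import DataLoader, IterableDataset

class DeltaIterable(IterableDataset):
    def __init__(self, path: str):
        # delta-rs reads the Delta transaction log without Spark and exposes
        # the table as an Arrow dataset.
        self.dataset = DeltaTable(path).to_pyarrow_dataset()

    def __iter__(self):
        for batch in self.dataset.to_batches():
            rows = batch.to_pydict()
            for x, y in zip(rows["features"], rows["label"]):  # hypothetical columns
                yield torch.tensor(x), torch.tensor(y)

loader = DataLoader(DeltaIterable("/delta/train"), batch_size=32)
```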

Talk by: Michael Shtelma

How to Build Metadata-Driven Data Pipelines with Delta Live Tables

In this session, you will learn how you can use metaprogramming to automate the creation and management of Delta Live Tables pipelines at scale. The goal is to make it easy to use DLT for large-scale migrations, and other use cases that require ingesting and managing hundreds or thousands of tables, using generic code components and configuration-driven pipelines that can be dynamically reused across different projects or datasets.
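
The metaprogramming pattern the session describes boils down to stamping out one generic table function per configuration entry; a minimal sketch with hypothetical sources (the `spark` handle is provided by the DLT runtime):

```python
import dlt

# Configuration-driven pipeline: the same generic code serves every table.
TABLES = [
    {"name": "customers", "path": "/mnt/raw/customers"},   # hypothetical sources
    {"name": "orders",    "path": "/mnt/raw/orders"},
]

def make_table(cfg):
    # Each call registers one DLT table; the closure captures cfg, avoiding
    # Python's late-binding pitfall when defining functions in a loop.
    @dlt.table(name=f"bronze_{cfg['name']}")
    def _bronze():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .load(cfg["path"])
        )

for cfg in TABLES:
    make_table(cfg)
```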

Talk by: Mojgan Mazouchi and Ravi Gawai

Building & Managing a Data Platform for a Delta Lake Exceeding 13PB & 1000s of Users: AT&T's Story

Data runs AT&T’s business, just like it runs most businesses these days. Data can lead to a greater understanding of a business, and when translated correctly into information it can provide human and business systems with valuable insights for making better decisions. Unique to AT&T are the volume of data we support, how much of our work is driven by AI, and the scale at which data and AI drive value for our customers and stakeholders.

Our cloud migration journey includes making data and AI more accessible to employees throughout AT&T so they can use their deep business expertise to leverage data more easily and rapidly. We always had to balance this data democratization and desire for speed with keeping our data private and secure. We loved the open ecosystem model of the Lakehouse, which enables data, BI, and ML tools to be seamlessly integrated in a single pane of glass; it simplifies the architecture and reduces dependencies between technologies in the cloud. Being clear in our architecture guidelines and patterns was very important to our success.

We are seeing more interest from our business unit partners and are continuing to build out our AI capability, AI as a service, to support more citizen data scientists. To scale up our Lakehouse journey, we built a Databricks center of excellence (CoE) at AT&T, which today has more than 1,400 active members, further concentrating existing ML/AI expertise and resources to collaborate on all things Databricks, including technical support, training, FAQs, and best practices, to attain and sustain world-class performance and drive business value for AT&T. Join us to learn more about how we process and manage over 10 petabytes of network data in our Lakehouse with Delta Lake and Databricks.