talk-data.com

Topic: Delta Lake
Tags: data_lake · acid_transactions · time_travel · file_format · storage
347 activities tagged

Activity Trend: 117 peak/qtr (2020-Q1 to 2026-Q1)

Activities

347 activities · Newest first

Enabling BI in a Lakehouse Environment: How Spark and Delta Can Help With Automating a DWH Develop

Traditional data warehouses typically struggle to handle large volumes of data and traffic, particularly unstructured data. In contrast, data lakes overcome these issues and have become the central hub for storing data. We outline how to enable Kimball BI data modelling in a Lakehouse environment.

We present how we built a Spark-based framework to modernize DWH development with a data lake as central storage, ensuring high data quality and scalability. The framework has been implemented at over 15 enterprise data warehouses across Europe.

We present how one can implement classic data warehouse concepts, such as surrogate, foreign, and business keys and SCD types 1 and 2, in Spark with Delta Lake. Additionally, we share our experiences of how such a unified data modelling framework can bridge BI with modern-day use cases such as machine learning and real-time analytics. The session outlines the original challenges, the steps taken, and the technical hurdles we faced.
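In Delta Lake, SCD Type 2 is typically expressed as a MERGE statement; the row-versioning logic behind it can be pictured in plain Python. This is an illustrative sketch, not the framework's API; all names here (`scd2_merge`, `attr`) are hypothetical:

```python
from datetime import date

def scd2_merge(dim_rows, updates, today):
    """SCD Type 2 sketch: when a tracked attribute changes for a
    business key, close the current row and append a new current row.
    Rows are dicts with 'key', 'attr', 'start_date', 'end_date'."""
    current = {r["key"]: r for r in dim_rows if r["end_date"] is None}
    out = list(dim_rows)
    for u in updates:
        old = current.get(u["key"])
        if old is not None and old["attr"] != u["attr"]:
            old["end_date"] = today          # expire the old version
        if old is None or old["attr"] != u["attr"]:
            out.append({"key": u["key"], "attr": u["attr"],
                        "start_date": today, "end_date": None})
    return out

dim = [{"key": 1, "attr": "Berlin",
        "start_date": date(2020, 1, 1), "end_date": None}]
dim = scd2_merge(dim, [{"key": 1, "attr": "Munich"}], date(2022, 6, 1))
# the Berlin row is now closed and a new current Munich row exists
```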

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Ensuring Correct Distributed Writes to Delta Lake in Rust with Formal Verification

Rust guarantees the absence of memory-access bugs once a program compiles. However, one can still introduce logical bugs in an implementation.

In this talk, I will first give a high-level overview of common formal verification methods used in distributed system designs and implementations. Then I will talk about our experiences using TLA+ and Stateright to formally model delta-rs' multi-writer S3 backend implementation. By combining Rust and formal verification, we end up with an efficient native Delta Lake implementation that is both memory safe and free of logical bugs!
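In the spirit of Stateright-style exhaustive model checking (a toy illustration, not the actual delta-rs model), one can brute-force every interleaving of two writers racing for the same log entry and assert that a put-if-absent primitive admits exactly one winner:

```python
from itertools import permutations

def run(schedule):
    """Replay one interleaving of writers racing to commit log entry
    v1 against a store that offers put-if-absent; returns the winners."""
    store, winners = {}, set()
    for writer in schedule:
        # put-if-absent: the commit succeeds only if the key is absent
        if "v1" not in store:
            store["v1"] = writer
            winners.add(writer)
    return winners

# exhaustively explore every interleaving of the two commit attempts
for sched in permutations(["w1", "w2"]):
    assert len(run(sched)) == 1   # exactly one writer wins version 1
```

A real model also interleaves the read and write steps of each writer, which is exactly where a naive last-writer-wins PUT loses commits.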


Moving to the Lakehouse: Fast & Efficient Ingestion with Auto Loader

Auto Loader, the most popular tool for incremental data ingestion from cloud storage into Databricks' Lakehouse, is used in our biggest customers' ingestion workflows. Auto Loader is our all-in-one solution for exactly-once processing, offering efficient file discovery, schema inference and evolution, and fault tolerance.

In this talk, we delve into key features of Auto Loader, including:
• Avro schema inference
• Rescued column
• Semi-structured data support
• Incremental listing
• Asynchronous backfilling
• Native listing
• File-level tracking and observability
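Incremental listing with exactly-once file discovery can be pictured as a persisted checkpoint of already-seen files: each run lists the source, processes only unseen files, and records them. The sketch below is a plain-Python illustration of that idea, not Auto Loader's implementation:

```python
import json
import pathlib
import tempfile

def discover_new_files(source_dir, checkpoint):
    """Exactly-once discovery sketch: return files not yet recorded in
    the checkpoint, then persist the updated checkpoint so reruns skip
    already-processed files."""
    seen = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    new = sorted(str(p) for p in source_dir.iterdir() if str(p) not in seen)
    checkpoint.write_text(json.dumps(sorted(seen | set(new))))
    return new

root = pathlib.Path(tempfile.mkdtemp())
src, ckpt = root / "in", root / "checkpoint.json"
src.mkdir()
(src / "a.json").write_text("{}")
first = discover_new_files(src, ckpt)    # picks up a.json
(src / "b.json").write_text("{}")
second = discover_new_files(src, ckpt)   # only b.json; a.json is skipped
```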

Auto Loader is also used in other Databricks features such as Delta Live Tables. We will discuss the architecture, provide a demo, and feature an Auto Loader customer speaking about their experience migrating to Auto Loader.


DBA Perspective—Optimizing Performance Table-by-Table

As a DBA for your organization's Lakehouse, it's your job to stay on top of performance and cost optimization techniques. We will discuss how to use the available Delta Lake tools to tune your jobs and optimize your tables.


Delta Lake 2.0 Overview

After three years of hard work by the Delta community, we are proud to announce the release of Delta Lake 2.0. Completing the work to open-source all of Delta Lake while tens of thousands of organizations were running it in production was no small feat, and we have the ever-expanding Delta community to thank!

Join this session to learn how the wider Delta community collaborated to bring these features and integrations together. This includes integrations with Apache Spark™, Apache Flink, Apache Pulsar, Presto, Trino, and more.

Features such as OPTIMIZE ZORDER, data skipping using column stats, S3 multi-cluster writes, Change Data Feed, and more.
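Data skipping with column stats works by pruning files whose min/max range cannot satisfy the query predicate, so clustering related values together (as OPTIMIZE ZORDER does) makes those ranges tight. A minimal sketch with hypothetical per-file stats:

```python
def files_to_scan(file_stats, lo, hi):
    """Data-skipping sketch: each file carries min/max stats for a
    column; a file is skipped when its [min, max] range cannot overlap
    the predicate range [lo, hi]."""
    return [f["path"] for f in file_stats
            if f["max"] >= lo and f["min"] <= hi]

stats = [
    {"path": "part-0", "min": 0,   "max": 99},
    {"path": "part-1", "min": 100, "max": 199},
    {"path": "part-2", "min": 200, "max": 299},
]
# predicate: WHERE col BETWEEN 120 AND 150 -> only part-1 is read
assert files_to_scan(stats, 120, 150) == ["part-1"]
```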

Language APIs including Rust, Python, Ruby, GoLang, Scala, and Java.


Delta Live Tables: Modern Software Engineering and Management for ETL

Data engineers have the difficult task of cleansing complex, diverse data and transforming it into a usable source to drive data analytics, data science, and machine learning. They need to know the data infrastructure platform in depth, build complex queries in various languages, and stitch them together for production. Join this talk to learn how Delta Live Tables (DLT) simplifies the complexity of data transformation and ETL. DLT is the first ETL framework to use modern software engineering practices to deliver reliable and trusted data pipelines at any scale. Discover how:
• analysts and data engineers can innovate rapidly with simple pipeline development and maintenance
• operational complexity can be removed by automating administrative tasks and gaining visibility into pipeline operations
• built-in quality controls and monitoring ensure accurate BI, data science, and ML
• simplified batch and streaming can be implemented with self-optimizing and auto-scaling data pipelines
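The built-in quality controls can be pictured as declarative predicates attached to a table definition: failing rows are dropped and counted, so quality is observable per pipeline run. The decorator below is a toy plain-Python analogue, deliberately not the real DLT API:

```python
metrics = {}   # dropped-row counts by expectation name

def expect(name, predicate):
    """Toy analogue of a DLT expectation: drop rows that fail the
    predicate and record how many were dropped."""
    def wrap(table_fn):
        def run():
            rows = table_fn()
            kept = [r for r in rows if predicate(r)]
            metrics[name] = len(rows) - len(kept)
            return kept
        return run
    return wrap

@expect("valid_amount", lambda r: r["amount"] >= 0)
def orders():
    # stand-in for reading an upstream table
    return [{"amount": 10}, {"amount": -5}, {"amount": 7}]

clean = orders()
# clean keeps 2 rows; metrics["valid_amount"] counts the 1 dropped row
```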


Discover Data Lakehouse With End-to-End Lineage

Data lineage is key for managing change, ensuring data quality, and implementing data governance in an organization. There are a few use cases for data lineage:

Data Governance: For compliance and regulatory purposes, our customers are required to prove that the data and reports they submit came from a trusted and verified source. This typically means identifying the tables and data sets used in a report or dashboard and tracing the source of these tables and fields. Another governance use case is understanding the spread of sensitive data within the lakehouse.

Data Discovery: Data analysts looking to self-serve and build their own analytics and models typically spend time exploring and understanding the data in their lakehouse. Lineage is a key piece of information that enhances the understanding and trustworthiness of the data the analyst plans to use.

Problem Identification: Data teams are often called on to resolve errors in analysts' dashboards and reports ("Why is the total number of widgets different in this report than the one I have built?"). This usually leads to an expensive forensic exercise by the DE team to understand the sources of the data and the transformations applied to it before it hits the report.

Change Management: It is not uncommon for data sources to change; a new source may stop delivering data, or a field in the source system may change its semantics. In this scenario the DE team would like to understand the downstream impact of the change: how many datasets and users will be affected. This helps them determine the impact, manage user expectations, and address issues ahead of time.

In this talk, we will discuss in detail how we capture table and column lineage for Spark/Delta and Unity Catalog for our customers, and how users can leverage data lineage to serve the use cases mentioned above.
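The change-management use case boils down to a reachability query over the lineage graph: everything downstream of a changed source is potentially affected. A minimal sketch with hypothetical table names:

```python
from collections import deque

def downstream(edges, table):
    """Impact-analysis sketch: edges map each table to the tables that
    read from it; BFS returns everything reachable downstream."""
    seen, queue = set(), deque([table])
    while queue:
        t = queue.popleft()
        for child in edges.get(t, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

lineage = {
    "raw.events": ["silver.sessions"],
    "silver.sessions": ["gold.kpis", "gold.funnel"],
}
# a schema change in raw.events affects all three downstream tables
assert downstream(lineage, "raw.events") == {
    "silver.sessions", "gold.kpis", "gold.funnel"}
```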


Enabling Business Users to Perform Interactive Ad-Hoc Analysis over Delta Lake with No Code

In this talk, we'll first introduce Sigma Workbooks along with its technical design motivations and architectural details. Sigma Workbooks is an interactive visual data analytics system that enables business users to easily perform complex ad-hoc analysis over data in cloud data warehouses (CDWs). We'll then demonstrate the expressivity, scalability, and ease-of-use of Sigma Workbooks through real-life use cases over datasets stored in Delta Lake. We’ll conclude the talk by sharing the lessons that we have learned throughout the design and implementation iterations of Sigma Workbooks.


Evolution of Data Architectures and How to Build a Lakehouse

Data architectures are a key part of the larger picture of building robust analytical and AI applications. One must take a holistic view of the entire data analytics realm when planning data science initiatives.

Through this talk, learn about the evolution of the data landscape and why Lakehouses are becoming the de facto standard for organizations building scalable data architectures. A lakehouse architecture combines the data management capabilities of the data warehouse, including reliability, integrity, and quality, with the low cost and open approach of data lakes, and supports all data workloads, including BI and AI.

Data practitioners will also learn some core concepts of building an efficient Lakehouse with Delta Lake.


How AARP Services, Inc. automated SAS transformation to Databricks using LeapLogic

While SAS has been a standard for analytics and data science use cases, it is not cloud-native and does not scale well. Join us to learn how AARP automated the conversion of hundreds of complex data processing, model scoring, and campaign workloads to Databricks using LeapLogic, an intelligent code transformation accelerator that can migrate legacy ETL, analytics, data warehouse, and Hadoop workloads to modern data platforms.

In this session, experts from AARP and Impetus will share how they collaborated with Databricks and were able to:
• Automate modernization of SAS marketing analytics based on coding best practices
• Establish a rich library of Spark and Python equivalent functions on Databricks with the same capabilities as SAS procedures, DATA step operations, macros, and functions
• Leverage Databricks-native services like Delta Live Tables to implement waterfall techniques for campaign execution and simplify pipeline monitoring


Welcome & Destination Lakehouse | Ali Ghodsi | Keynote | Data + AI Summit 2022

Join the Day 1 keynote to hear from Databricks co-founders, and original creators of Apache Spark and Delta Lake, Ali Ghodsi, Matei Zaharia, and Reynold Xin on how Databricks and the open source community are taking on the biggest challenges in data. The talks will address the latest updates on the Apache Spark and Delta Lake projects, the evolution of data lakehouse architecture, and how companies like Adobe and Amgen are using lakehouse architecture to advance their data goals.


Deep-Dive into Delta Lake

Delta Lake is becoming a de facto standard for storing large amounts of data for analytical purposes in a data lake. But what is behind it? How does it work under the hood? In this session we will dive deep into the internals of Delta Lake by unpacking the transaction log, and we will highlight some common pitfalls when working with Delta Lake (and show how to avoid them).
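At its core, the transaction log is an ordered sequence of JSON actions that is replayed to derive the table's live file set. A simplified sketch of that replay (the real log carries more action types, such as metadata, protocol, and commit info):

```python
import json

def replay(log_lines):
    """Delta-style log replay sketch: 'add' registers a data file,
    'remove' tombstones it; the live table state is the surviving
    set of files."""
    live = set()
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            live.add(action["add"]["path"])
        elif "remove" in action:
            live.discard(action["remove"]["path"])
    return live

log = [
    '{"add": {"path": "part-0.parquet"}}',
    '{"add": {"path": "part-1.parquet"}}',
    '{"remove": {"path": "part-0.parquet"}}',  # e.g. rewritten by OPTIMIZE
    '{"add": {"path": "part-2.parquet"}}',
]
assert replay(log) == {"part-1.parquet", "part-2.parquet"}
```

Time travel falls out of the same idea: replaying only a prefix of the log reconstructs the table as of an earlier version.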


How McAfee Leverages Databricks on AWS at Scale

McAfee, a global leader in online protection security, enables home users and businesses to stay ahead of fileless attacks, viruses, malware, and other online threats. Learn how McAfee leverages Databricks on AWS to create a centralized data platform as a single source of truth to power customer insights. We will also describe how McAfee uses additional AWS services, specifically Kinesis and CloudWatch, to provide real-time data streaming and to monitor and optimize their Databricks on AWS deployment. Finally, we'll discuss the business benefits and lessons learned during McAfee's petabyte-scale migration to Databricks on AWS using Databricks Delta clone technology coupled with network, compute, and storage optimizations on AWS.


Ingesting data into Lakehouse with COPY INTO

COPY INTO is a popular data ingestion SQL command for Databricks users, especially customers using Databricks SQL. In this talk, we discuss the data ingestion use cases in Databricks and how COPY INTO fits your data ingestion needs. We will discuss a few new COPY INTO features and how to achieve the following use cases:
1. Loading data into a Delta table incrementally
2. Fixing errors in already-loaded data and helping you with data cleansing
3. Evolving your schema over time
4. Previewing data before ingesting
5. Loading data from a third-party data source
In this session, we will demo the new features, discuss the architecture of the implementation, and show how other Databricks features use COPY INTO under the hood.
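The schema-evolution use case amounts to a compatibility-checked merge of the incoming batch schema into the table schema: new columns are appended, existing columns must keep a compatible type. A sketch of that rule (illustrative, not the exact COPY INTO semantics; here "compatible" simply means identical):

```python
def evolve_schema(current, incoming):
    """Schema-evolution sketch: append new columns; reject a type
    change on an existing column."""
    merged = dict(current)
    for col, typ in incoming.items():
        if col in merged and merged[col] != typ:
            raise TypeError(f"incompatible change for column {col}")
        merged.setdefault(col, typ)
    return merged

table = {"id": "long", "ts": "timestamp"}
batch = {"id": "long", "ts": "timestamp", "country": "string"}
# the new 'country' column is appended; existing columns are unchanged
assert evolve_schema(table, batch) == {
    "id": "long", "ts": "timestamp", "country": "string"}
```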


Interactive Analytics on a Massive Scale Using Delta Lake

Interactive, near-real-time analytics is a common requirement for many data teams across different fields.

In the field of web security, interactive analytics allows end users to get real-time or historical insights about the state of their protected resource at any point in time and take action accordingly.

One of the hardest aspects of enabling interactive, near-real-time analytics at massive scale is achieving low response times. Scanning hundreds of terabytes of data over a non-aggregated stream of events (a Delta Lake) and still returning an answer within just a few seconds can be a major challenge.

In this talk we will cover:
• How we built a 5PB Delta Lake of non-aggregated security events
• The challenges we saw along the way: reducing Delta log scans, improving cache affinity, reducing storage throttling errors, etc.
• How we overcame them one by one


Sink Framework Evolution in Apache Flink

Apache Flink is one of the most popular frameworks for unified stream and batch processing. Like every other big data framework, Apache Flink offers connectors to read from and write to external systems; we refer to connectors that write to external systems as sinks. Over the years, multiple frameworks for building sinks have existed inside Apache Flink. The Apache Flink community also noticed the trend of ingesting real-time data directly into data lakes for further use. Therefore, with Apache Flink 1.15, we released the next iteration of our sink framework, designed to accommodate the needs of modern data lake connectors, e.g. lazy file compaction and user-defined shuffling.

In this talk, we first give a brief historical glimpse of the evolution of these frameworks, which started as little more than a simple map operation and grew into a custom operator model that simplified two-phase-commit semantics. Second, we dive deep into Apache Flink's fault-tolerance model to explain how the latest iteration of the sink framework supports exactly-once processing and the complex operations important for delta lakes. In summary, this talk introduces the principles behind the sink framework in Apache Flink and gives developers a starting point for building new Apache Flink connectors.
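The two-phase-commit idea behind exactly-once sinks can be sketched independently of Flink's actual interfaces: writes are staged during a checkpoint interval and published atomically only when the checkpoint completes, so a failure between the phases never exposes partial output. All names in this sketch are illustrative:

```python
class TwoPhaseSink:
    """Two-phase-commit sink sketch: staged data becomes visible only
    when a checkpoint completes, giving exactly-once output."""
    def __init__(self):
        self.staged, self.committed = [], []

    def write(self, record):
        # phase 1: pre-commit, buffer into a staging area
        self.staged.append(record)

    def on_checkpoint_complete(self):
        # phase 2: atomic commit, staged records become visible
        self.committed.extend(self.staged)
        self.staged.clear()

    def on_failure(self):
        # recovery discards anything that was never committed
        self.staged.clear()

sink = TwoPhaseSink()
sink.write("a"); sink.write("b")
sink.on_checkpoint_complete()   # "a" and "b" become visible
sink.write("c")
sink.on_failure()               # "c" was staged but never committed
```

In a file-based delta lake sink, "staging" maps to writing data files and "commit" maps to the single atomic transaction-log append.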


Day 1 Morning Keynote | Data + AI Summit 2022

Day 1 Morning Keynote | Data + AI Summit 2022
Welcome & "Destination Lakehouse" | Ali Ghodsi
Apache Spark Community Update | Reynold Xin
Streaming Lakehouse | Karthik Ramasamy
Delta Lake | Michael Armbrust
How Adobe migrated to a unified and open data Lakehouse to deliver personalization at unprecedented scale | Dave Weinstein
Data Governance and Sharing on Lakehouse | Matei Zaharia
Analytics Engineering and the Great Convergence | Tristan Handy
Data Warehousing | Shant Hovespian
Unlocking the power of data, AI & analytics: Amgen's journey to the Lakehouse | Kerby Johnson

Get insights on how to launch a successful lakehouse architecture in Rise of the Data Lakehouse by Bill Inmon, the father of the data warehouse. Download the ebook: https://dbricks.co/3ER9Y0K


How Adobe migrated to a unified and open data Lakehouse to deliver personalization at scale.

In this keynote talk, David Weinstein, VP of Engineering for Adobe Experience Cloud, will share Adobe's journey from a simple data lake to a unified, open Lakehouse architecture with Databricks. Adobe can now deliver personalized experiences at scale to diverse customers with greater speed, operational efficiency, and faster innovation across the Experience Cloud portfolio. Learn why Adobe chose to migrate from Iceberg to Delta Lake to drive open-standard development and accelerate innovation in their Lakehouse; they'll also share how leveraging the Delta Lake table format has enabled techniques to support change data capture and significantly improved operational efficiency.


Self-Serve, Automated and Robust CDC pipeline using AWS DMS, DynamoDB Streams and Databricks Delta

Many companies are trying to solve the challenges of ingesting transactional data into a data lake and dealing with late-arriving updates and deletes.

To address this at Swiggy, we have built a CDC (Change Data Capture) system: an incremental processing framework that powers all business-critical data pipelines at low latency and high efficiency.

It offers:
• Freshness: It operates in near real time with configurable latency requirements.
• Performance: Optimized read and write performance through tuned compaction parameters, partitioning, and Delta table optimization.
• Consistency: It supports reconciliation based on transaction types, applying inserts, updates, and deletes to existing data.

To implement this system, AWS DMS helped us with initial bootstrapping and CDC replication for MySQL sources. AWS Lambda and DynamoDB Streams helped us solve bootstrapping and CDC replication for the DynamoDB source.

After setting up the bootstrap and CDC replication process, we used Databricks Delta merge to reconcile the data based on the transaction types.
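The reconciliation step can be pictured as replaying typed CDC events onto the current snapshot, which is the role Delta's merge plays at scale. A minimal in-memory sketch with hypothetical field names:

```python
def apply_cdc(table, events):
    """Reconciliation sketch: replay CDC events onto the current
    snapshot, honouring each event's transaction type."""
    rows = dict(table)   # key -> row payload
    for e in events:
        if e["op"] in ("insert", "update"):
            rows[e["key"]] = e["value"]
        elif e["op"] == "delete":
            rows.pop(e["key"], None)
    return rows

snapshot = {1: "pending"}
events = [
    {"op": "update", "key": 1, "value": "delivered"},
    {"op": "insert", "key": 2, "value": "pending"},
    {"op": "delete", "key": 2},
]
# order 1 is updated; order 2 is inserted and then deleted
assert apply_cdc(snapshot, events) == {1: "delivered"}
```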

To support the merge, we implemented supporting features:
• Deduplicating multiple mutations of the same record using log offset and timestamp
• Adding optimal partitioning of the data set
• Inferring the schema and applying proper, backward-compatible schema evolution
• Extending the Delta table snapshot generation technique to create consistent partitions for partitioned Delta tables
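Deduplication by log offset and timestamp can be sketched as keeping, per key, the mutation with the maximal (timestamp, offset) pair, so replayed or out-of-order events collapse correctly. Field names here are illustrative:

```python
def dedupe(mutations):
    """Keep only the latest mutation per key, ordered by
    (timestamp, log offset)."""
    latest = {}
    for m in mutations:
        cur = latest.get(m["key"])
        if cur is None or (m["ts"], m["offset"]) > (cur["ts"], cur["offset"]):
            latest[m["key"]] = m
    return list(latest.values())

events = [
    {"key": 1, "ts": 100, "offset": 1, "val": "a"},
    {"key": 1, "ts": 100, "offset": 2, "val": "b"},  # same ts, later offset wins
    {"key": 1, "ts": 90,  "offset": 3, "val": "c"},  # stale replay, ignored
]
assert dedupe(events) == [{"key": 1, "ts": 100, "offset": 2, "val": "b"}]
```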

Finally, to read the data we use Spark SQL with the Hive metastore, and Snowflake. Delta tables read with Spark SQL have implicit support for the Hive metastore. We built our own Snowflake sync process to create external tables, internal tables, and materialized views on Snowflake.

Stats: 500M CDC logs/day · 600+ tables
