Data Streaming

Supercharging Sales Intelligence: Processing Billions of Events via Structured Streaming

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Anurag Bharati (DigiCert) , Nikita Raje (DigiCert)

API Databricks Delta Cyber Security Spark

DigiCert is a digital security company that provides digital certificates, encryption and authentication services and serves 88% of the Fortune 500, securing over 28 billion web connections daily. Our project aggregates and analyzes certificate transparency logs via public APIs to provide comprehensive market and competitive intelligence. Instead of relying on third-party providers with limited data, our project gives full control, deeper insights and automation. Databricks has helped us reliably poll public APIs in a scalable manner that fetches millions of events daily, deduplicate and store them in our Delta tables. We specifically use Spark for parallel processing, structured streaming for real-time ingestion and deduplication, Delta tables for data reliability, pools and jobs to ensure our costs are optimized. These technologies help us keep our data fresh, accurate and cost effective. This data has helped our sales team with real-time intelligence, ensuring DigiCert's success.

What’s New in Databricks SQL: Latest Features and Live Demos

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Gaurav Saraf (Databricks) , Kent Marten (Databricks)

AI/ML BI Databricks SQL

Databricks SQL has added significant features in the last year at a fast pace. This session will share the most impactful features and the customer use cases that inspired them. We will highlight the new SQL editor, SQL coding features, streaming tables and materialized views, BI integrations, cost management features, system tables and observability features, and more. We will also share AI-powered performance optimizations.

Sponsored by: Redpanda | IoT for Fun & Prophet: Scaling IoT and predicting the future with Redpanda, Iceberg & Prophet

Eliminate Hops in Your Streaming Architecture with Zerobus, Part of Lakeflow Connect

2025-06-12 · Data + AI Summit 2025

talk

by Victoria Bukta (Shopify) , Nikola Obradovic (Databricks)

AI/ML Analytics API Data Lakehouse IoT

In this session, we’ll introduce Zerobus Direct Write API, part of Lakeflow Connect, which enables you to push data directly to your lakehouse and simplify ingestion for IOT, clickstreams, telemetry, and more. We’ll start with an overview of the ingestion landscape to date. Then, we'll cover how you can “shift left” with Zerobus, embedding data ingestion into your operational systems to make analytics and AI a core component of the business, rather than an afterthought. The result is a significantly simpler architecture that scales your operations, using this new paradigm to skip unnecessary hops. We'll also highlight one of our early customers, Joby Aviation and how they use Zerobus. Finally, we’ll provide a framework to help you understand when to use Zerobus versus other ingestion offerings—and we’ll wrap up with a live Q&A so that you can hit the ground running with your own use cases.

Hands-on Learning: AI-Powered Data Engineering with Lakeflow: Techniques for Modern Data Professionals (repeat)

2025-06-12 · Data + AI Summit 2025

talk

by Frank Munz (Databricks)

AI/ML Data Engineering Data Governance Databricks GenAI GitHub SQL

This session is repeated. This introductory workshop caters to data engineers seeking hands-on experience and data architects looking to deepen their knowledge. The workshop is structured to provide a solid understanding of the following data engineering and streaming concepts: Introduction to Lakeflow and the Data Intelligence Platform Getting started with Lakeflow Declarative Pipelines for declarative data pipelines in SQL using Streaming Tables and Materialized Views Mastering Databricks Workflows with advanced control flow and triggers Understanding serverless compute Data governance and lineage with Unity Catalog Generative AI for Data Engineers: Genie and Databricks Assistant We believe you can only become an expert if you work on real problems and gain hands-on experience. Therefore, we will equip you with your own lab environment in this workshop and guide you through practical exercises like using GitHub, ingesting data from various sources, creating batch and streaming data pipelines, and more.

What’s New in Apache Spark™ 4.0?

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Daniel Tenedorio (Databricks) , Wenchen Fan (Databricks)

AI/ML API Java Python Scala Spark SQL

Join this session for a concise tour of Apache Spark™ 4.0’s most notable enhancements: SQL features: ANSI by default, scripting, SQL pipe syntax, SQL UDF, session variable, view schema evolution, etc. Data type: VARIANT type, string collation Python features: Python data source, plotting API, etc. Streaming improvements: State store data source, state store checkpoint v2, arbitrary state v2, etc. Spark Connect improvements: More API coverage, thin client, unified Scala interface, etc. Infrastructure: Better error message, structured logging, new Java/Scala version support, etc. Whether you’re a seasoned Spark user or new to the ecosystem, this talk will prepare you to leverage Spark 4.0’s latest innovations for modern data and AI pipelines.

Sponsored by: IBM | How to leverage unstructured data to build more accurate, trustworthy AI agents

Better Together: Change Data Feed in a Streaming Data Flow

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Mattias Moser (84.51 LLC) , Scott Gordon (84.51˚)

Delta Marketing

Traditional streaming works great when your data source is append-only, but what if your data source includes updates and deletes? At 84.51 we used Lakeflow Declarative Pipelines and Delta Lake to build a streaming data flow that consumes inserts, updates and deletes while still taking advantage of streaming checkpoints. We combined this flow with a materialized view and Enzyme incremental refresh for a low-code, efficient and robust end-to-end data flow.We process around 8 million sales transactions each day with 80 million items purchased. This flow not only handles new transactions but also handles updates to previous transactions.Join us to learn how 84.51 combined change data feed, data streaming and materialized views to deliver a “better together” solution.84.51 is a retail insights, media & marketing company. We use first-party retail data from 60 million households sourced through a loyalty card program to drive Kroger’s customer-centric journey.

Healthcare Interoperability: End-to-End Streaming FHIR Pipelines With Databricks & Redox

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Tim Kessler (Redox, Inc.) , Matthew Giglia (Databricks)

AI/ML API Amazon EMR BI Data Lakehouse Databricks Delta ETL/ELT SQL

Redox & Databricks direct integration can streamline your interoperability workflows from responding in record time to preauthorization requests to letting attending physicians know about a change in risk for sepsis and readmission in near real time from ADTs. Data engineers will learn how to create fully-streaming ETL pipelines for ingesting, parsing and acting on insights from Redox FHIR bundles delivered directly to Unity Catalog volumes. Once available in the Lakehouse, AI/BI Dashboards and Agentic Frameworks help write FHIR messages back to Redox for direct push down to EMR systems. Parsing FHIR bundle resources has never been easier with SQL combined with the new VARIANT data type in Delta and streaming table creation against Serverless DBSQL Warehouses. We'll also use Databricks accelerators dbignite and redoxwrite for writing and posting FHIR bundles back to Redox integrated EMRs and we'll extend AI/BI with Unity Catalog SQL UDFs and the Redox API for use in Genie.

Introducing Simplified State Tracking in Apache Spark™ Structured Streaming

2025-06-12 · Data + AI Summit 2025 Watch

lightning_talk

by Craig Lukasik (Databricks)

API Spark

This presentation will review the new change feed and snapshot capabilities in Apache Spark™ Structured Streaming’s State Reader API. The State Reader API enables users to access and analyze Structured Streaming's internal state data. Readers will learn how to leverage the new features to debug, troubleshoot and analyze state changes efficiently, making streaming workloads easier to manage at scale.

Leveling Up Gaming Analytics: How Supercell Evolved Player Experiences With Snowplow and Databricks

2025-06-12 · Data + AI Summit 2025 Watch

lightning_talk

by Alex Dean (Snowplow)

AI/ML Analytics Data Collection Data Lakehouse Databricks Delta Snowplow

In the competitive gaming industry, understanding player behavior is key to delivering engaging experiences. Supercell, creators of Clash of Clans and Brawl Stars, faced challenges with fragmented data and limited visibility into user journeys. To address this, they partnered with Snowplow and Databricks to build a scalable, privacy-compliant data platform for real-time insights. By leveraging Snowplow’s behavioral data collection and Databricks’ Lakehouse architecture, Supercell achieved: Cross-platform data unification: A unified view of player actions across web, mobile and in-game Real-time analytics: Streaming event data into Delta Lake for dynamic game balancing and engagement Scalable infrastructure: Supporting terabytes of data during launches and live events AI & ML use cases: Churn prediction and personalized in-game recommendations This session explores Supercell’s data journey and AI-driven player engagement strategies.

Sponsored by: Confluent | Turn SAP Data into AI-Powered Insights with Databricks

Creating a Custom PySpark Stream Reader with PySpark 4.0

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by Skyler Myers (Entrada)

Databricks Delta Java Kafka MySQL PySpark Spark

PySpark supports many data sources out of the box, such as Apache Kafka, JDBC, ODBC, Delta Lake, etc. However, some older systems, such as systems that use JMS protocol, are not supported by default and require considerable extra work for developers to read from them. One such example is ActiveMQ for streaming. Traditionally, users of ActiveMQ have to use a middle-man in order to read the stream with Spark (such as writing to a MySQL DB using Java code and reading that table with Spark JDBC). With PySpark 4.0’s custom data sources (supported in DBR 15.3+) we are able to cut out the middle-man processing using batch or Spark Streaming and consume the queues directly from PySpark, saving developers considerable time and complexity in getting source data into your Delta Lake and governed by Unity Catalog and orchestrated with Databricks Workflows.

Disney's Foundational Medallion: A Journey Into Next-Generation Data Architecture

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by mark.senerth mark.senerth (Disney)

Databricks Delta

Step into the world of Disney Streaming as we unveil the creation of our Foundational Medallion, a cornerstone in our architecture that redefines how we manage data at scale. In this session, we'll explore how we tackled the multi-faceted challenges of building a consistent, self-service surrogate key architecture — a foundational dataset for every ingested stream powering Disney Streaming's data-driven decisions. Learn how we streamlined our architecture and unlocked new efficiencies by leveraging cutting-edge Databricks features such as liquid clustering, Photon with dynamic file pruning, Delta's identity column, Unity Catalog and more — transforming our implementation into a simpler, more scalable solution. Join us on this thrilling journey as we navigate the twists and turns of designing and implementing a new Medallion at scale — the very heartbeat of our streaming business!

Innovating Retail Data: Unilever’s Transformation with Databricks Lakeflow Declarative Pipelines

2025-06-11 · Data + AI Summit 2025

talk

by Evan Cherney (Unilever)

Data Management Databricks

Retail data is expanding at an unprecedented rate, demanding a scalable, cost-efficient, and near real-time architecture. At Unilever, we transformed our data management approach by leveraging Databricks Lakeflow Declarative Pipelines, achieving approximately $500K in cost savings while accelerating computation speeds by 200–500%.By adopting a streaming-driven architecture, we built a system where data flows continuously across processing layers, enabling real-time updates with minimal latency.Lakeflow Declarative Pipelines' serverless simplicity replaced complex-dependency management, reducing maintenance overhead, and improving pipeline reliability. Lakeflow Declarative Pipelines Direct Publishing further enhanced data segmentation, concurrency, and governance, ensuring efficient and scalable data operations while simplifying workflows.This transformation empowers Unilever to manage data with greater efficiency, scalability, and reduced costs, creating a future-ready infrastructure that evolves with the needs of our retail partners and customers.

Mastering Change Data Capture With Lakeflow Declarative Pipelines

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Ray Zhu (Databricks) , Jacob Gollub (Square)

Analytics Cloud Computing Cloud Storage

Transactional systems are a common source of data for analytics, and Change Data Capture (CDC) offers an efficient way to extract only what’s changed. However, ingesting CDC data into an analytics system comes with challenges, such as handling out-of-order events or maintaining global order across multiple streams. These issues often require complex, stateful stream processing logic.This session will explore how Lakeflow Declarative Pipelines simplifies CDC ingestion using the Apply Changes function. With Apply Changes, global ordering across multiple change feeds is handled automatically — there is no need to manually manage state or understand advanced streaming concepts like watermarks. It supports both snapshot-based inputs from cloud storage and continuous change feeds from systems like message buses, reducing complexity for common streaming use cases.

Race to Real-Time: Low-Latency Streaming ETL Meets Next-Gen Databricks OLTP-DB

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by Irfan Elahi (Databricks)

Kinesis Databricks ETL/ELT Spark postgresql

In today’s digital economy, real-time insights and rapid responsiveness are paramount to delivering exceptional user experiences and lowering TCO. In this session, discover a pioneering approach that leverages a low-latency streaming ETL pipeline built with Spark Structured Streaming and Databricks’ new OLTP-DB—a serverless, managed Postgres offering designed for transactional workloads. Validated in a live customer scenario, this architecture achieves sub-2 second end-to-end latency by seamlessly ingesting streaming data from Kinesis and merging it into OLTP-DB. This breakthrough not only enhances performance and scalability but also provides a replicable blueprint for transforming data pipelines across various verticals. Join us as we delve into the advanced optimization techniques and best practices that underpin this innovation, demonstrating how Databricks’ next-generation solutions can revolutionize real-time data processing and unlock a myriad of new use cases in data landscape.

Crypto at Scale: Building a High-Performance Platform for Real-Time Blockchain Data

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Matthew Moorcroft (Databricks) , Ferran Cabezas Castellvi (Elliptic)

Analytics Blockchain Databricks Delta SQL

In today’s fast-evolving crypto landscape, organizations require fast, reliable intelligence to manage risk, investigate financial crime, and stay ahead of evolving threats. In this session we will discover how Elliptic built a scalable, high-performance Data Intelligence Platform that delivers real-time, actionable Blockchain insights to their customers. We’ll walk you through some of the key components of the Elliptic Platform, including the Elliptic Entity Graph and our User-Facing Analytics. Our focus will be put on the evolution of our User-Facing Analytics capabilities, and specifically how components from the Databricks ecosystem such as Structured Streaming, Delta Lake, and SQL Warehouse have played a vital role. We’ll also share some of the optimizations we’ve made to our streaming jobs to maximize performance and ensure Data Completeness. Whether you’re looking to enhance your streaming capabilities, expand your knowledge of how crypto analytics works or simply discover novel approaches to data processing at scale, this session will provide concrete strategies and valuable lessons learned.

Delivering Sub-Second Latency for Operational Workloads on Databricks

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Karthikeyan Ramasamy (Databricks) , Jerry Peng (Databricks)

Databricks Spark

As enterprise streaming adoption accelerates, more teams are turning to real-time processing to support operational workloads that require sub-second response times. To address this need, Databricks introduced Project Lightspeed in 2022, which recently delivered Real-Time Mode in Apache Spark™ Structured Streaming. This new mode achieves consistent p99 latencies under 300ms for a wide range of stateless and stateful streaming queries. In this session, we’ll define what constitutes an operational use case, outline typical latency requirements and walk through how to meet those SLAs using Real-Time Mode in Structured Streaming.

No More Fragile Pipelines: Kafka and Iceberg the Declarative Way

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Adi Polak (Treeverse)

Analytics Iceberg Kafka Parquet

Moving data between operational systems and analytics platforms is often painful. Traditional pipelines become complex, brittle, and expensive to maintain.Take Kafka and Iceberg: batching on Kafka causes ingestion bottlenecks, while streaming-style writes to Iceberg create too many small Parquet files—cluttering metadata, degrading queries, and increasing maintenance overhead. Frequent updates further strain background table operations, causing retries—even before dealing with schema evolution. But much of this complexity is avoidable. What if Kafka Topics and Iceberg Tables were treated as two sides of the same coin? By establishing a transparent equivalence, we can rethink pipeline design entirely. This session introduces Tableflow—a new approach to bridging streaming and table-based systems. It shifts complexity away from pipelines and into a unified layer, enabling simpler, declarative workflows. We’ll cover schema evolution, compaction, topic-to-table mapping, and how to continuously materialize and optimize thousands of topics as Iceberg tables. Whether modernizing or starting fresh, you’ll leave with practical insights for building resilient, scalable, and future-proof data architectures.

talk-data.com

Activity Trend

Top Events

Top Speakers

Supercharging Sales Intelligence: Processing Billions of Events via Structured Streaming

What’s New in Databricks SQL: Latest Features and Live Demos

Sponsored by: Redpanda | IoT for Fun & Prophet: Scaling IoT and predicting the future with Redpanda, Iceberg & Prophet

Eliminate Hops in Your Streaming Architecture with Zerobus, Part of Lakeflow Connect

Hands-on Learning: AI-Powered Data Engineering with Lakeflow: Techniques for Modern Data Professionals (repeat)

What’s New in Apache Spark™ 4.0?

Sponsored by: IBM | How to leverage unstructured data to build more accurate, trustworthy AI agents

Better Together: Change Data Feed in a Streaming Data Flow

Healthcare Interoperability: End-to-End Streaming FHIR Pipelines With Databricks & Redox

Introducing Simplified State Tracking in Apache Spark™ Structured Streaming

Leveling Up Gaming Analytics: How Supercell Evolved Player Experiences With Snowplow and Databricks

Sponsored by: Confluent | Turn SAP Data into AI-Powered Insights with Databricks

Creating a Custom PySpark Stream Reader with PySpark 4.0

Disney's Foundational Medallion: A Journey Into Next-Generation Data Architecture

Innovating Retail Data: Unilever’s Transformation with Databricks Lakeflow Declarative Pipelines

Mastering Change Data Capture With Lakeflow Declarative Pipelines

Race to Real-Time: Low-Latency Streaming ETL Meets Next-Gen Databricks OLTP-DB

Crypto at Scale: Building a High-Performance Platform for Real-Time Blockchain Data

Delivering Sub-Second Latency for Operational Workloads on Databricks

No More Fragile Pipelines: Kafka and Iceberg the Declarative Way