
Event

Data + AI Summit 2025

2025-06-09 – 2025-06-13 · Databricks Summit

Activities tracked

64

Filtering by: Data Streaming

Sessions & talks

Showing 1–25 of 64 · Newest first

Supercharging Sales Intelligence: Processing Billions of Events via Structured Streaming

2025-06-12
talk
Anurag Bharati (DigiCert) , Nikita Raje (DigiCert)

DigiCert is a digital security company that provides digital certificates, encryption and authentication services; it serves 88% of the Fortune 500 and secures over 28 billion web connections daily. Our project aggregates and analyzes certificate transparency logs via public APIs to provide comprehensive market and competitive intelligence. Instead of relying on third-party providers with limited data, our project gives us full control, deeper insights and automation. Databricks has helped us reliably poll public APIs at scale, fetching millions of events daily, and deduplicate and store them in our Delta tables. We specifically use Spark for parallel processing, structured streaming for real-time ingestion and deduplication, Delta tables for data reliability, and pools and jobs to keep our costs optimized. These technologies help us keep our data fresh, accurate and cost-effective. This data has given our sales team real-time intelligence, ensuring DigiCert's success.
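The deduplication step this abstract describes can be sketched in plain Python (a hedged illustration of the idea, not DigiCert's actual pipeline; the `event_id` field and batch shape are assumptions — in Structured Streaming this is what a keyed state store behind `dropDuplicates`/`dropDuplicatesWithinWatermark` does):

```python
def dedup_stream(batches, key="event_id"):
    """Drop events whose key was already seen in an earlier micro-batch,
    mimicking the keyed state store Structured Streaming maintains."""
    seen = set()  # in Spark this state lives in the checkpoint, not in memory
    for batch in batches:
        fresh = [e for e in batch if e[key] not in seen]
        seen.update(e[key] for e in fresh)
        yield fresh

batches = [
    [{"event_id": 1, "cert": "a"}, {"event_id": 2, "cert": "b"}],
    [{"event_id": 2, "cert": "b"}, {"event_id": 3, "cert": "c"}],  # 2 is a duplicate
]
deduped = list(dedup_stream(batches))
```

In a real pipeline the `seen` set cannot grow forever; a watermark bounds how long each key is retained, which is exactly the trade-off the Spark API exposes.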

What’s New in Databricks SQL: Latest Features and Live Demos

2025-06-12
talk
Gaurav Saraf (Databricks) , Kent Marten (Databricks)

Databricks SQL has added significant features in the last year at a fast pace. This session will share the most impactful features and the customer use cases that inspired them. We will highlight the new SQL editor, SQL coding features, streaming tables and materialized views, BI integrations, cost management features, system tables and observability features, and more. We will also share AI-powered performance optimizations.

Sponsored by: Redpanda | IoT for Fun & Prophet: Scaling IoT and predicting the future with Redpanda, Iceberg & Prophet

2025-06-12
lightning_talk
Bryan Wood (Redpanda Data)

In this talk, we’ll walk through a complete real-time IoT architecture—from an economical, high-powered ESP32 microcontroller publishing environmental sensor data to AWS IoT, through Redpanda Connect into a Redpanda BYOC cluster, and finally into Apache Iceberg for long-term analytical storage. Once the data lands, we’ll query it using Python and perform linear regression with Prophet to forecast future trends. Along the way, we’ll explore the design of a scalable, cloud-native pipeline for streaming IoT data. Whether you're tracking the weather or building the future, this session will help you architect with confidence—and maybe even predict it.
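The forecasting step — fitting a trend to sensor readings and projecting it forward — can be sketched without Prophet (a pure-Python least-squares fit over hypothetical readings; Prophet layers seasonality and uncertainty intervals on top of this basic trend idea):

```python
def fit_line(ys):
    """Ordinary least squares for y = a*x + b over x = 0..n-1."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def forecast(ys, steps):
    """Extend the fitted line `steps` points past the observed series."""
    a, b = fit_line(ys)
    n = len(ys)
    return [a * (n + i) + b for i in range(steps)]

temps = [20.0, 20.5, 21.0, 21.5, 22.0]  # hypothetical hourly sensor readings
print(forecast(temps, 2))               # → [22.5, 23.0]
```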

Eliminate Hops in Your Streaming Architecture with Zerobus, Part of Lakeflow Connect

2025-06-12
talk
Victoria Bukta (Databricks) , Nikola Obradovic (Databricks)

In this session, we’ll introduce Zerobus Direct Write API, part of Lakeflow Connect, which enables you to push data directly to your lakehouse and simplify ingestion for IoT, clickstreams, telemetry, and more. We’ll start with an overview of the ingestion landscape to date. Then, we'll cover how you can “shift left” with Zerobus, embedding data ingestion into your operational systems to make analytics and AI a core component of the business, rather than an afterthought. The result is a significantly simpler architecture that scales your operations, using this new paradigm to skip unnecessary hops. We'll also highlight one of our early customers, Joby Aviation, and how they use Zerobus. Finally, we’ll provide a framework to help you understand when to use Zerobus versus other ingestion offerings — and we’ll wrap up with a live Q&A so that you can hit the ground running with your own use cases.

Hands-on Learning: AI-Powered Data Engineering with Lakeflow: Techniques for Modern Data Professionals (repeat)

2025-06-12
talk
Frank Munz (Databricks)

This session is repeated. This introductory workshop caters to data engineers seeking hands-on experience and data architects looking to deepen their knowledge. The workshop is structured to provide a solid understanding of the following data engineering and streaming concepts:

- Introduction to Lakeflow and the Data Intelligence Platform
- Getting started with Lakeflow Declarative Pipelines for declarative data pipelines in SQL using Streaming Tables and Materialized Views
- Mastering Databricks Workflows with advanced control flow and triggers
- Understanding serverless compute
- Data governance and lineage with Unity Catalog
- Generative AI for Data Engineers: Genie and Databricks Assistant

We believe you can only become an expert if you work on real problems and gain hands-on experience. Therefore, we will equip you with your own lab environment in this workshop and guide you through practical exercises like using GitHub, ingesting data from various sources, creating batch and streaming data pipelines, and more.

What’s New in Apache Spark™ 4.0?

2025-06-12
talk
Wenchen Fan (Databricks) , Daniel Tenedorio (Databricks)

Join this session for a concise tour of Apache Spark™ 4.0’s most notable enhancements:

- SQL features: ANSI mode by default, SQL scripting, SQL pipe syntax, SQL UDFs, session variables, view schema evolution, and more
- Data types: the VARIANT type and string collation
- Python features: Python data sources, plotting API, and more
- Streaming improvements: state store data source, state store checkpoint v2, arbitrary state v2, and more
- Spark Connect improvements: broader API coverage, a thin client, a unified Scala interface, and more
- Infrastructure: better error messages, structured logging, and new Java/Scala version support

Whether you’re a seasoned Spark user or new to the ecosystem, this talk will prepare you to leverage Spark 4.0’s latest innovations for modern data and AI pipelines.

Sponsored by: IBM | How to leverage unstructured data to build more accurate, trustworthy AI agents

2025-06-12
lightning_talk

As AI adoption accelerates, unstructured data has emerged as a critical—yet often overlooked—asset for building accurate, trustworthy AI agents. But preparing and governing this data at scale remains a challenge. Traditional data integration and RAG approaches fall short. In this session, discover how IBM enables AI agents grounded in governed, high-quality unstructured data. Learn how our unified data platform streamlines integration across batch, streaming, replication, and unstructured sources—while accelerating data intelligence through built-in governance, quality, lineage, and data sharing. But governance doesn’t stop at data. We’ll explore how AI governance extends oversight to the models and agents themselves. Walk away with practical strategies to simplify your stack, strengthen trust in AI outputs, and deliver AI-ready data at scale.

Better Together: Change Data Feed in a Streaming Data Flow

2025-06-12
talk
Mattias Moser (84.51 LLC) , Scott Gordon (84.51˚)

Traditional streaming works great when your data source is append-only, but what if your data source includes updates and deletes? At 84.51 we used Lakeflow Declarative Pipelines and Delta Lake to build a streaming data flow that consumes inserts, updates and deletes while still taking advantage of streaming checkpoints. We combined this flow with a materialized view and Enzyme incremental refresh for a low-code, efficient and robust end-to-end data flow. We process around 8 million sales transactions each day with 80 million items purchased. This flow not only handles new transactions but also handles updates to previous transactions. Join us to learn how 84.51 combined change data feed, data streaming and materialized views to deliver a “better together” solution. 84.51 is a retail insights, media & marketing company. We use first-party retail data from 60 million households sourced through a loyalty card program to drive Kroger’s customer-centric journey.
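The core pattern — consuming inserts, updates and deletes from a change feed while keeping an aggregate current — can be sketched in plain Python (the `_change_type` values mirror Delta's change data feed; the sales schema is a made-up stand-in, not 84.51's actual tables):

```python
def apply_change_feed(total_items, changes):
    """Incrementally maintain a running item count from change-feed rows.
    update_preimage/update_postimage pairs mirror Delta's CDF encoding
    of an in-place update as a remove-then-add."""
    for row in changes:
        kind = row["_change_type"]
        if kind == "insert":
            total_items += row["items"]
        elif kind == "delete":
            total_items -= row["items"]
        elif kind == "update_preimage":
            total_items -= row["items"]   # subtract the old version of the row
        elif kind == "update_postimage":
            total_items += row["items"]   # add the corrected version back
    return total_items

changes = [
    {"_change_type": "insert", "txn": 1, "items": 3},
    {"_change_type": "update_preimage", "txn": 1, "items": 3},
    {"_change_type": "update_postimage", "txn": 1, "items": 5},  # txn corrected later
]
```

The point of the pre/post-image encoding is exactly what the abstract highlights: an updated transaction flows through the same incremental path as a new one, so the materialized aggregate never needs a full recompute.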

Healthcare Interoperability: End-to-End Streaming FHIR Pipelines With Databricks & Redox

2025-06-12
talk
Tim Kessler (Redox, Inc.) , Matthew Giglia (Databricks)

The direct Redox and Databricks integration can streamline your interoperability workflows, from responding to preauthorization requests in record time to letting attending physicians know in near real time, via ADT feeds, about a change in sepsis and readmission risk. Data engineers will learn how to create fully streaming ETL pipelines for ingesting, parsing and acting on insights from Redox FHIR bundles delivered directly to Unity Catalog volumes. Once the data is available in the Lakehouse, AI/BI Dashboards and agentic frameworks help write FHIR messages back to Redox for direct pushdown to EMR systems. Parsing FHIR bundle resources has never been easier with SQL combined with the new VARIANT data type in Delta and streaming table creation against Serverless DBSQL Warehouses. We'll also use the Databricks accelerators dbignite and redoxwrite for writing and posting FHIR bundles back to Redox-integrated EMRs, and we'll extend AI/BI with Unity Catalog SQL UDFs and the Redox API for use in Genie.

Introducing Simplified State Tracking in Apache Spark™ Structured Streaming

2025-06-12
lightning_talk
Craig Lukasik (Databricks)

This presentation will review the new change feed and snapshot capabilities in Apache Spark™ Structured Streaming’s State Reader API. The State Reader API enables users to access and analyze Structured Streaming's internal state data. Attendees will learn how to leverage the new features to debug, troubleshoot and analyze state changes efficiently, making streaming workloads easier to manage at scale.

Leveling Up Gaming Analytics: How Supercell Evolved Player Experiences With Snowplow and Databricks

2025-06-12
lightning_talk
Alex Dean (Snowplow)

In the competitive gaming industry, understanding player behavior is key to delivering engaging experiences. Supercell, creators of Clash of Clans and Brawl Stars, faced challenges with fragmented data and limited visibility into user journeys. To address this, they partnered with Snowplow and Databricks to build a scalable, privacy-compliant data platform for real-time insights. By leveraging Snowplow’s behavioral data collection and Databricks’ Lakehouse architecture, Supercell achieved:

- Cross-platform data unification: a unified view of player actions across web, mobile and in-game
- Real-time analytics: streaming event data into Delta Lake for dynamic game balancing and engagement
- Scalable infrastructure: supporting terabytes of data during launches and live events
- AI & ML use cases: churn prediction and personalized in-game recommendations

This session explores Supercell’s data journey and AI-driven player engagement strategies.

Sponsored by: Confluent | Turn SAP Data into AI-Powered Insights with Databricks

2025-06-12
talk
Rodrigo Sanchez Bredee (Confluent) , Sean Falconer (Confluent)

Learn how Confluent simplifies real-time streaming of your SAP data into AI-ready Delta tables on Databricks. In this session, you'll see how Confluent’s fully managed data streaming platform—with unified Apache Kafka® and Apache Flink®—connects data from SAP S/4HANA, ECC, and 120+ other sources to enable easy development of trusted, real-time data products that fuel highly contextualized AI and analytics. With Tableflow, you can represent Kafka topics as Delta tables in just a few clicks—eliminating brittle batch jobs and custom pipelines. You’ll see a product demo showcasing how Confluent unites your SAP and Databricks environments to unlock ERP-fueled AI, all while reducing the total cost of ownership (TCO) for data streaming by up to 60%.

Creating a Custom PySpark Stream Reader with PySpark 4.0

2025-06-11
lightning_talk
Skyler Myers (Entrada)

PySpark supports many data sources out of the box, such as Apache Kafka, JDBC, ODBC and Delta Lake. However, some older systems, such as those that use the JMS protocol, are not supported by default and require considerable extra work for developers to read from. One such example is ActiveMQ for streaming. Traditionally, users of ActiveMQ have had to go through a middleman to read the stream with Spark (for example, writing to a MySQL database with Java code and reading that table with Spark JDBC). With PySpark 4.0’s custom data sources (supported in DBR 15.3+), we can cut out the middleman and consume the queues directly from PySpark, in batch or with Spark Streaming — saving developers considerable time and complexity in landing source data in Delta Lake, governed by Unity Catalog and orchestrated with Databricks Workflows.
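The mechanics of such a custom stream reader — track an offset, hand back micro-batches, resume from the last committed position — can be sketched in plain Python (a structural illustration only; the real PySpark 4.0 API subclasses `DataSourceStreamReader` and involves partition planning and offset serialization):

```python
class QueueStreamReader:
    """Toy message-queue reader: each micro-batch covers offsets [start, end),
    so a restart from a checkpointed offset never re-reads or skips messages."""

    def __init__(self, queue):
        self.queue = queue            # stands in for an ActiveMQ/JMS queue

    def latest_offset(self):
        """Report how far the source has advanced (one past the last message)."""
        return len(self.queue)

    def read(self, start, end):
        """Return the messages for one micro-batch."""
        return self.queue[start:end]

queue = ["msg-0", "msg-1", "msg-2"]
reader = QueueStreamReader(queue)
batch1 = reader.read(0, reader.latest_offset())   # first micro-batch
queue.extend(["msg-3", "msg-4"])                  # new messages arrive
batch2 = reader.read(3, reader.latest_offset())   # resume from committed offset 3
```

The engine, not the reader, persists the committed offset in the checkpoint; the reader only needs to answer "what is the latest offset?" and "give me this range", which is why the pattern maps cleanly onto a queue.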

Disney's Foundational Medallion: A Journey Into Next-Generation Data Architecture

2025-06-11
lightning_talk

Step into the world of Disney Streaming as we unveil the creation of our Foundational Medallion, a cornerstone in our architecture that redefines how we manage data at scale. In this session, we'll explore how we tackled the multi-faceted challenges of building a consistent, self-service surrogate key architecture — a foundational dataset for every ingested stream powering Disney Streaming's data-driven decisions. Learn how we streamlined our architecture and unlocked new efficiencies by leveraging cutting-edge Databricks features such as liquid clustering, Photon with dynamic file pruning, Delta's identity column, Unity Catalog and more — transforming our implementation into a simpler, more scalable solution. Join us on this thrilling journey as we navigate the twists and turns of designing and implementing a new Medallion at scale — the very heartbeat of our streaming business!

Innovating Retail Data: Unilever’s Transformation with Databricks Lakeflow Declarative Pipelines

2025-06-11
talk
Evan Cherney (Unilever)

Retail data is expanding at an unprecedented rate, demanding a scalable, cost-efficient, and near real-time architecture. At Unilever, we transformed our data management approach by leveraging Databricks Lakeflow Declarative Pipelines, achieving approximately $500K in cost savings while accelerating computation speeds by 200–500%. By adopting a streaming-driven architecture, we built a system where data flows continuously across processing layers, enabling real-time updates with minimal latency. Lakeflow Declarative Pipelines’ serverless simplicity replaced complex dependency management, reducing maintenance overhead and improving pipeline reliability. Lakeflow Declarative Pipelines Direct Publishing further enhanced data segmentation, concurrency, and governance, ensuring efficient and scalable data operations while simplifying workflows. This transformation empowers Unilever to manage data with greater efficiency, scalability, and reduced costs, creating a future-ready infrastructure that evolves with the needs of our retail partners and customers.

Mastering Change Data Capture With Lakeflow Declarative Pipelines

2025-06-11
talk
Ray Zhu (Databricks) , Jacob Gollub (Square)

Transactional systems are a common source of data for analytics, and Change Data Capture (CDC) offers an efficient way to extract only what’s changed. However, ingesting CDC data into an analytics system comes with challenges, such as handling out-of-order events or maintaining global order across multiple streams. These issues often require complex, stateful stream processing logic. This session will explore how Lakeflow Declarative Pipelines simplifies CDC ingestion using the Apply Changes function. With Apply Changes, global ordering across multiple change feeds is handled automatically — there is no need to manually manage state or understand advanced streaming concepts like watermarks. It supports both snapshot-based inputs from cloud storage and continuous change feeds from systems like message buses, reducing complexity for common streaming use cases.
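The out-of-order problem that Apply Changes handles can be seen in a small pure-Python sketch: per key, apply only the change with the highest sequence value and discard stale events that arrive late (column names and the `op` encoding here are illustrative, not the Lakeflow API):

```python
def apply_changes(target, feed):
    """Upsert CDC rows into `target`, ignoring stale (out-of-order) events
    by comparing a monotonically increasing sequence column."""
    for row in feed:
        key, seq = row["id"], row["seq"]
        current = target.get(key)
        if current is not None and current["seq"] >= seq:
            continue                      # a late-arriving stale event: skip it
        if row["op"] == "delete":
            target.pop(key, None)
        else:
            target[key] = {"seq": seq, "name": row["name"]}
    return target

feed = [
    {"id": 1, "seq": 2, "op": "upsert", "name": "new"},
    {"id": 1, "seq": 1, "op": "upsert", "name": "old"},   # arrives out of order
]
table = apply_changes({}, feed)
```

Doing this correctly across many feeds, with deletes and restarts, is exactly the stateful logic the declarative function takes off your hands.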

Race to Real-Time: Low-Latency Streaming ETL Meets Next-Gen Databricks OLTP-DB

2025-06-11
lightning_talk
Irfan Elahi (Databricks)

In today’s digital economy, real-time insights and rapid responsiveness are paramount to delivering exceptional user experiences and lowering TCO. In this session, discover a pioneering approach that leverages a low-latency streaming ETL pipeline built with Spark Structured Streaming and Databricks’ new OLTP-DB — a serverless, managed Postgres offering designed for transactional workloads. Validated in a live customer scenario, this architecture achieves sub-2-second end-to-end latency by seamlessly ingesting streaming data from Kinesis and merging it into OLTP-DB. This breakthrough not only enhances performance and scalability but also provides a replicable blueprint for transforming data pipelines across various verticals. Join us as we delve into the advanced optimization techniques and best practices that underpin this innovation, demonstrating how Databricks’ next-generation solutions can revolutionize real-time data processing and unlock a myriad of new use cases in the data landscape.

Crypto at Scale: Building a High-Performance Platform for Real-Time Blockchain Data

2025-06-11
talk
Matthew Moorcroft (Databricks) , Ferran Cabezas Castellvi (Elliptic)

In today’s fast-evolving crypto landscape, organizations require fast, reliable intelligence to manage risk, investigate financial crime, and stay ahead of evolving threats. In this session we will discover how Elliptic built a scalable, high-performance Data Intelligence Platform that delivers real-time, actionable blockchain insights to their customers. We’ll walk you through some of the key components of the Elliptic Platform, including the Elliptic Entity Graph and our User-Facing Analytics. Our focus will be on the evolution of our User-Facing Analytics capabilities, and specifically how components from the Databricks ecosystem such as Structured Streaming, Delta Lake and SQL Warehouse have played a vital role. We’ll also share some of the optimizations we’ve made to our streaming jobs to maximize performance and ensure data completeness. Whether you’re looking to enhance your streaming capabilities, expand your knowledge of how crypto analytics works or simply discover novel approaches to data processing at scale, this session will provide concrete strategies and valuable lessons learned.

Delivering Sub-Second Latency for Operational Workloads on Databricks

2025-06-11
talk
Karthikeyan Ramasamy (Databricks) , Jerry Peng (Databricks)

As enterprise streaming adoption accelerates, more teams are turning to real-time processing to support operational workloads that require sub-second response times. To address this need, Databricks introduced Project Lightspeed in 2022, which recently delivered Real-Time Mode in Apache Spark™ Structured Streaming. This new mode achieves consistent p99 latencies under 300ms for a wide range of stateless and stateful streaming queries. In this session, we’ll define what constitutes an operational use case, outline typical latency requirements and walk through how to meet those SLAs using Real-Time Mode in Structured Streaming.

No More Fragile Pipelines: Kafka and Iceberg the Declarative Way

2025-06-11
talk
Adi Polak (Confluent)

Moving data between operational systems and analytics platforms is often painful. Traditional pipelines become complex, brittle, and expensive to maintain. Take Kafka and Iceberg: batching on Kafka causes ingestion bottlenecks, while streaming-style writes to Iceberg create too many small Parquet files — cluttering metadata, degrading queries, and increasing maintenance overhead. Frequent updates further strain background table operations, causing retries — even before dealing with schema evolution. But much of this complexity is avoidable. What if Kafka topics and Iceberg tables were treated as two sides of the same coin? By establishing a transparent equivalence, we can rethink pipeline design entirely. This session introduces Tableflow — a new approach to bridging streaming and table-based systems. It shifts complexity away from pipelines and into a unified layer, enabling simpler, declarative workflows. We’ll cover schema evolution, compaction, topic-to-table mapping, and how to continuously materialize and optimize thousands of topics as Iceberg tables. Whether modernizing or starting fresh, you’ll leave with practical insights for building resilient, scalable, and future-proof data architectures.
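The small-files problem described here, and the compaction that fixes it, can be sketched in plain Python: greedily pack many tiny data-file sizes into a few near-target-size files (the numbers are illustrative; real Iceberg compaction rewrites Parquet files and updates table metadata transactionally):

```python
def compact(file_sizes, target=128):
    """Greedily bin-pack small file sizes (in MB) into groups of ~target MB —
    the essence of an Iceberg rewrite/compaction pass."""
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target:
            groups.append(current)        # this group would overflow: seal it
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

small_files = [8] * 32           # 32 files of 8 MB from frequent streaming commits
compacted = compact(small_files) # → 2 groups of 16 files, 128 MB each
```

Fewer, larger files mean less metadata to plan over and fewer objects to open per query — the two costs the abstract attributes to streaming-style writes.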

Scaling Identity Graph Ingestion to 1M Events/Sec with Spark Streaming & Delta Lake

2025-06-11
talk
Akanksha Nagpal (Adobe) , Jianmei Ye (Adobe, Inc.)

Adobe’s Real-Time Customer Data Platform relies on the identity graph to connect over 70 billion identities and deliver personalized experiences. This session will showcase how the platform leverages Databricks, Spark Streaming and Delta Lake, along with 25+ Databricks deployments across multiple regions and clouds — Azure & AWS — to process terabytes of data daily and handle over a million records per second. The talk will highlight the platform’s ability to scale, demonstrating a 10x increase in ingestion pipeline capacity to accommodate peak traffic during events like the Super Bowl. Attendees will learn about the technical strategies employed, including migrating from Flink to Spark Streaming, optimizing data deduplication, and implementing robust monitoring and anomaly detection. Discover how these optimizations enable Adobe to deliver real-time identity resolution at scale while ensuring compliance and privacy.

Somebody Set Up Us the Bomb: Identifying List Bombing of End Users in an Email Anti-Spam Context

2025-06-11
lightning_talk
Doug Sibley (Cisco Talos)

Traditionally, spam emails are messages a user does not want, containing some kind of threat like phishing. Because of this, detection systems can focus on malicious content or sender behavior. List bombing upends this paradigm. By abusing public forms such as marketing signups, attackers can fill a user's inbox with high volumes of legitimate mail. These emails don't contain threats, and each sender is following best practices to confirm the recipient wants to be subscribed, but the net effect for an end user is their inbox being flooded with dozens of emails per minute. This talk covers the exploration and implementation of techniques for identifying this attack in our company's anti-spam telemetry: from reading and writing to Kafka, Delta table streaming for ETL workflows, multi-table liquid clustering design for efficient table joins, curating gold tables to speed up critical queries and using Delta tables as an auditable integration point for interacting with external services.
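A minimal version of the detection idea — flag recipients whose inbound rate of individually legitimate mail spikes — can be sketched in plain Python (the threshold, field names and one-minute window are invented for illustration; Talos's real features are far richer):

```python
from collections import Counter

def find_bombed_recipients(events, threshold=20):
    """Count messages per (recipient, minute) bucket and flag any
    recipient exceeding `threshold` messages within a single minute."""
    per_minute = Counter((e["rcpt"], e["ts"] // 60) for e in events)
    return {rcpt for (rcpt, _minute), n in per_minute.items() if n > threshold}

# 25 signup confirmations hit one inbox inside minute 0 — a bombing signature
events = [{"rcpt": "victim@example.com", "ts": t} for t in range(25)]
events += [{"rcpt": "normal@example.com", "ts": 5}]
flagged = find_bombed_recipients(events)
```

The interesting part, as the abstract notes, is that no single message is suspicious; only the aggregate rate per recipient reveals the attack, which is why it lends itself to streaming aggregation over telemetry.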

Metadata-Driven Streaming Ingestion Using Lakeflow Declarative Pipelines, Azure Event Hubs and a Schema Registry

2025-06-11
talk
Vicky Avison (Plexure)

At Plexure, we ingest hundreds of millions of customer activities and transactions into our data platform every day, fuelling our personalisation engine and providing insights into the effectiveness of marketing campaigns. We're on a journey to transition from infrequent batch ingestion to near real-time streaming using Azure Event Hubs and Lakeflow Declarative Pipelines. This transformation will allow us to react to customer behaviour as it happens, rather than hours or even days later. It also enables us to move faster in other ways. By leveraging a Schema Registry, we've created a metadata-driven framework that allows data producers to:

- Evolve schemas with confidence, ensuring downstream processes continue running smoothly.
- Seamlessly publish new datasets into the data platform without requiring Data Engineering assistance.

Join us to learn more about our journey and see how we're implementing this with Lakeflow Declarative Pipelines meta-programming - including a live demo of the end-to-end process!
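The "evolve schemas with confidence" guarantee typically rests on a registry-style compatibility check like the one sketched below (pure Python, modeled loosely on backward-compatibility rules; the field layout is an assumption, not Plexure's schema format):

```python
def is_backward_compatible(old, new):
    """A new schema may add or drop optional fields, but must keep every
    required field of the old schema with an unchanged type — so existing
    downstream consumers keep working."""
    for name, spec in old.items():
        if spec["required"]:
            if name not in new or new[name]["type"] != spec["type"]:
                return False
    return True

old = {"user_id": {"type": "string", "required": True},
       "coupon":  {"type": "string", "required": False}}
new_ok  = {**old, "channel": {"type": "string", "required": False}}  # additive change
new_bad = {"coupon": {"type": "string", "required": False}}          # drops user_id
```

A registry runs this check at publish time, so an incompatible producer change is rejected before it can break the streaming pipelines downstream.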

PDF Document Ingestion Accelerator for GenAI Applications

2025-06-11
talk
Qian Yu (Databricks)

Databricks Financial Services customers in the GenAI space have a common use case: ingesting and processing unstructured documents — PDFs and images — then performing downstream GenAI tasks such as entity extraction and RAG-based knowledge Q&A. The pain points for customers in these types of use cases are:

- The quality of the PDF/image documents varies, since many older physical documents were scanned into electronic form
- The complexity of the PDF/image documents varies, and many contain tables — images with embedded information — which require slower Tesseract OCR
- They would like to streamline postprocessing for downstream workloads

In this talk we will present an optimized structured streaming workflow for complex PDF ingestion. The key techniques include Apache Spark™ optimization, multi-threading, PDF object extraction, skew handling and auto-retry logic.

Reinvent Government in a Data Intelligence Era

2025-06-11
talk
Asim Qureshi (Databricks) , Ricky Arora (Met Council Environmental Services) , Eric Popowich (Databricks)

To dramatically transform the way citizen services are delivered, organizations must bring all data together — streaming, structured and unstructured — in a secure and governed platform.