Data Streaming

From Days to Seconds — Reducing Query Times on Large Geospatial Datasets by 99%

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Chris Crawford (Databricks) , Hobson Bryan (Global Water Security Center)

CI/CD Data Lakehouse Databricks Git Cyber Security Spark SQL

The Global Water Security Center translates environmental science into actionable insights for the U.S. Department of Defense. Prior to incorporating Databricks, responding to these requests required querying approximately five hundred thousand raster files representing over five hundred billion points. By leveraging lakehouse architecture, Databricks Auto Loader, Spark Streaming, Databricks Spatial SQL, H3 geospatial indexing and Databricks Liquid Clustering, we were able to drastically reduce our “time to analysis” from multiple business days to a matter of seconds. Now, our data scientists execute queries on pre-computed tables in Databricks, resulting in a “time to analysis” that is 99% faster, giving our teams more time for deeper analysis of the data. Additionally, we’ve incorporated Databricks Workflows, Databricks Asset Bundles, Git and Git Actions to support CI/CD across workspaces. We completed this work in close partnership with Databricks.

Real-Time Market Insights — Powering Optiver’s Live Trading Dashboard with Databricks Apps and Dash

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Huy Nguyen (Optiver)

Dashboard Databricks

In the fast-paced world of trading, real-time insights are critical for making informed decisions. This presentation explores how Optiver, a leading high-frequency trading firm, harnesses Databricks apps to power its live trading dashboards. The technology enables traders to analyze market data, detect patterns and respond instantly. In this talk, we will showcase how our system leverages Databricks’ scalable infrastructures such as Structured Streaming to efficiently handle vast streams of financial data while ensuring low-latency performance. In addition, we will show how the integration of Databricks apps with Dash has empowered traders to rapidly develop and deploy custom dashboards, minimizing dependency on developers. Attendees will gain insights into our architecture, data processing techniques and lessons learned in integrating Databricks apps with Dash in order to drive rapid, data-driven trading decisions.

SQL-First ETL: Building Easy, Efficient Data Pipelines With Lakeflow Declarative Pipelines

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Paul Lappas (Databricks) , Ritwik Yadav (Databricks) , Meixian Li (Databricks)

Databricks dbt ETL/ELT SQL

This session explores how SQL-based ETL can accelerate development, simplify maintenance and make data transformation more accessible to both engineers and analysts. We'll walk through how Databricks Lakeflow Declarative Pipelines and Databricks SQL warehouse support building production-grade pipelines using familiar SQL constructs.Topics include: Using streaming tables for real-time ingestion and processing Leveraging materialized views to deliver fast, pre-computed datasets Integrating with tools like dbt to manage batch and streaming workflows at scale By the end of the session, you’ll understand how SQL-first approaches can streamline ETL development and support both operational and analytical use cases.

Unifying Human-Curated Data Ingestion and Real-Time Updates with Databricks Lakeflow Declarative Pipelines, Protobuf and BSR

2025-06-10 · Data + AI Summit 2025

talk

by Dwight Whitlock (Clinician Nexus)

Data Governance Databricks Kafka Protobuf

Red Stapler is a streaming-native system on Databricks that merges file-based ingestion and real-time user edits into one Lakeflow Declarative Pipelines for near real-time feedback. Protobuf definitions, managed in the Buf Schema Registry (BSR), govern schema and data-quality rules, ensuring backward compatibility. All records — valid or not — are stored in an SCD Type 2 table, capturing every version for full history and immediate quarantine views of invalid data. This unified approach boosts data governance, simplifies auditing and streamlines error fixes.Running on Lakeflow Declarative Pipelines Serverless and the Kafka-compatible Bufstream keeps costs low by scaling down to zero when idle. Red Stapler’s configuration-driven Protobuf logic adapts easily to evolving survey definitions without risking production. The result is consistent validation, quick updates and a complete audit trail — all critical for trustworthy, flexible data pipelines.

Unlocking Streaming Power: How SEGA Wins With Lakeflow Declarative Pipelines

2025-06-10 · Data + AI Summit 2025

talk

by Felix Baker (SEGA Europe Limited) , Craig Porteous (Advancing Analytics)

Analytics Data Quality Microsoft

Streaming data is hard and costly — that's the default opinion, but it doesn’t have to be.In this session, discover how SEGA simplified complex streaming pipelines and turned them into a competitive edge. SEGA sees over 40,000 events per second. That's no easy task, but enabling personalised gaming experiences for over 50 million gamers drives a huge competitive advantage. If you’re wrestling with streaming challenges, this talk is your next checkpoint.We’ll unpack how Lakeflow Declarative Pipelines helped SEGA, from automated schema evolution and simple data quality management to seamless streaming reliability. Learn how Lakeflow Declarative Pipelines drives value by transforming chaos emeralds into clarity, delivering results for a global gaming powerhouse. We'll step through the architecture, approach and challenges we overcame.Join Craig Porteous, Microsoft MVP from Advancing Analytics, and Felix Baker, Head of Data Services at SEGA Europe, for a fast-paced, hands-on journey into Lakeflow Declarative Pipelines’ unique powers.

Why You Should Move to Lakeflow Declarative Pipelines Serverless

2025-06-10 · Data + AI Summit 2025 Watch

lightning_talk

by Nandini N (Databricks)

ETL/ELT

Lakeflow Declarative Pipelines Serverless offers a range of benefits that make it an attractive option for organizations looking to optimize their ETL (Extract, Transform, Load) processes.Key benefits of Lakeflow Declarative Pipelines Serverless: Automatic infrastructure management Unified batch and streaming Cost and performance optimization Simplified configuration Granular observability By moving to Lakeflow Declarative Pipelines Serverless, organizations can achieve faster, more reliable, and cost-effective data pipeline management, ultimately driving better business insights and outcomes.

A Unified Solution for Data Management and Model Training With Apache Iceberg and Mosaic Streaming

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Zilong Zhou (ByteDance)

Data Management Iceberg

This session introduces ByteDance’s challenges in data management and model training, and addresses them by Magnus (enhanced Apache Iceberg) and Byted Streaming (customized Mosaic Streaming). Magnus uses Iceberg’s branch/tag to manage massive datasets/checkpoints efficiently. With enhanced metadata and a custom C++ data reader, Magnus achieves optimal sharding, shuffling and data loading. Flexible table migration, detailed metrics and built-in full-text indexes on Iceberg tables further ensure training reliability. When training with ultra-large datasets, ByteDance faced scalability and performance issues. Given Streaming's scalability in distributed training and good code structure, the team chose and customized it to resolve challenges like slow startup, high resource consumption, and limited data source compatibility. In this session, we will explore Magnus and Byted Streaming, discuss their enhancements and demonstrate how they enable efficient and robust distributed training.

How Databricks Powers Real-Time Threat Detection at Barracuda XDR

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Alex Dangel (Barracuda Networks) , Merium Khalid (Barracuda Networks)

AI/ML Analytics CI/CD Cloud Computing Databricks Cyber Security Spark

As cybersecurity threats grow in volume and complexity, organizations must efficiently process security telemetry for best-in-class detection and mitigation. Barracuda’s XDR platform is redefining security operations by layering advanced detection methodologies over a broad range of supported technologies. Our vision is to deliver unparalleled protection through automation, machine learning and scalable detection frameworks, ensuring threats are identified and mitigated quickly. To achieve this, we have adopted Databricks as the foundation of our security analytics platform, providing greater control and flexibility while decoupling from traditional SIEM tools. By leveraging Lakeflow Declarative Pipelines, Spark Structured Streaming and detection-as-code CI/CD pipelines, we have built a real-time detection engine that enhances scalability, accuracy and cost efficiency. This session explores how Databricks is shaping the future of XDR through real-time analytics and cloud-native security.

Spark 4.0 and Delta 4.0 For Streaming Data

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Bryce Bartmann (Shell)

AI/ML Delta JSON Python Spark

Real-time data is one of the most important datasets for any Data and AI Platform across any industry. Spark 4.0 and Delta 4.0 include new features that make ingestion and querying of real-time data better than ever before. Features such as: Python custom data sources for simple ingestion of streaming and batch time series data sources using Spark Variant types for managing variable data types and json payloads that are common in the real time domain Delta liquid clustering for simple data clustering without the overhead or complexity of partitioning In this presentation you will learn how data teams can leverage these latest features to build industry-leading, real-time data products using Spark and Delta and includes real world examples and metrics of the improvements they make in performance and processing of data in the real time space.

Harnessing Real-Time Data and AI for Retail Innovation

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Lorenz Verzosa (Databricks) , Tristen Wentling (Databricks)

AI/ML Analytics Databricks GenAI

This talk explores using advanced data processing and generative AI techniques to revolutionize the retail industry. Using Databricks, we will discuss how cutting-edge technologies enable real-time data analysis and machine learning applications, creating a powerful ecosystem for large-scale, data-driven retail solutions. Attendees will gain insights into architecting scalable data pipelines for retail operations and implementing advanced analytics on streaming customer data. Discover how these integrated technologies drive innovation in retail, enhancing customer experiences, streamlining operations and enabling data-driven decision-making. Learn how retailers can leverage these tools to gain a competitive edge in the rapidly evolving digital marketplace, ultimately driving growth and adaptability in the face of changing consumer behaviors and market dynamics.

Kafka Forwarder: Simplifying Kafka Consumption at OpenAI

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Jigar Bhati (Open AI)

Databricks Kafka LLM

At OpenAI, Kafka fuels real-time data streaming at massive scale, but traditional consumers struggle under the burden of partition management, offset tracking, error handling, retries, Dead Letter Queues (DLQ), and dynamic scaling — all while racing to maintain ultra-high throughput. As deployments scale, complexity multiplies. Enter Kafka Forwarder — a game-changing Kafka Consumer Proxy that flips the script on traditional Kafka consumption. By offloading client-side complexity and pushing messages to consumers, it ensures at-least-once delivery, automated retries, and seamless DLQ management via Databricks. The result? Scalable, reliable and effortless Kafka consumption that lets teams focus on what truly matters. Curious how OpenAI simplified self-service, high-scale Kafka consumption? Join us as we walk through the motivation, architecture and challenges behind Kafka Forwarder, and share how we structured the pipeline to seamlessly route DLQ data into Databricks for analysis.

A Comprehensive Guide to Streaming on the Data Intelligence Platform

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Indrajit Roy (Databricks) , Ray Zhu (Databricks)

Spark

This session is repeated.Is stream processing the future? We think so — and we’re building it with you using the latest capabilities in Apache Spark™ Structured Streaming. If you're a power user, this session is for you: we’ll demo new advanced features, from state transformations to real-time mode. If you prefer simplicity, this session is also for you: we’ll show how Lakeflow Declarative Pipelines simplifies managing streaming pipelines. And if you’re somewhere in between, we’ve got you covered — we’ll explain when to use your own streaming jobs versus Lakeflow Declarative Pipelines.

Building Real-Time Sport Model Insights with Spark Structured Streaming

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Aaron Hope (Draftkings) , Ethan Summers (Draftkings)

Databricks Kafka Spark

In the dynamic world of sports betting, precision and adaptability are key. Sports traders must navigate risk management, limitations of data feeds, and much more to prevent small model miscalculations from causing significant losses. To ensure accurate real-time pricing of hundreds of interdependent markets, traders provide key inputs such as player skill-level adjustments, whilst maintaining precise correlations. Black-box models aren’t enough— constant feedback loops drive informed, accurate decisions. Join DraftKings as we showcase how we expose real-time metrics from our simulation engine, to empower traders with deeper insights into how their inputs shape the model. Using Spark Structured Streaming, Kafka, and Databricks dashboards, we transform raw simulation outputs into actionable data. This transparency into our engines enables fine-grained control over pricing― leading to more accurate odds, a more efficient sportsbook, and an elevated customer experience.

SQL-Based ETL: Options for SQL-Only Databricks Development

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Dustin Vannoy (Databricks)

Analytics Databricks dbt Delta ETL/ELT Git SQL SQLMesh

Using SQL for data transformation is a powerful way for an analytics team to create their own data pipelines. However, relying on SQL often comes with tradeoffs such as limited functionality, hard-to-maintain stored procedures or skipping best practices like version control and data tests. Databricks supports building high-performing SQL ETL workloads. Attend this session to hear how Databricks supports SQL for data transformation jobs as a core part of your Data Intelligence Platform. In this session we will cover 4 options to use Databricks with SQL syntax to create Delta tables: Lakeflow Declarative Pipelines: A declarative ETL option to simplify batch and streaming pipelines dbt: An open-source framework to apply engineering best practices to SQL based data transformations SQLMesh: an open-core product to easily build high-quality and high-performance data pipelines SQL notebooks jobs: a combination of Databricks Workflows and parameterized SQL notebooks

Unlock Your Use Cases: A Deep Dive on Structured Streaming’s New TransformWithState API

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Angela Chu (Databricks) , Anish Shrigondekar (Databricks)

API Spark

Don’t you just hate telling your customers “No”? “No, I can’t get you the data that quickly”, or “No that logic isn’t possible to implement” really aren’t fun to say. But what if you had a tool that would allow you to implement those use cases? What if it was in a technology you were already familiar with — say, Spark Structured Streaming? There is a brand new arbitrary stateful operations API called TransformWithState, and after attending this deep dive you won’t have to say “No” anymore. During this presentation we’ll go through some real-world use cases and build them step-by-step. Everything from state variables, process vs. event time, watermarks, timers, state TTL, and even how you can initialize state with the checkpoint of another stream. Unlock your use cases with the power of Structured Streaming’s TransformWithState!

How an Open, Scalable and Secure Data Platform is Powering Quick Commerce Swiggy's AI

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Vasan Vembu Srini (Databricks) , Akash Agarwal (Swiggy)

AI/ML Analytics Flink Data Lakehouse Databricks Delta GenAI Kafka Spark

Swiggy, India's leading quick commerce platform, serves ~13 million users across 653 cities, with 196,000 restaurant partners and 17,000 SKUs. To handle this scale, Swiggy developed a secure, scalable AI platform processing millions of predictions per second. The tech stack includes Apache Kafka for real-time streaming, Apache Spark on Databricks for analytics and ML, and Apache Flink for stream processing. The Lakehouse architecture on Delta ensures data reliability, while Unity Catalog enables centralized access control and auditing. These technologies power critical AI applications like demand forecasting, route optimization, personalized recommendations, predictive delivery SLAs, and generative AI use cases.Key Takeaway:This session explores building a data platform at scale, focusing on cost efficiency, simplicity, and speed, empowering Swiggy to seamlessly support millions of users and AI use cases.

Simplifying Data Pipelines With Lakeflow Declarative Pipelines: A Beginner’s Guide

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Matt Jones (Databricks) , Brad Turnbaugh (84.51)

AI/ML Analytics Data Engineering ETL/ELT Kafka SQL

As part of the new Lakeflow data engineering experience, Lakeflow Declarative Pipelines makes it easy to build and manage reliable data pipelines. It unifies batch and streaming, reduces operational complexity and ensures dependable data delivery at scale — from batch ETL to real-time processing.Lakeflow Declarative Pipelines excels at declarative change data capture, batch and streaming workloads, and efficient SQL-based pipelines. In this session, you’ll learn how we’ve reimagined data pipelining with Lakeflow Declarative Pipelines, including: A brand new pipeline editor that simplifies transformations Serverless compute modes to optimize for performance or cost Full Unity Catalog integration for governance and lineage Reading/writing data with Kafka and custom sources Monitoring and observability for operational excellence “Real-time Mode” for ultra-low-latency streaming Join us to see how Lakeflow Declarative Pipelines powers better analytics and AI with reliable, unified pipelines.

Elevating Data Quality Standards With Databricks DQX

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Marcin Wojtyczka (Databricks) , Neha Milak (Databricks)

Data Quality Databricks PySpark Python

Join us for an introductory session on Databricks DQX, a Python-based framework designed to validate the quality of PySpark DataFrames. Discover how DQX can empower you to proactively tackle data quality challenges, enhance pipeline reliability and make more informed business decisions with confidence. Traditional data quality tools often fall short by providing limited, actionable insights, relying heavily on post-factum monitoring, and being restricted to batch processing. DQX overcomes these limitations by enabling real-time quality checks at the point of data entry, supporting both batch and streaming data validation and delivering granular insights at the row and column level. If you’re seeking a simple yet powerful data quality framework that integrates seamlessly with Databricks, this session is for you.

The Hitchhiker's Guide to Delta Lake Streaming in an Agentic Universe

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Scott Haines (Databricks)

AI/ML Data Engineering Delta LLM

As data engineering continues to evolve the shift from batch-oriented to streaming-first has become standard across the enterprise. The reality is these changes have been taking shape for the past decade — we just now also happen to be standing on the precipice of true disruption through automation, the likes of which we could only dream about before. Yes, AI Agents and LLMs are already a large part of our daily lives, but we (as data engineers) are ultimately on the frontlines ensuring that the future of AI is powered by consistent, just-in-time data — and Delta Lake is critical to help us get there. This session will provide you with best practices learned the hard way by one of the authors of The Delta Lake Definitive Guide including: Guide to writing generic applications as components Workflow automation tips and tricks Tips and tricks for Delta clustering (liquid, z-order, and classic) Future facing: Leveraging metadata for agentic pipelines and workflow automation

Ursa: Augment Your Lakehouse With Kafka-Compatible Data Streaming Capabilities

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Gaurav Saxena (Automotive Industry) , Sijie Guo (StreamNative)

AI/ML API Data Lakehouse Delta GenAI Iceberg Kafka

As data architectures evolve to meet the demands of real-time GenAI applications, organizations increasingly need systems that unify streaming and batch processing while maintaining compatibility with existing tools. The Ursa Engine offers a Kafka-API-compatible data streaming engine built on Lakehouse (Iceberg and Delta Lake). Designed to seamlessly integrate with data lakehouse architectures, Ursa extends your lakehouse capabilities by enabling streaming ingestion, transformation and processing — using a Kafka-compatible interface. In this session, we will explore how Ursa Engine augments your existing lakehouses with Kafka-compatible capabilities. Attendees will gain insights into Ursa Engine architecture and real-world use cases of Ursa Engine. Whether you're modernizing legacy systems or building cutting-edge AI-driven applications, discover how Ursa can help you unlock the full potential of your data.

talk-data.com

Activity Trend

Top Events

Top Speakers

From Days to Seconds — Reducing Query Times on Large Geospatial Datasets by 99%

Real-Time Market Insights — Powering Optiver’s Live Trading Dashboard with Databricks Apps and Dash

SQL-First ETL: Building Easy, Efficient Data Pipelines With Lakeflow Declarative Pipelines

Unifying Human-Curated Data Ingestion and Real-Time Updates with Databricks Lakeflow Declarative Pipelines, Protobuf and BSR

Unlocking Streaming Power: How SEGA Wins With Lakeflow Declarative Pipelines

Why You Should Move to Lakeflow Declarative Pipelines Serverless

A Unified Solution for Data Management and Model Training With Apache Iceberg and Mosaic Streaming

How Databricks Powers Real-Time Threat Detection at Barracuda XDR

Spark 4.0 and Delta 4.0 For Streaming Data

Harnessing Real-Time Data and AI for Retail Innovation

Kafka Forwarder: Simplifying Kafka Consumption at OpenAI

A Comprehensive Guide to Streaming on the Data Intelligence Platform

Building Real-Time Sport Model Insights with Spark Structured Streaming

SQL-Based ETL: Options for SQL-Only Databricks Development

Unlock Your Use Cases: A Deep Dive on Structured Streaming’s New TransformWithState API

How an Open, Scalable and Secure Data Platform is Powering Quick Commerce Swiggy's AI

Simplifying Data Pipelines With Lakeflow Declarative Pipelines: A Beginner’s Guide

Elevating Data Quality Standards With Databricks DQX

The Hitchhiker's Guide to Delta Lake Streaming in an Agentic Universe

Ursa: Augment Your Lakehouse With Kafka-Compatible Data Streaming Capabilities