talk-data.com

Event

Data + AI Summit 2025

2025-06-09 – 2025-06-13 · Databricks Summit

Activities tracked

64

Filtering by: Data Streaming

Sessions & talks

Showing 26–50 of 64 · Newest first

Transforming Bio-Pharma Manufacturing: Eli Lilly's Data-Driven Journey With Databricks

2025-06-11 Watch
talk
Abhijay Datta (Tredence), Saunak Debroy (Eli Lilly), Wilfred Mascarenhas (Eli Lilly and Company)

Eli Lilly and Company, a leading bio-pharma company, is revolutionizing manufacturing with next-gen, fully digital sites. Lilly and Tredence have partnered to establish a Databricks-powered Global Manufacturing Data Fabric (GMDF), laying the groundwork for transformative data products used by various personas at sites and globally. By integrating data from various manufacturing systems into a unified data model, GMDF has delivered actionable insights across use cases such as batch release by exception, predictive maintenance, anomaly detection and process optimization. Our serverless architecture leverages Databricks Auto Loader for real-time data streaming, PySpark for automation and Unity Catalog for governance, ensuring seamless data processing and optimization. This platform is the foundation for data-driven processes, self-service analytics, AI and more. This session will detail the data architecture and strategy and share a few of the use cases delivered.
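
For readers new to the ingestion pattern the abstract mentions, a minimal Auto Loader sketch in PySpark looks like this. The paths, table name and source format are hypothetical, not Lilly's actual pipeline, and `spark` is the ambient session a Databricks notebook provides:

```python
# Incrementally pick up new files from a landing zone with Auto Loader
raw = (
    spark.readStream.format("cloudFiles")                      # Auto Loader source
    .option("cloudFiles.format", "json")                       # assumed source format
    .option("cloudFiles.schemaLocation", "/Volumes/gmdf/_schemas/events")
    .load("/Volumes/gmdf/landing/events")                      # hypothetical landing zone
)

# Land the stream in a Unity Catalog-governed bronze table
(
    raw.writeStream
    .option("checkpointLocation", "/Volumes/gmdf/_checkpoints/events")
    .trigger(availableNow=True)                                # incremental, serverless-friendly runs
    .toTable("gmdf.bronze.events")
)
```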

Breaking Barriers: Building Custom Spark 4.0 Data Connectors with Python

2025-06-11 Watch
talk
Sourav Gulati (Databricks), Ashish Saraswat (Databricks)

Building a custom Spark data source connector once required Java or Scala expertise, making it complex and limiting. As a result, many proprietary data sources without public SDKs remained disconnected from Spark, and data sources with only Python SDKs couldn't harness Spark’s distributed power. Spark 4.0 changes this with a new Python API for data source connectors, allowing developers to build fully functional connectors without Java or Scala. This unlocks new possibilities, from integrating proprietary systems to leveraging untapped data sources. Supporting both batch and streaming, this API makes data ingestion more flexible than ever. In this talk, we’ll demonstrate how to build a Spark connector for Excel using Python, showcasing schema inference, data reads/writes and streaming support. Whether you're a data engineer or Spark enthusiast, you’ll gain the knowledge to integrate Spark with any data source — entirely in Python.
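
The skeleton of such a connector, using the documented Spark 4.0 Python Data Source API, looks roughly like this. This is a sketch, not the speakers' code: the placeholder rows stand in for real Excel parsing (which would use a library such as openpyxl), and it assumes pyspark 4.0+:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType

class ExcelDataSource(DataSource):
    """A Python-only Spark connector skeleton."""

    @classmethod
    def name(cls):
        return "excel"                      # format name used in spark.read.format(...)

    def schema(self):
        return "sheet string, rownum int, value string"   # static schema for the sketch

    def reader(self, schema: StructType):
        return ExcelReader(self.options)

class ExcelReader(DataSourceReader):
    def __init__(self, options):
        self.path = options.get("path")

    def read(self, partition):
        # A real connector would open self.path and yield its cells;
        # placeholder rows keep the sketch self-contained.
        yield ("Sheet1", 0, "hello")
        yield ("Sheet1", 1, "world")

spark.dataSource.register(ExcelDataSource)
df = spark.read.format("excel").option("path", "report.xlsx").load()
```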

How Blue Origin Accelerates Innovation With Databricks and AWS GovCloud

2025-06-11 Watch
talk
Seths Sethuraman (Blue Origin), Filippo Seracini (Databricks)

Blue Origin is revolutionizing space exploration with a mission-critical data strategy powered by Databricks on AWS GovCloud. Learn how they leverage Databricks to meet ITAR and FedRAMP High compliance, streamline manufacturing and accelerate their vision of a 24/7 factory. Key use cases include predictive maintenance, real-time IoT insights and AI-driven tools that transform CAD designs into factory instructions. Discover how Delta Lake, Structured Streaming and advanced Databricks functionalities like Unity Catalog enable real-time analytics and future-ready infrastructure, helping Blue Origin stay ahead in the race to adopt generative AI and serverless solutions.

Inscape Smart TV Data: Unlocking Consumption and Competitive Intelligence

2025-06-11 Watch
lightning_talk
Rich Guinness (Vizio Inscape)

With VIZIO's Inscape viewership data now available in the Databricks Marketplace, our expansive dataset has never been easier to access. With real-time availability, flexible integrations and secure, governed sharing, it's built for action. Join our team as we explore the full depth of this comprehensive data across both linear and streaming TV, showcasing real-world use cases like measuring the incremental reach of streaming or matching to 1st/3rd-party data for ROI analyses. We will also walk through a share-of-voice competitive-intelligence analysis and the steps to reproduce it. This session will show you how to turn Inscape data into a strategic advantage.

Franchise IP and Data Governance at Krafton: Driving Cost Efficiency and Scalability

2025-06-11 Watch
lightning_talk
Hwaeium Yeom (KRAFTON)

Join us as we explore how KRAFTON optimized data governance for the PUBG IP, enhancing cost efficiency and scalability. KRAFTON operates a massive data ecosystem, processing tens of terabytes daily. As real-time analytics demands increased, traditional batch-based processing faced scalability challenges. To address this, we redesigned data pipelines and governance models, improving performance while reducing costs:
- Transitioned to real-time pipelines (batch to streaming)
- Optimized workload management (reducing all-purpose clusters, increasing Jobs usage)
- Cut costs by tens of thousands of dollars monthly (up to 75%)
- Enhanced data storage efficiency (lower S3 costs, Delta Tables)
- Improved pipeline stability (Medallion Architecture)
Gain insights into how KRAFTON scaled data operations, leveraging real-time analytics and cost optimization for high-traffic games. Learn more: https://www.databricks.com/customers/krafton

Hands-on Learning: AI-Powered Data Engineering with Lakeflow: Techniques for Modern Data Professionals

2025-06-11
talk
Frank Munz (Databricks)

This introductory workshop caters to data engineers seeking hands-on experience and data architects looking to deepen their knowledge. The workshop is structured to provide a solid understanding of the following data engineering and streaming concepts:
- Introduction to Lakeflow and the Data Intelligence Platform
- Getting started with Lakeflow Declarative Pipelines for declarative data pipelines in SQL using Streaming Tables and Materialized Views
- Mastering Databricks Workflows with advanced control flow and triggers
- Understanding serverless compute
- Data governance and lineage with Unity Catalog
- Generative AI for data engineers: Genie and Databricks Assistant
We believe you can only become an expert if you work on real problems and gain hands-on experience. Therefore, we will equip you with your own lab environment in this workshop and guide you through practical exercises like using GitHub, ingesting data from various sources, creating batch and streaming data pipelines, and more.
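
As a taste of the declarative style the workshop teaches, here is a minimal pipeline sketch using the `dlt` Python module that Lakeflow Declarative Pipelines accepts on Databricks. The source path and table names are invented, and `spark` is provided by the pipeline runtime:

```python
import dlt

@dlt.table(comment="Streaming table: incremental ingest via Auto Loader (hypothetical path)")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/demo/landing/orders")
    )

@dlt.table(comment="Materialized view: aggregation the pipeline keeps up to date")
def orders_by_country():
    return dlt.read("orders_bronze").groupBy("country").count()
```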

Scaling Real-Time Fraud Detection With Databricks: Lessons From DraftKings

2025-06-11 Watch
talk
Greg Von Pless (DraftKings), Monika Hristova (DraftKings)

At DraftKings, ensuring secure, fair gaming requires detecting fraud in real time with both speed and precision. In this talk, we’ll share how Databricks powers our fraud detection pipeline, integrating real-time streaming, machine learning and rule-based detection within a PySpark framework. Our system enables rapid model training, real-time inference and seamless feature transformation across historical and live data. We use shadow mode to test models and rules in live environments before deployment. Collaborating with Databricks, we push online feature store performance and enhance real-time PySpark capabilities. We'll cover PySpark-based feature transformations, real-time inference, scaling challenges and our migration from a homegrown system to Databricks. This session is for data engineers and ML practitioners optimizing real-time AI workloads, featuring a deep dive, code snippets and lessons from building and scaling fraud detection.
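
A rough sketch of the streaming-inference pattern such a pipeline might use, with MLflow's documented `spark_udf` helper; the model URI, table and feature names are invented, not DraftKings' actual system:

```python
import mlflow.pyfunc
from pyspark.sql import functions as F

# Load a registered model as a vectorized UDF (hypothetical URI)
score = mlflow.pyfunc.spark_udf(spark, "models:/fraud_detector/5")

scored = (
    spark.readStream.table("fraud.silver.enriched_events")     # assumed source table
    .withColumn(
        "fraud_score",
        score(F.struct("amount", "velocity_1h", "account_age_days")),
    )
)

# "Shadow mode": persist scores next to rule outcomes for offline comparison,
# without acting on them in production
(
    scored.writeStream
    .option("checkpointLocation", "/Volumes/fraud/_chk/shadow")
    .toTable("fraud.gold.shadow_scores")
)
```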

Sponsored by: AWS | Ripple: Well-Architected Data & AI Platforms - AWS and Databricks in Harmony

2025-06-11 Watch
talk
Priyanka Adhia (Ripple), Hari Rajendran (Ripple), Rudy Chetty (AWS)

Join us as we explore the well-architected framework for modern data lakehouse architecture, where AWS's comprehensive data, AI, and infrastructure capabilities align with Databricks' unified platform approach. Building upon core principles of Operational Excellence, Security, Reliability, Performance, and Cost Optimization, we'll demonstrate how Data and AI Governance alongside Interoperability and Usability enable organizations to build robust, scalable platforms. Learn how Ripple modernized its data infrastructure by migrating from a legacy Hadoop system to a scalable, real-time analytics platform using Databricks on AWS. This session covers the challenges of high operational costs, latency, and peak-time bottlenecks—and how Ripple achieved 80% cost savings and 55% performance improvements with Photon, Graviton, Delta Lake, and Structured Streaming.

Delta and Databricks as a Performant Exabyte-Scale Application Backend

2025-06-11 Watch
lightning_talk
Scott Schenkein (Capital One Financial)

The Delta Lake architecture promises to provide a single, highly functional, high-scale copy of data that can be leveraged by a variety of tools to satisfy a broad range of use cases. To date, most use cases have focused on interactive data warehousing, ETL, model training and streaming. Real-time access is generally delegated to costly and sometimes difficult-to-scale NoSQL, indexed storage and domain-specific specialty solutions, which provide limited functionality compared to Spark on Delta Lake. In this session, we will explore the Delta data-skipping and optimization model and discuss how Capital One leveraged it along with Databricks Photon and Spark Connect to implement a real-time web application backend. We’ll share how we built a highly functional, performant and cost-effective security information and event management user experience (SIEM UX).
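
The shape of the idea: a thin application backend holds a lightweight Spark Connect session, and selective predicates let Delta data skipping prune files by min/max statistics so only a sliver of a huge table is read. The endpoint, token and table names below are placeholders, not Capital One's system:

```python
from pyspark.sql import SparkSession

# Remote session over Spark Connect (placeholder host and token)
spark = (
    SparkSession.builder
    .remote("sc://dbc-example.cloud.databricks.com:443/;token=<PAT>")
    .getOrCreate()
)

# Selective predicates enable file-level data skipping on the Delta table
hits = (
    spark.table("siem.gold.events")
    .where("event_date = '2025-06-01' AND src_ip = '10.1.2.3'")
    .limit(100)
)
hits.show()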

Real-Time Analytics Pipeline for IoT Device Monitoring and Reporting

2025-06-11 Watch
talk
Nayan Sharma (CKDelta), Padraic Kirrane (CK Delta)

This session will show how we implemented a solution to support high-frequency data ingestion from smart meters. We implemented a robust API endpoint that interfaces directly with IoT devices. This API processes messages in real time from millions of distributed IoT devices and meters across the network. The architecture leverages cloud storage as a landing zone for the raw data, followed by a streaming pipeline built on Lakeflow Declarative Pipelines. This pipeline implements a multi-layer medallion architecture to progressively clean, transform and enrich the data. The pipeline operates continuously to maintain near real-time data freshness in our gold layer tables. These datasets connect directly to Databricks Dashboards, providing stakeholders with immediate insights into their operational metrics. This solution demonstrates how modern data architecture can handle high-volume IoT data streams while maintaining data quality and providing accessible real-time analytics for business users.
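
To make the continuous silver-to-gold step concrete, here is a hedged Structured Streaming sketch of a per-device rollup with a watermark; the table names, columns and intervals are invented, not CKDelta's pipeline:

```python
from pyspark.sql import functions as F

readings = spark.readStream.table("iot.silver.meter_readings")  # assumed silver table

gold = (
    readings
    .withWatermark("event_time", "10 minutes")                  # bound state for late data
    .groupBy(F.window("event_time", "5 minutes"), "device_id")
    .agg(F.avg("kwh").alias("avg_kwh"), F.count("*").alias("msg_count"))
)

(
    gold.writeStream
    .outputMode("append")   # watermark allows append-mode windowed aggregates
    .option("checkpointLocation", "/Volumes/iot/_chk/device_health")
    .toTable("iot.gold.device_health_5m")
)
```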

Building Real-time Trading Dashboards with Lakeflow Declarative Pipelines, Serverless OLTP and Databricks Apps

2025-06-10
talk
Matt Slack (Databricks), Matthew Moorcroft (Databricks)

Barclays' Post Trade real-time trade monitoring platform was historically built on a complex set of legacy technologies including Java, Solace and custom microservices. This session will demonstrate how the power of Lakeflow Declarative Pipelines' new real-time mode, in conjunction with the foreach_batch_sink, can enable simple, cost-effective streaming pipelines that load high volumes of data into Databricks' new Serverless OLTP database with very low latency. Once in the OLTP database, the data can be used to update real-time trading dashboards, securely hosted in Databricks Apps, with the latest stock trades, enabling better, more responsive decision-making and alerting. The session will walk through the architecture and demonstrate how simple it is to create and manage the pipelines and apps within the Databricks environment.
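
For orientation, the classic Structured Streaming `foreachBatch` pattern, which we assume the Lakeflow foreach_batch_sink wraps declaratively, drains each micro-batch into an OLTP store. Everything here (the JDBC endpoint, credentials and table names) is a placeholder:

```python
def upsert_to_oltp(batch_df, batch_id):
    # Write each micro-batch to a relational table over JDBC (placeholder endpoint)
    (
        batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://<host>:5432/trades")
        .option("dbtable", "latest_trades")
        .option("user", "<user>").option("password", "<password>")
        .mode("append")
        .save()
    )

(
    spark.readStream.table("posttrade.silver.trades")   # assumed governed source
    .writeStream
    .foreachBatch(upsert_to_oltp)
    .option("checkpointLocation", "/Volumes/posttrade/_chk/oltp_sink")
    .start()
)
```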

Real-Time Mode Technical Deep Dive: How We Built Sub-300 Millisecond Streaming Into Apache Spark™

2025-06-10 Watch
talk
Siying Dong (Databricks), Jerry Peng (Databricks)

Real-time mode is a new low-latency execution mode for Apache Spark™ Structured Streaming. It can consistently provide p99 latencies less than 300 milliseconds for a broad set of stateless and stateful streaming queries. Our talk focuses on the technical aspects of making this possible in Spark. We’ll dive into the core architecture that enables these dramatic latency improvements, including a concurrent stage scheduler and a non-blocking shuffle. We’ll explore how we maintained Spark’s fault-tolerance guarantees, and we’ll also share specific optimizations we made to our streaming SQL operators. These architectural improvements have already enabled Databricks customers to build workloads with latencies up to 10x lower than before. Early adopters in our Private Preview have successfully implemented real-time enrichment pipelines and feature engineering for machine learning — use cases that were previously impossible at these latencies.

Streaming Meets Governance: Building AI-Ready Tables With Confluent Tableflow and Unity Catalog

2025-06-10 Watch
talk
Kasun Indrasiri Gamage (Confluent), Victoria Bukta (Databricks)

Learn how Databricks and Confluent are simplifying the path from real-time data to governed, analytics- and AI-ready tables. This session will cover how Confluent Tableflow automatically materializes Kafka topics into Delta tables and registers them with Unity Catalog — eliminating the need for custom streaming pipelines. We’ll walk through how this integration helps data engineers reduce ingestion complexity, enforce data governance and make real-time data immediately usable for analytics and AI.

Data Intelligence for Cybersecurity Forum: Insights From SAP, Anvilogic, Capital One, and Wiz

2025-06-10 Watch
talk
Jiong Liu (Wiz), Hemanth Varma Kusampudi (SAP), Anil Chamarthy (Capital One), Mackenzie Kyle (Anvilogic)

Join cybersecurity leaders from SAP, Anvilogic, Capital One, Wiz, and Databricks to explore how modern data intelligence is transforming security operations. Discover how SAP adopted a modular, AI-powered detection engineering lifecycle using Anvilogic on Databricks. Learn how Capital One built a detection and correlation engine leveraging Delta Lake, Apache Spark Streaming, and Databricks to process millions of cybersecurity events per second. Finally, see how Wiz and Databricks’ partnership enhances cloud security with seamless threat visibility. Through expert insights and live demos, gain strategies to build scalable, efficient cybersecurity powered by data and AI.

Sponsored by: Domo | Orchestrating Fleet Intelligence with AI Agents and Real-Time IoT With Databricks + DOMO

2025-06-10 Watch
lightning_talk
Eddie Edgeworth (Koantek)

In today’s logistics landscape, operational continuity depends on real-time awareness and proactive decision-making. This session presents an AI-agent-driven solution built on Databricks that transforms real-time fleet IoT data into autonomous workflows. Streaming telemetry such as bearing vibration data is ingested and analyzed using FFT to detect anomalies. When a critical pattern is found, an AI agent diagnoses root causes and simulates asset behavior as a digital twin, factoring in geolocation, routing and context. The agent then generates a corrective strategy by identifying service sites, skilled personnel and parts, estimating repair time and orchestrating reroutes. It evaluates alternate delivery vehicles and creates transfer plans for critical shipments. The system features human-AI collaboration, enabling teams to review and execute plans. Learn how this architecture reduces downtime and drives resilient, adaptive fleet management.
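
A toy version of the FFT check described above: flag a bearing when spectral energy in a fault-frequency band dominates the signal. The signal, sample rate, band and threshold are all invented for illustration:

```python
import numpy as np

def fault_band_energy(signal: np.ndarray, sample_rate_hz: float,
                      band=(140.0, 160.0)) -> float:
    """Fraction of spectral energy inside the assumed bearing-fault band."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return float(spectrum[mask].sum() / spectrum.sum())

# Synthetic vibration: a 50 Hz baseline plus a 150 Hz "fault" tone
t = np.linspace(0, 1, 4096, endpoint=False)
vibration = np.sin(2 * np.pi * 50 * t) + 0.8 * np.sin(2 * np.pi * 150 * t)

if fault_band_energy(vibration, sample_rate_hz=4096) > 0.2:
    print("anomaly: fault band dominates; hand off to the diagnostic agent")
```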

From Days to Seconds — Reducing Query Times on Large Geospatial Datasets by 99%

2025-06-10 Watch
talk
Chris Crawford (Databricks), Hobson Bryan (Global Water Security Center)

The Global Water Security Center translates environmental science into actionable insights for the U.S. Department of Defense. Prior to incorporating Databricks, responding to these requests required querying approximately five hundred thousand raster files representing over five hundred billion points. By leveraging lakehouse architecture, Databricks Auto Loader, Spark Streaming, Databricks Spatial SQL, H3 geospatial indexing and Databricks Liquid Clustering, we were able to drastically reduce our “time to analysis” from multiple business days to a matter of seconds. Now, our data scientists execute queries on pre-computed tables in Databricks, resulting in a “time to analysis” that is 99% faster, giving our teams more time for deeper analysis of the data. Additionally, we’ve incorporated Databricks Workflows, Databricks Asset Bundles, Git and GitHub Actions to support CI/CD across workspaces. We completed this work in close partnership with Databricks.
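
A hedged sketch of the indexing step: bucket points into H3 cells with Databricks' `h3_longlatash3` SQL function and store them in a Liquid-Clustered table so point-in-area queries prune to a handful of files. Resolution, schemas and table names are assumptions:

```python
# Build a pre-computed, H3-indexed table with Liquid Clustering
spark.sql("""
  CREATE TABLE IF NOT EXISTS geo.gold.raster_points_h3
  CLUSTER BY (h3_cell)
  AS SELECT
       h3_longlatash3(lon, lat, 7) AS h3_cell,   -- resolution 7 is an assumption
       value,
       observed_at
  FROM geo.silver.raster_points
""")

# Queries on a single cell now skip almost all files
spark.sql("""
  SELECT avg(value)
  FROM geo.gold.raster_points_h3
  WHERE h3_cell = h3_longlatash3(-87.65, 41.88, 7)
""").show()
```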

Real-Time Market Insights — Powering Optiver’s Live Trading Dashboard with Databricks Apps and Dash

2025-06-10 Watch
talk
Huy Nguyen (Optiver)

In the fast-paced world of trading, real-time insights are critical for making informed decisions. This presentation explores how Optiver, a leading high-frequency trading firm, harnesses Databricks Apps to power its live trading dashboards. The technology enables traders to analyze market data, detect patterns and respond instantly. In this talk, we will showcase how our system leverages Databricks’ scalable infrastructure, such as Structured Streaming, to efficiently handle vast streams of financial data while ensuring low-latency performance. In addition, we will show how the integration of Databricks Apps with Dash has empowered traders to rapidly develop and deploy custom dashboards, minimizing dependency on developers. Attendees will gain insights into our architecture, data processing techniques and lessons learned in integrating Databricks Apps with Dash in order to drive rapid, data-driven trading decisions.
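
The minimal shape of a Dash app like the ones Databricks Apps can host: a polled graph backed by a gold table. The query, columns and refresh interval are placeholders, not Optiver's dashboard, and `spark` is assumed to be available to the app:

```python
from dash import Dash, dcc, html, Input, Output
import plotly.express as px

app = Dash(__name__)
app.layout = html.Div([
    dcc.Graph(id="prices"),
    dcc.Interval(id="tick", interval=1000),   # poll every second
])

@app.callback(Output("prices", "figure"), Input("tick", "n_intervals"))
def refresh(_):
    # Re-read the (hypothetical) gold table on each tick
    pdf = spark.table("markets.gold.latest_quotes").toPandas()
    return px.line(pdf, x="ts", y="mid_price", color="symbol")

if __name__ == "__main__":
    app.run(debug=False)
```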

SQL-First ETL: Building Easy, Efficient Data Pipelines With Lakeflow Declarative Pipelines

2025-06-10 Watch
talk
Paul Lappas (Databricks), Ritwik Yadav (Databricks), Meixian Li (Databricks)

This session explores how SQL-based ETL can accelerate development, simplify maintenance and make data transformation more accessible to both engineers and analysts. We'll walk through how Databricks Lakeflow Declarative Pipelines and Databricks SQL warehouses support building production-grade pipelines using familiar SQL constructs. Topics include:
- Using streaming tables for real-time ingestion and processing
- Leveraging materialized views to deliver fast, pre-computed datasets
- Integrating with tools like dbt to manage batch and streaming workflows at scale
By the end of the session, you’ll understand how SQL-first approaches can streamline ETL development and support both operational and analytical use cases.
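
The two SQL-first constructs the session covers, shown via `spark.sql` so the snippet stays in one language; the catalog, schema and landing path are hypothetical, and the statements use documented Databricks SQL syntax:

```python
# A streaming table: incremental ingestion from files
spark.sql("""
  CREATE OR REFRESH STREAMING TABLE sales.bronze.orders
  AS SELECT * FROM STREAM read_files('/Volumes/sales/landing/orders', format => 'json')
""")

# A materialized view: a pre-computed aggregate kept fresh by the platform
spark.sql("""
  CREATE OR REFRESH MATERIALIZED VIEW sales.gold.daily_revenue
  AS SELECT order_date, sum(amount) AS revenue
  FROM sales.bronze.orders
  GROUP BY order_date
""")
```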

Unifying Human-Curated Data Ingestion and Real-Time Updates with Databricks Lakeflow Declarative Pipelines, Protobuf and BSR

2025-06-10
talk
Dwight Whitlock (Clinician Nexus)

Red Stapler is a streaming-native system on Databricks that merges file-based ingestion and real-time user edits into a single Lakeflow Declarative Pipeline for near real-time feedback. Protobuf definitions, managed in the Buf Schema Registry (BSR), govern schema and data-quality rules, ensuring backward compatibility. All records — valid or not — are stored in an SCD Type 2 table, capturing every version for full history and immediate quarantine views of invalid data. This unified approach boosts data governance, simplifies auditing and streamlines error fixes. Running on Lakeflow Declarative Pipelines Serverless and the Kafka-compatible Bufstream keeps costs low by scaling down to zero when idle. Red Stapler’s configuration-driven Protobuf logic adapts easily to evolving survey definitions without risking production. The result is consistent validation, quick updates and a complete audit trail — all critical for trustworthy, flexible data pipelines.
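
Decoding Protobuf payloads from a Kafka-compatible stream in Spark uses the built-in `from_protobuf` function, sketched below. The broker address, topic, message name and descriptor path are placeholders, not Red Stapler's configuration:

```python
from pyspark.sql.protobuf.functions import from_protobuf

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<bufstream-host>:9092")  # Bufstream is Kafka-compatible
    .option("subscribe", "survey-edits")                         # hypothetical topic
    .load()
)

# Decode the binary value using a compiled descriptor file (placeholder path)
decoded = raw.select(
    from_protobuf("value", "SurveyEdit",
                  descFilePath="/Volumes/schemas/survey.desc").alias("edit")
).select("edit.*")
```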

Unlocking Streaming Power: How SEGA Wins With Lakeflow Declarative Pipelines

2025-06-10
talk
Felix Baker (SEGA Europe Limited), Craig Porteous (Advancing Analytics)

Streaming data is hard and costly — that's the default opinion, but it doesn’t have to be. In this session, discover how SEGA simplified complex streaming pipelines and turned them into a competitive edge. SEGA sees over 40,000 events per second. That's no easy task, but enabling personalised gaming experiences for over 50 million gamers drives a huge competitive advantage. If you’re wrestling with streaming challenges, this talk is your next checkpoint. We’ll unpack how Lakeflow Declarative Pipelines helped SEGA, from automated schema evolution and simple data quality management to seamless streaming reliability. Learn how Lakeflow Declarative Pipelines drives value by transforming chaos emeralds into clarity, delivering results for a global gaming powerhouse. We'll step through the architecture, approach and challenges we overcame. Join Craig Porteous, Microsoft MVP from Advancing Analytics, and Felix Baker, Head of Data Services at SEGA Europe, for a fast-paced, hands-on journey into Lakeflow Declarative Pipelines’ unique powers.

Why You Should Move to Lakeflow Declarative Pipelines Serverless

2025-06-10 Watch
lightning_talk
Nandini N (Databricks)

Lakeflow Declarative Pipelines Serverless offers a range of benefits that make it an attractive option for organizations looking to optimize their ETL (extract, transform, load) processes. Key benefits of Lakeflow Declarative Pipelines Serverless:
- Automatic infrastructure management
- Unified batch and streaming
- Cost and performance optimization
- Simplified configuration
- Granular observability
By moving to Lakeflow Declarative Pipelines Serverless, organizations can achieve faster, more reliable and cost-effective data pipeline management, ultimately driving better business insights and outcomes.

A Unified Solution for Data Management and Model Training With Apache Iceberg and Mosaic Streaming

2025-06-10 Watch
talk
Zilong Zhou (ByteDance)

This session introduces ByteDance’s challenges in data management and model training, and how they are addressed with Magnus (an enhanced Apache Iceberg) and Byted Streaming (a customized Mosaic Streaming). Magnus uses Iceberg’s branch/tag features to manage massive datasets and checkpoints efficiently. With enhanced metadata and a custom C++ data reader, Magnus achieves optimal sharding, shuffling and data loading. Flexible table migration, detailed metrics and built-in full-text indexes on Iceberg tables further ensure training reliability. When training with ultra-large datasets, ByteDance faced scalability and performance issues. Given Mosaic Streaming's scalability in distributed training and clean code structure, the team chose and customized it to resolve challenges like slow startup, high resource consumption and limited data source compatibility. In this session, we will explore Magnus and Byted Streaming, discuss their enhancements and demonstrate how they enable efficient and robust distributed training.
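
For context, a minimal training loop with the open-source `streaming` package (Mosaic Streaming) that ByteDance customized looks roughly like this; the remote/local paths and batch size are placeholders, and this sketches the stock library rather than Byted Streaming's internals:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Shards are assumed to have been written with streaming's MDSWriter
dataset = StreamingDataset(
    remote="s3://bucket/train-shards",   # placeholder remote store
    local="/tmp/train-cache",            # local shard cache
    shuffle=True,
    batch_size=32,
)
loader = DataLoader(dataset, batch_size=32, num_workers=8)

for batch in loader:
    ...  # training step goes here
```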

How Databricks Powers Real-Time Threat Detection at Barracuda XDR

2025-06-10 Watch
talk
Alex Dangel (Barracuda Networks), Merium Khalid (Barracuda Networks)

As cybersecurity threats grow in volume and complexity, organizations must efficiently process security telemetry for best-in-class detection and mitigation. Barracuda’s XDR platform is redefining security operations by layering advanced detection methodologies over a broad range of supported technologies. Our vision is to deliver unparalleled protection through automation, machine learning and scalable detection frameworks, ensuring threats are identified and mitigated quickly. To achieve this, we have adopted Databricks as the foundation of our security analytics platform, providing greater control and flexibility while decoupling from traditional SIEM tools. By leveraging Lakeflow Declarative Pipelines, Spark Structured Streaming and detection-as-code CI/CD pipelines, we have built a real-time detection engine that enhances scalability, accuracy and cost efficiency. This session explores how Databricks is shaping the future of XDR through real-time analytics and cloud-native security.

Spark 4.0 and Delta 4.0 For Streaming Data

2025-06-10 Watch
talk
Bryce Bartmann (Shell)

Real-time data is one of the most important datasets for any data and AI platform across any industry. Spark 4.0 and Delta 4.0 include new features that make ingestion and querying of real-time data better than ever before, such as:
- Python custom data sources for simple ingestion of streaming and batch time-series data
- Variant types for managing variable data types and JSON payloads that are common in the real-time domain
- Delta liquid clustering for simple data clustering without the overhead or complexity of partitioning
In this presentation you will learn how data teams can leverage these latest features to build industry-leading, real-time data products using Spark and Delta, with real-world examples and metrics of the performance and processing improvements they bring.
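
The Variant feature in one small, hedged example using Spark 4.0's `parse_json` and `try_variant_get` functions: parse heterogeneous JSON payloads once, then pull typed fields without a rigid schema. The payload and column names are invented:

```python
from pyspark.sql.functions import parse_json, try_variant_get

df = spark.createDataFrame(
    [('{"sensor": "t1", "reading": {"temp_c": 21.5}}',)], ["payload"]
).withColumn("v", parse_json("payload"))    # VARIANT column

df.select(
    try_variant_get("v", "$.sensor", "string").alias("sensor"),
    try_variant_get("v", "$.reading.temp_c", "double").alias("temp_c"),
).show()
```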

Harnessing Real-Time Data and AI for Retail Innovation

2025-06-10 Watch
talk
Lorenz Verzosa (Databricks), Tristen Wentling (Databricks)

This talk explores using advanced data processing and generative AI techniques to revolutionize the retail industry. Using Databricks, we will discuss how cutting-edge technologies enable real-time data analysis and machine learning applications, creating a powerful ecosystem for large-scale, data-driven retail solutions. Attendees will gain insights into architecting scalable data pipelines for retail operations and implementing advanced analytics on streaming customer data. Discover how these integrated technologies drive innovation in retail, enhancing customer experiences, streamlining operations and enabling data-driven decision-making. Learn how retailers can leverage these tools to gain a competitive edge in the rapidly evolving digital marketplace, ultimately driving growth and adaptability in the face of changing consumer behaviors and market dynamics.