Data + AI Summit 2025

Breaking Up With Spark Versions: Client APIs, AI-Powered Automatic Updates, and Dependency Management for Databricks Serverless

2025-06-12 Watch

talk

Justin Breese (Databricks)

AI/ML API Databricks Spark

This session explains how we've made our Apache Spark™ versionless for end users by introducing a stable client API, environment versioning and automatic remediation. These capabilities have enabled auto-upgrade of hundreds of millions of workloads with minimal disruption for Serverless Notebooks and Jobs. We'll also introduce a new approach to dependency management using environments. Admins will learn how to speed up package installation with Default Base Environments, and users will see how to manage custom environments for their own workloads.

Iceberg Geo Type: Transforming Geospatial Data Management at Scale

2025-06-12 Watch

talk

Szehon Ho (Databricks) , Jia Yu (Wherobots Inc.)

Analytics Data Management DWH Iceberg Spark

The Apache Iceberg™ community is introducing native geospatial type support, addressing key challenges in managing geospatial data at scale, including fragmented formats and inefficiencies in storing large spatial datasets. This talk will delve into the origins of the Iceberg geo type, its specification design and future goals. We will examine the impact on both the geospatial and Iceberg communities, in introducing a standard data warehouse storage layer to the geospatial community, and enabling optimized geospatial analytics for Iceberg users. We will also present a live demonstration of the Iceberg geo data type with Apache Sedona™ and Apache Spark™, showcasing how it simplifies and accelerates geospatial analytics workflows and queries. Finally, we will also provide an in-depth look at its current capabilities and outline the roadmap for future developments, and offer a perspective on its role in advancing geospatial data management in the industry.

Supercharging Sales Intelligence: Processing Billions of Events via Structured Streaming

2025-06-12 Watch

talk

Anurag Bharati (DigiCert) , Nikita Raje (DigiCert)

API Databricks Delta Cyber Security Spark Data Streaming

DigiCert is a digital security company that provides digital certificates, encryption and authentication services and serves 88% of the Fortune 500, securing over 28 billion web connections daily. Our project aggregates and analyzes certificate transparency logs via public APIs to provide comprehensive market and competitive intelligence. Instead of relying on third-party providers with limited data, our project gives us full control, deeper insights and automation. Databricks has helped us reliably poll public APIs at scale, fetching millions of events daily, then deduplicating and storing them in our Delta tables. We specifically use Spark for parallel processing, Structured Streaming for real-time ingestion and deduplication, Delta tables for data reliability, and pools and jobs to keep our costs optimized. These technologies help us keep our data fresh, accurate and cost-effective. This data has helped our sales team with real-time intelligence, ensuring DigiCert's success.

Kill Bill-ing? Revenge is a Dish Best Served Optimized with GenAI

2025-06-12 Watch

lightning_talk

Abdul Furkhan (Sportsbet)

AI/ML Cloud Computing Data Engineering Databricks GenAI Spark

In an era where cloud costs can spiral out of control, Sportsbet achieved a remarkable 49% reduction in Total Cost of Ownership (TCO) through an innovative AI-powered solution called 'Kill Bill.' This presentation reveals how we transformed Databricks' consumption-based pricing model from a challenge into a strategic advantage through intelligent automation and optimization. In this session you will: understand how to use GenAI to reduce Databricks TCO; leverage generative AI within Databricks solutions to automate analysis of cluster logs, resource consumption, configurations and codebases and provide Spark optimization suggestions; create agentic AI workflows by integrating Databricks' AI tools and Databricks Data Engineering tools; and review a case study demonstrating how Total Cost of Ownership was reduced in practice. Attendees will leave with a clear understanding of how to implement AI within Databricks solutions to address similar cost challenges in their environments.

Sponsored by: definity | How You Could Be Saving 50% of Your Spark Costs

What’s New in Apache Spark™ 4.0?

2025-06-12 Watch

talk

Wenchen Fan (Databricks) , Daniel Tenedorio (Databricks)

AI/ML API Java Python Scala Spark

Join this session for a concise tour of Apache Spark™ 4.0’s most notable enhancements. SQL features: ANSI by default, scripting, SQL pipe syntax, SQL UDF, session variable, view schema evolution, etc. Data types: VARIANT type, string collation. Python features: Python data source, plotting API, etc. Streaming improvements: state store data source, state store checkpoint v2, arbitrary state v2, etc. Spark Connect improvements: more API coverage, thin client, unified Scala interface, etc. Infrastructure: better error messages, structured logging, new Java/Scala version support, etc. Whether you’re a seasoned Spark user or new to the ecosystem, this talk will prepare you to leverage Spark 4.0’s latest innovations for modern data and AI pipelines.

Founder discussion: Matei on UC, Data Intelligence and AI Governance

2025-06-12 Watch

talk

Matei Zaharia (Databricks)

AI/ML Databricks Delta LLM Spark

Matei is a legend of open source: he started the Apache Spark project in 2009, co-founded Databricks, and worked on other widely used data and AI software, including MLflow, Delta Lake, and Dolly. His most recent research is about combining large language models (LLMs) with external data sources, such as search systems, and improving their efficiency and result quality. This will be a conversation covering the latest and greatest of UC, Data Intelligence, AI Governance, and more.

Get the Most of Your Delta Lake

2025-06-12 Watch

lightning_talk

Youssef Mrini (Databricks)

Analytics Data Lakehouse Data Management Delta Spark

Unlock the full potential of Delta Lake, the open-source storage framework for Apache Spark, with this session focused on its latest and most impactful features. Discover how capabilities like Time Travel, Column Mapping, Deletion Vectors, Liquid Clustering, UniForm interoperability, and Change Data Feed (CDF) can transform your data architecture. Learn not just what these features do, but when and how to use them to maximize performance, simplify data management, and enable advanced analytics across your lakehouse environment.

Incremental Iceberg Table Replication at Scale

2025-06-12 Watch

talk

Hongyue Hongyue (Self-Employed) , Szehon Ho (Databricks)

Iceberg Spark

Apache Iceberg is a popular table format for managing large analytical datasets. But replicating Iceberg tables at scale can be a daunting task, especially when dealing with its hierarchical metadata. In this talk, we present an end-to-end workflow for replicating Apache Iceberg tables, leveraging Apache Spark to ensure that backup tables remain identical to their source counterparts. More excitingly, we have contributed these libraries back to the open-source community. Attendees will gain a comprehensive understanding of how to set up replication workflows for Iceberg tables, as well as practical guidance on how to manage and maintain replicated datasets at scale. This talk is ideal for data engineers, platform architects and practitioners looking to apply replication and disaster recovery for Apache Iceberg in complex data ecosystems.

Introducing Simplified State Tracking in Apache Spark™ Structured Streaming

2025-06-12 Watch

lightning_talk

Craig Lukasik (Databricks)

API Spark Data Streaming

This presentation will review the new change feed and snapshot capabilities in Apache Spark™ Structured Streaming’s State Reader API. The State Reader API enables users to access and analyze Structured Streaming's internal state data. Readers will learn how to leverage the new features to debug, troubleshoot and analyze state changes efficiently, making streaming workloads easier to manage at scale.

Using Delta-rs and Delta-Kernel-rs to Serve CDC Feeds

2025-06-12 Watch

talk

Stephen Carman (Databricks) , Oussama Saoudi (Databricks)

Databricks Delta Python Rust Spark

Change data feeds are a common tool for synchronizing changes between tables and performing data processing in a scalable fashion. Serverless architectures offer a compelling solution for organizations looking to avoid the complexity of managing infrastructure. But how can you bring CDFs into a serverless environment? In this session, we'll explore how to integrate Change Data Feeds into serverless architectures using Delta-rs and Delta-kernel-rs, open-source projects that allow you to read Delta tables and their change data feeds in Rust or Python. We’ll demonstrate how to use these tools with Lakestore’s serverless platform to easily stream and process changes. You’ll learn how to leverage Delta tables and CDFs in serverless environments, and how to utilize Databricks and Unity Catalog without needing Apache Spark.

Creating a Custom PySpark Stream Reader with PySpark 4.0

2025-06-11 Watch

lightning_talk

Skyler Myers (Entrada)

Databricks Delta Java Kafka MySQL PySpark

PySpark supports many data sources out of the box, such as Apache Kafka, JDBC, ODBC and Delta Lake. However, some older systems, such as those that use the JMS protocol, are not supported by default and require considerable extra work from developers to read from them. One such example is ActiveMQ for streaming. Traditionally, users of ActiveMQ have had to use a middleman in order to read the stream with Spark (such as writing to a MySQL DB using Java code and reading that table with Spark JDBC). With PySpark 4.0’s custom data sources (supported in DBR 15.3+) we are able to cut out the middleman processing and consume the queues directly from PySpark, using batch or Spark Streaming. This saves developers considerable time and complexity in getting source data into Delta Lake, where it is governed by Unity Catalog and orchestrated with Databricks Workflows.

Sponsored by: dbt Labs | Leveling Up Data Engineering at Riot: How We Rolled Out dbt and Transformed the Developer Experience

Declarative Pipelines: What’s Next for the Apache Spark Ecosystem

2025-06-11 Watch

talk

Michael Armbrust (Databricks) , Sandy Ryza (Databricks)

Spark

Lakeflow Declarative Pipelines has made it dramatically easier to build production-grade Spark pipelines, using a framework that abstracts away orchestration and complexity. It’s become a go-to solution for teams who want reliable, maintainable pipelines without reinventing the wheel. But we’re just getting started. In this session, we’ll take a step back and share a broader vision for the future of Spark Declarative Pipelines, one that opens the door to a new level of openness, standardization and community momentum. We’ll cover the core concepts behind Declarative Pipelines, where the architecture is headed, and what this shift means for both existing Lakeflow users and Spark engineers building procedural code. Don’t miss this session: we’ll be sharing something new that sets the direction for what comes next.

Extending the Lakehouse: Power Interoperable Compute With Unity Catalog Open APIs

2025-06-11 Watch

talk

Tathagata Das (Databricks) , Michelle Leon (Databricks)

Flink API Data Lakehouse DuckDB Iceberg Cyber Security

The lakehouse is built for storage flexibility, but what about compute? In this session, we’ll explore how Unity Catalog enables you to connect and govern multiple compute engines across your data ecosystem. With open APIs and support for the Iceberg REST Catalog, UC lets you extend access to engines like Trino, DuckDB, and Flink while maintaining centralized security, lineage, and interoperability. We will show how you can get started today working with engines like Apache Spark and Starburst to read and write to UC managed tables with some exciting demos. Learn how to bring flexibility to your compute layer—without compromising control.

Race to Real-Time: Low-Latency Streaming ETL Meets Next-Gen Databricks OLTP-DB

2025-06-11 Watch

lightning_talk

Irfan Elahi (Databricks)

Kinesis Databricks ETL/ELT postgresql Spark Data Streaming

In today’s digital economy, real-time insights and rapid responsiveness are paramount to delivering exceptional user experiences and lowering TCO. In this session, discover a pioneering approach that leverages a low-latency streaming ETL pipeline built with Spark Structured Streaming and Databricks’ new OLTP-DB, a serverless, managed Postgres offering designed for transactional workloads. Validated in a live customer scenario, this architecture achieves sub-2-second end-to-end latency by seamlessly ingesting streaming data from Kinesis and merging it into OLTP-DB. This breakthrough not only enhances performance and scalability but also provides a replicable blueprint for transforming data pipelines across various verticals. Join us as we delve into the advanced optimization techniques and best practices that underpin this innovation, demonstrating how Databricks’ next-generation solutions can revolutionize real-time data processing and unlock a myriad of new use cases in the data landscape.

Spark Right-Sizing: Saving Thousands of PBHrs of Compute at LinkedIn

2025-06-11 Watch

lightning_talk

Shreyesh Arangath (LinkedIn)

Spark

At LinkedIn, we manage over 400,000 daily Spark applications consuming 200+ PBHrs of compute daily. To address the challenges posed by manual configuration of Spark's memory tuning options, which led to low memory utilization and frequent OOM errors, we developed an automated Spark executor memory right-sizing system. Our approach, utilizing a policy-based system with nearline and real-time feedback loops, automates memory tuning, leading to more efficient resource allocation, improved user productivity and increased job reliability. By leveraging historical data and real-time error classification, we dynamically adjust memory, significantly narrowing the gap between allocated and utilized resources while reducing failures. This initiative has achieved a 13% increase in memory utilization and a 90% drop in OOM-related job failures, saving us 1000s of PBHrs of compute every year.
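
As a rough illustration of the kind of feedback loop this abstract describes, the sketch below sizes executor memory from historical peak usage and scales it up after an OOM failure. It is not LinkedIn's system: the function name `next_executor_memory_mb`, the 20% headroom, the 1.5x OOM multiplier and the memory bounds are all hypothetical values chosen for the example.

```python
from typing import Sequence


def next_executor_memory_mb(
    current_mb: int,
    peak_usage_mb: Sequence[int],
    last_run_oom: bool,
    headroom: float = 0.20,       # assumed safety margin above the observed peak
    oom_multiplier: float = 1.5,  # assumed scale-up factor after an OOM failure
    min_mb: int = 1024,           # assumed lower bound
    max_mb: int = 65536,          # assumed upper bound
) -> int:
    """Return the executor memory (MB) to request for the next run."""
    if last_run_oom:
        # Fast loop: react to a failure classified as OOM by scaling up.
        proposed = current_mb * oom_multiplier
    elif peak_usage_mb:
        # Slow loop: size to the historical peak plus headroom.
        proposed = max(peak_usage_mb) * (1.0 + headroom)
    else:
        # No history yet: keep the current setting.
        proposed = float(current_mb)
    # Clamp to the allowed range and round to a whole number of MB.
    return round(min(max(proposed, min_mb), max_mb))
```

With these assumed defaults, a job allocated 8192 MB whose recent peaks were 3000, 3500 and 3200 MB would be sized down to 4200 MB, while the same job after an OOM would be sized up to 12288 MB.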

Delivering Sub-Second Latency for Operational Workloads on Databricks

2025-06-11 Watch

talk

Karthikeyan Ramasamy (Databricks) , Jerry Peng (Databricks)

Databricks Spark Data Streaming

As enterprise streaming adoption accelerates, more teams are turning to real-time processing to support operational workloads that require sub-second response times. To address this need, Databricks introduced Project Lightspeed in 2022, which recently delivered Real-Time Mode in Apache Spark™ Structured Streaming. This new mode achieves consistent p99 latencies under 300ms for a wide range of stateless and stateful streaming queries. In this session, we’ll define what constitutes an operational use case, outline typical latency requirements and walk through how to meet those SLAs using Real-Time Mode in Structured Streaming.
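
To make the latency target in this abstract concrete, here is a minimal sketch of checking a p99 threshold over a batch of measured end-to-end latencies. The helper names `percentile_nearest_rank` and `meets_latency_slo` are hypothetical, and the nearest-rank definition of a percentile is an assumption; the talk does not say how its p99 figures are computed.

```python
from typing import Sequence


def percentile_nearest_rank(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 1 <= p <= 100:
        raise ValueError("p must be an integer between 1 and 100")
    ordered = sorted(samples_ms)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(
    samples_ms: Sequence[float], threshold_ms: float = 300.0, p: int = 99
) -> bool:
    """True if the p-th percentile latency is strictly under the threshold."""
    return percentile_nearest_rank(samples_ms, p) < threshold_ms
```

For example, 98 samples at 100 ms plus one at 400 ms and one at 500 ms have a nearest-rank p99 of 400 ms, so they would miss a 300 ms target even though the typical latency is low.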

Empowering the Warfighter With AI

2025-06-11 Watch

talk

Teneika Askew (Navy)

AI/ML Data Management Databricks Delta Spark

The new Budget Execution Validation process has transformed how the Navy reviews unspent funds. Powered by Databricks Workflows, MLflow, Delta Lake and Apache Spark™, this data-driven model predicts which financial transactions are most likely to have errors, streamlining reviews and increasing accuracy. In FY24, it helped review $40 billion, freeing $1.1 billion for other priorities, including $260 million from active projects. By reducing reviews by 80%, cutting job runtime by over 50% and lowering costs by 60%, it saved 218,000 work hours and $6.7 million in labor costs. With automated workflows and robust data management, this system exemplifies how advanced tools can improve financial decision-making, save resources and ensure efficient use of taxpayer dollars.

Scaling Identity Graph Ingestion to 1M Events/Sec with Spark Streaming & Delta Lake

2025-06-11 Watch

talk

Akanksha Nagpal (Adobe) , Jianmei Ye (Adobe, Inc.)

Flink AWS Azure CDP Databricks Delta

Adobe’s Real-Time Customer Data Platform relies on the identity graph to connect over 70 billion identities and deliver personalized experiences. This session will showcase how the platform leverages Databricks, Spark Streaming and Delta Lake, along with 25+ Databricks deployments across multiple regions and clouds — Azure & AWS — to process terabytes of data daily and handle over a million records per second. The talk will highlight the platform’s ability to scale, demonstrating a 10x increase in ingestion pipeline capacity to accommodate peak traffic during events like the Super Bowl. Attendees will learn about the technical strategies employed, including migrating from Flink to Spark Streaming, optimizing data deduplication, and implementing robust monitoring and anomaly detection. Discover how these optimizations enable Adobe to deliver real-time identity resolution at scale while ensuring compliance and privacy.

Summit Live: Spark Talk - Everything Spark, Lakeflow Declarative Pipelines, and Open Source

2025-06-11 Watch

talk

Michael Armbrust (Databricks)

Databricks Delta Spark SQL

Databricks co-founders created Spark, the wildly popular open source foundation of Databricks, way back in 2009. Learn from Michael Armbrust, creator of Spark SQL and leader of Databricks Delta, about the latest happenings in Spark, Lakeflow Declarative Pipelines, and open source.

Apache Spark — Ask Us Anything

2025-06-11

lightning_talk

Allison Wang (Databricks) , Jules Damji (Databricks) , DB Tsai (Databricks)

API Big Data Spark

Join us for an interactive Ask Me Anything (AMA) session on the latest advancements in Apache Spark 4, including Spark Connect — the new client-server architecture enabling seamless integration with IDEs, notebooks and custom applications. Learn about performance improvements, enhanced APIs and best practices for leveraging Spark’s next-generation features. Whether you're a data engineer, Spark developer or big data enthusiast, bring your questions on architecture, real-world use cases and how these innovations can optimize your workflows. Don’t miss this chance to dive deep into the future of distributed computing with Spark!

PDF Document Ingestion Accelerator for GenAI Applications

2025-06-11 Watch

talk

Qian Yu (Databricks)

Databricks GenAI RAG Spark Data Streaming

Databricks Financial Service customers in the GenAI space have a common use case: ingesting and processing unstructured documents (PDFs and images), then performing downstream GenAI tasks such as entity extraction and RAG-based knowledge Q&A. The pain points for customers in these use cases are: the quality of the PDF/image documents varies, since many older physical documents were scanned into electronic form; the complexity of the PDF/image documents varies, and many contain tables (images with embedded information), which require slower Tesseract OCR; and customers would like to streamline post-processing for downstream workloads. In this talk we will present an optimized structured streaming workflow for complex PDF ingestion. The key techniques include Apache Spark™ optimization, multi-threading, PDF object extraction, skew handling and auto-retry logic.

Breaking Barriers: Building Custom Spark 4.0 Data Connectors with Python

2025-06-11 Watch

talk

Sourav Gulati (Databricks) , Ashish Saraswat (Databricks)

API Java Python Scala Spark Data Streaming

Building a custom Spark data source connector once required Java or Scala expertise, making it complex and limiting. This left many proprietary data sources without public SDKs disconnected from Spark. Additionally, data sources with Python SDKs couldn't harness Spark’s distributed power. Spark 4.0 changes this with a new Python API for data source connectors, allowing developers to build fully functional connectors without Java or Scala. This unlocks new possibilities, from integrating proprietary systems to leveraging untapped data sources. Supporting both batch and streaming, this API makes data ingestion more flexible than ever. In this talk, we’ll demonstrate how to build a Spark connector for Excel using Python, showcasing schema inference, data reads/writes and streaming support. Whether you're a data engineer or Spark enthusiast, you’ll gain the knowledge to integrate Spark with any data source — entirely in Python.

Serverless as the New "Easy Button": How HP Inc. Used Serverless to Turbocharge Their Data Pipeline

2025-06-11 Watch

talk

Matthew Wright (Zahlen Solutions LLC) , Jason Hart (Zahlen Solutions)

Adobe Analytics Analytics AWS Databricks Spark

How do you wrangle over 8TB of granular “hit-level” website analytics data with hundreds of columns, all while eliminating the overhead of cluster management, decreasing runtime and saving money? In this session, we’ll dive into how we helped HP Inc. use Databricks serverless compute and Lakeflow Declarative Pipelines to streamline Adobe Analytics data ingestion while making it faster, cheaper and easier to operate. We’ll walk you through our full migration story — from managing unwieldy custom-defined AWS-based Apache Spark™ clusters to spinning up Databricks serverless pipelines and workflows with on-demand scalability and near-zero overhead. If you want to simplify infrastructure, optimize performance and get more out of your Databricks workloads, this session is for you.
