talk-data.com

Event

Data + AI Summit 2025

2025-06-09 – 2025-06-13 Databricks Summit

Activities tracked

66

Filtering by: Spark

Sessions & talks

Showing 1–25 of 66 · Newest first

Breaking Up With Spark Versions: Client APIs, AI-Powered Automatic Updates, and Dependency Management for Databricks Serverless

2025-06-12 Watch
talk
Justin Breese (Databricks)

This session explains how we've made our Apache Spark™ versionless for end users by introducing a stable client API, environment versioning and automatic remediation. These capabilities have enabled auto-upgrade of hundreds of millions of workloads with minimal disruption for Serverless Notebooks and Jobs. We'll also introduce a new approach to dependency management using environments. Admins will learn how to speed up package installation with Default Base Environments, and users will see how to manage custom environments for their own workloads.

Iceberg Geo Type: Transforming Geospatial Data Management at Scale

2025-06-12 Watch
talk
Szehon Ho (Databricks) , Jia Yu (Wherobots Inc.)

The Apache Iceberg™ community is introducing native geospatial type support, addressing key challenges in managing geospatial data at scale, including fragmented formats and inefficiencies in storing large spatial datasets. This talk will delve into the origins of the Iceberg geo type, its specification design and future goals. We will examine the impact on both communities: introducing a standard data warehouse storage layer to the geospatial community, and enabling optimized geospatial analytics for Iceberg users. We will also present a live demonstration of the Iceberg geo data type with Apache Sedona™ and Apache Spark™, showcasing how it simplifies and accelerates geospatial analytics workflows and queries. Finally, we will provide an in-depth look at its current capabilities, outline the roadmap for future developments and offer a perspective on its role in advancing geospatial data management in the industry.

Supercharging Sales Intelligence: Processing Billions of Events via Structured Streaming

2025-06-12 Watch
talk
Anurag Bharati (DigiCert) , Nikita Raje (DigiCert)

DigiCert is a digital security company that provides digital certificates, encryption and authentication services and serves 88% of the Fortune 500, securing over 28 billion web connections daily. Our project aggregates and analyzes certificate transparency logs via public APIs to provide comprehensive market and competitive intelligence. Instead of relying on third-party providers with limited data, our project gives full control, deeper insights and automation. Databricks has helped us reliably poll public APIs in a scalable manner that fetches millions of events daily, deduplicate and store them in our Delta tables. We specifically use Spark for parallel processing, structured streaming for real-time ingestion and deduplication, Delta tables for data reliability, pools and jobs to ensure our costs are optimized. These technologies help us keep our data fresh, accurate and cost effective. This data has helped our sales team with real-time intelligence, ensuring DigiCert's success.
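
The deduplication step mentioned above can be illustrated with a minimal sketch in plain Python. This is not DigiCert's pipeline and it does not use the Spark API: the `Deduplicator` class, the `(event_id, event_time)` event shape and the 60-second window are all hypothetical. It only shows the general idea of keeping per-key state across micro-batches and evicting old keys so that the state stays bounded.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (event_id, event_time_in_seconds).
Event = Tuple[str, int]


class Deduplicator:
    """Keep the first occurrence of each event ID seen within a time window."""

    def __init__(self, window_seconds: int) -> None:
        self.window_seconds = window_seconds
        self._seen: Dict[str, int] = {}  # event_id -> event time of first sighting
        self._max_event_time = 0

    def process_batch(self, batch: Iterable[Event]) -> List[Event]:
        """Return only the events whose ID is not already held in state."""
        fresh: List[Event] = []
        for event_id, event_time in batch:
            if event_id not in self._seen:
                self._seen[event_id] = event_time
                fresh.append((event_id, event_time))
            self._max_event_time = max(self._max_event_time, event_time)
        # Evict IDs older than the window so state does not grow without bound.
        cutoff = self._max_event_time - self.window_seconds
        self._seen = {k: t for k, t in self._seen.items() if t >= cutoff}
        return fresh


dedup = Deduplicator(window_seconds=60)
first = dedup.process_batch([("a", 10), ("b", 12), ("a", 15)])
second = dedup.process_batch([("b", 20), ("c", 30)])
```

Here `first` keeps one copy each of `a` and `b`, and `second` keeps only `c`. One simplification to note: an ID that reappears after it has been evicted is treated as new, so the window bounds how far back duplicates are caught.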

Kill Bill-ing? Revenge is a Dish Best Served Optimized with GenAI

2025-06-12 Watch
lightning_talk
Abdul Furkhan (Sportsbet)

In an era where cloud costs can spiral out of control, Sportsbet achieved a remarkable 49% reduction in Total Cost of Ownership (TCO) through an innovative AI-powered solution called 'Kill Bill.' This presentation reveals how we transformed Databricks' consumption-based pricing model from a challenge into a strategic advantage through intelligent automation and optimization.
Understand how to use GenAI to reduce Databricks TCO
Leverage generative AI within Databricks solutions to automate the analysis of cluster logs, resource consumption, configurations and codebases and to provide Spark optimization suggestions
Create AI agentic workflows by integrating Databricks' AI tools and Databricks Data Engineering tools
Review a case study demonstrating how Total Cost of Ownership was reduced in practice
Attendees will leave with a clear understanding of how to implement AI within Databricks solutions to address similar cost challenges in their environments.

Sponsored by: definity | How You Could Be Saving 50% of Your Spark Costs

2025-06-12 Watch
lightning_talk
Roy Daniel (definity)

Enterprise lakehouse platforms are rapidly scaling – and so are complexity and cost. After monitoring over 1B vCore-hours across Databricks and other Apache Spark™ environments, we consistently saw resource waste, preventable data incidents, and painful troubleshooting. Join this session to discover how definity’s unique full-stack observability provides job-level visibility in-motion, unifying infrastructure performance, pipeline execution, and data behavior, and see how enterprise teams use definity to easily optimize jobs and save millions – while proactively ensuring SLAs, preventing issues, and simplifying RCA.

What’s New in Apache Spark™ 4.0?

2025-06-12 Watch
talk
Wenchen Fan (Databricks) , Daniel Tenedorio (Databricks)

Join this session for a concise tour of Apache Spark™ 4.0’s most notable enhancements:
SQL features: ANSI by default, scripting, SQL pipe syntax, SQL UDF, session variable, view schema evolution, etc.
Data type: VARIANT type, string collation
Python features: Python data source, plotting API, etc.
Streaming improvements: State store data source, state store checkpoint v2, arbitrary state v2, etc.
Spark Connect improvements: More API coverage, thin client, unified Scala interface, etc.
Infrastructure: Better error message, structured logging, new Java/Scala version support, etc.
Whether you’re a seasoned Spark user or new to the ecosystem, this talk will prepare you to leverage Spark 4.0’s latest innovations for modern data and AI pipelines.

Founder discussion: Matei on UC, Data Intelligence and AI Governance

2025-06-12 Watch
talk
Matei Zaharia (Databricks)

Matei is a legend of open source: he started the Apache Spark project in 2009, co-founded Databricks, and worked on other widely used data and AI software, including MLflow, Delta Lake, and Dolly. His most recent research is about combining large language models (LLMs) with external data sources, such as search systems, and improving their efficiency and result quality. This will be a conversation covering the latest and greatest of UC, Data Intelligence, AI Governance, and more.

Get the Most of Your Delta Lake

2025-06-12 Watch
lightning_talk
Youssef Mrini (Databricks)

Unlock the full potential of Delta Lake, the open-source storage framework for Apache Spark, with this session focused on its latest and most impactful features. Discover how capabilities like Time Travel, Column Mapping, Deletion Vectors, Liquid Clustering, UniForm interoperability, and Change Data Feed (CDF) can transform your data architecture. Learn not just what these features do, but when and how to use them to maximize performance, simplify data management, and enable advanced analytics across your lakehouse environment.

Incremental Iceberg Table Replication at Scale

2025-06-12 Watch
talk
Hongyue Hongyue (Self-Employed) , Szehon Ho (Databricks)

Apache Iceberg is a popular table format for managing large analytical datasets. But replicating Iceberg tables at scale can be a daunting task, especially when dealing with its hierarchical metadata. In this talk, we present an end-to-end workflow for replicating Apache Iceberg tables, leveraging Apache Spark to ensure that backup tables remain identical to their source counterparts. More excitingly, we have contributed these libraries back to the open-source community. Attendees will gain a comprehensive understanding of how to set up replication workflows for Iceberg tables, as well as practical guidance on how to manage and maintain replicated datasets at scale. This talk is ideal for data engineers, platform architects and practitioners looking to apply replication and disaster recovery for Apache Iceberg in complex data ecosystems.

Introducing Simplified State Tracking in Apache Spark™ Structured Streaming

2025-06-12 Watch
lightning_talk
Craig Lukasik (Databricks)

This presentation will review the new change feed and snapshot capabilities in Apache Spark™ Structured Streaming’s State Reader API. The State Reader API enables users to access and analyze Structured Streaming's internal state data. Attendees will learn how to leverage the new features to debug, troubleshoot and analyze state changes efficiently, making streaming workloads easier to manage at scale.

Using Delta-rs and Delta-Kernel-rs to Serve CDC Feeds

2025-06-12 Watch
talk
Stephen Carman (Databricks) , Oussama Saoudi (Databricks)

Change data feeds (CDFs) are a common tool for synchronizing changes between tables and performing data processing in a scalable fashion. Serverless architectures offer a compelling solution for organizations looking to avoid the complexity of managing infrastructure. But how can you bring CDFs into a serverless environment? In this session, we'll explore how to integrate Change Data Feeds into serverless architectures using Delta-rs and Delta-Kernel-rs, open-source projects that allow you to read Delta tables and their change data feeds in Rust or Python. We’ll demonstrate how to use these tools with Lakestore’s serverless platform to easily stream and process changes. You’ll learn how to:
Leverage Delta tables and CDFs in serverless environments
Utilize Databricks and Unity Catalog without needing Apache Spark

Creating a Custom PySpark Stream Reader with PySpark 4.0

2025-06-11 Watch
lightning_talk
Skyler Myers (Entrada)

PySpark supports many data sources out of the box, such as Apache Kafka, JDBC, ODBC, Delta Lake, etc. However, some older systems, such as those that use the JMS protocol, are not supported by default and require considerable extra work for developers to read from them. One such example is ActiveMQ for streaming. Traditionally, users of ActiveMQ have had to use a middle-man in order to read the stream with Spark (such as writing to a MySQL DB using Java code and reading that table with Spark JDBC). With PySpark 4.0’s custom data sources (supported in DBR 15.3+), we are able to cut out the middle-man processing and consume the queues directly from PySpark, in batch or streaming mode. This saves developers considerable time and complexity in getting source data into Delta Lake, where it is governed by Unity Catalog and orchestrated with Databricks Workflows.

Sponsored by: dbt Labs | Leveling Up Data Engineering at Riot: How We Rolled Out dbt and Transformed the Developer Experience

2025-06-11 Watch
lightning_talk
Marco Garcia (Riot Games)

Riot Games reduced its Databricks compute spend and accelerated development cycles by transforming its data engineering workflows—migrating from bespoke Databricks notebooks and Spark pipelines to a scalable, testable, and developer-friendly dbt-based architecture. In this talk, members of the Developer Experience & Automation (DEA) team will walk through how they designed and operationalized dbt to support Riot’s evolving data needs.

Declarative Pipelines: What’s Next for the Apache Spark Ecosystem

2025-06-11 Watch
talk
Michael Armbrust (Databricks) , Sandy Ryza (Databricks)

Lakeflow Declarative Pipelines has made it dramatically easier to build production-grade Spark pipelines, using a framework that abstracts away orchestration and complexity. It’s become a go-to solution for teams who want reliable, maintainable pipelines without reinventing the wheel. But we’re just getting started. In this session, we’ll take a step back and share a broader vision for the future of Spark Declarative Pipelines, one that opens the door to a new level of openness, standardization and community momentum. We’ll cover the core concepts behind Declarative Pipelines, where the architecture is headed, and what this shift means for both existing Lakeflow users and Spark engineers building procedural code. Don’t miss this session: we’ll be sharing something new that sets the direction for what comes next.

Extending the Lakehouse: Power Interoperable Compute With Unity Catalog Open APIs

2025-06-11 Watch
talk
Tathagata Das (Databricks) , Michelle Leon (Databricks)

The lakehouse is built for storage flexibility, but what about compute? In this session, we’ll explore how Unity Catalog enables you to connect and govern multiple compute engines across your data ecosystem. With open APIs and support for the Iceberg REST Catalog, UC lets you extend access to engines like Trino, DuckDB, and Flink while maintaining centralized security, lineage, and interoperability. We will show how you can get started today working with engines like Apache Spark and Starburst to read and write to UC managed tables with some exciting demos. Learn how to bring flexibility to your compute layer—without compromising control.

Race to Real-Time: Low-Latency Streaming ETL Meets Next-Gen Databricks OLTP-DB

2025-06-11 Watch
lightning_talk
Irfan Elahi (Databricks)

In today’s digital economy, real-time insights and rapid responsiveness are paramount to delivering exceptional user experiences and lowering TCO. In this session, discover a pioneering approach that leverages a low-latency streaming ETL pipeline built with Spark Structured Streaming and Databricks’ new OLTP-DB, a serverless, managed Postgres offering designed for transactional workloads. Validated in a live customer scenario, this architecture achieves sub-2-second end-to-end latency by seamlessly ingesting streaming data from Kinesis and merging it into OLTP-DB. This breakthrough not only enhances performance and scalability but also provides a replicable blueprint for transforming data pipelines across various verticals. Join us as we delve into the advanced optimization techniques and best practices that underpin this innovation, demonstrating how Databricks’ next-generation solutions can revolutionize real-time data processing and unlock a myriad of new use cases in the data landscape.

Spark Right-Sizing: Saving Thousands of PBHrs of Compute at LinkedIn

2025-06-11 Watch
lightning_talk
Shreyesh Arangath (LinkedIn)

At LinkedIn, we manage over 400,000 daily Spark applications consuming 200+ PBHrs of compute daily. To address the challenges posed by manual configuration of Spark's memory tuning options, which led to low memory utilization and frequent OOM errors, we developed an automated Spark executor memory right-sizing system. Our approach, utilizing a policy-based system with nearline and real-time feedback loops, automates memory tuning, leading to more efficient resource allocation, improved user productivity and increased job reliability. By leveraging historical data and real-time error classification, we dynamically adjust memory, significantly narrowing the gap between allocated and utilized resources while reducing failures. This initiative has achieved a 13% increase in memory utilization and a 90% drop in OOM-related job failures, saving us 1000s of PBHrs of compute every year.
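
As a rough illustration of the policy-based idea described above, the following sketch recommends executor memory for the next run of a recurring job. It is not LinkedIn's system: the `RightSizingPolicy` class, the function name and every threshold are hypothetical values chosen for the example. It assumes only that historical peak usage and an OOM flag from the last run are available.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RightSizingPolicy:
    # All values are hypothetical illustrations, not LinkedIn's settings.
    headroom_pct: int = 120   # size to 120% of the highest observed peak
    oom_boost_pct: int = 150  # grow the allocation to 150% after an OOM
    min_mb: int = 1024        # never recommend less than this
    max_mb: int = 65536       # never recommend more than this


def recommend_executor_memory_mb(
    current_mb: int,
    peak_usage_mb: Sequence[int],
    last_run_oom: bool,
    policy: RightSizingPolicy = RightSizingPolicy(),
) -> int:
    """Return a memory recommendation (in MB) for the next run of a job."""
    if last_run_oom:
        # Fast feedback: an OOM means history underestimates demand, so grow.
        target = current_mb * policy.oom_boost_pct // 100
    elif peak_usage_mb:
        # Slower feedback: size to the highest historical peak plus headroom.
        target = max(peak_usage_mb) * policy.headroom_pct // 100
    else:
        # No history yet: keep the current allocation.
        target = current_mb
    # Clamp the result to the policy's bounds.
    return min(max(target, policy.min_mb), policy.max_mb)


shrunk = recommend_executor_memory_mb(8192, [2000, 2500, 2400], last_run_oom=False)
grown = recommend_executor_memory_mb(4096, [4000], last_run_oom=True)
```

In this sketch an over-provisioned job (8192 MB allocated, 2500 MB peak) is shrunk, while a job that just failed with an OOM is grown regardless of its history. Integer percentages are used so the arithmetic stays exact.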

Delivering Sub-Second Latency for Operational Workloads on Databricks

2025-06-11 Watch
talk
Karthikeyan Ramasamy (Databricks) , Jerry Peng (Databricks)

As enterprise streaming adoption accelerates, more teams are turning to real-time processing to support operational workloads that require sub-second response times. To address this need, Databricks introduced Project Lightspeed in 2022, which recently delivered Real-Time Mode in Apache Spark™ Structured Streaming. This new mode achieves consistent p99 latencies under 300ms for a wide range of stateless and stateful streaming queries. In this session, we’ll define what constitutes an operational use case, outline typical latency requirements and walk through how to meet those SLAs using Real-Time Mode in Structured Streaming.

Empowering the Warfighter With AI

2025-06-11 Watch
talk
Teneika Askew (Navy)

The new Budget Execution Validation process has transformed how the Navy reviews unspent funds. Powered by Databricks Workflows, MLflow, Delta Lake and Apache Spark™, this data-driven model predicts which financial transactions are most likely to have errors, streamlining reviews and increasing accuracy. In FY24, it helped review $40 billion, freeing $1.1 billion for other priorities, including $260 million from active projects. By reducing reviews by 80%, cutting job runtime by over 50% and lowering costs by 60%, it saved 218,000 work hours and $6.7 million in labor costs. With automated workflows and robust data management, this system exemplifies how advanced tools can improve financial decision-making, save resources and ensure efficient use of taxpayer dollars.

Scaling Identity Graph Ingestion to 1M Events/Sec with Spark Streaming & Delta Lake

2025-06-11 Watch
talk
Akanksha Nagpal (Adobe) , Jianmei Ye (Adobe, Inc.)

Adobe’s Real-Time Customer Data Platform relies on the identity graph to connect over 70 billion identities and deliver personalized experiences. This session will showcase how the platform leverages Databricks, Spark Streaming and Delta Lake, along with 25+ Databricks deployments across multiple regions and clouds — Azure & AWS — to process terabytes of data daily and handle over a million records per second. The talk will highlight the platform’s ability to scale, demonstrating a 10x increase in ingestion pipeline capacity to accommodate peak traffic during events like the Super Bowl. Attendees will learn about the technical strategies employed, including migrating from Flink to Spark Streaming, optimizing data deduplication, and implementing robust monitoring and anomaly detection. Discover how these optimizations enable Adobe to deliver real-time identity resolution at scale while ensuring compliance and privacy.

Summit Live: Spark Talk - Everything Spark, Lakeflow Declarative Pipelines, and Open Source

2025-06-11 Watch
talk
Michael Armbrust (Databricks)

Databricks co-founders created Spark, the wildly popular open source foundation of Databricks, way back in 2009. Learn from Michael Armbrust, creator of Spark SQL and leader of Databricks Delta, about the latest happenings in Spark, Lakeflow Declarative Pipelines, and open source.

Apache Spark — Ask Us Anything

2025-06-11
lightning_talk
Allison Wang (Databricks) , Jules Damji (Databricks) , DB Tsai (Databricks)

Join us for an interactive Ask Me Anything (AMA) session on the latest advancements in Apache Spark 4, including Spark Connect — the new client-server architecture enabling seamless integration with IDEs, notebooks and custom applications. Learn about performance improvements, enhanced APIs and best practices for leveraging Spark’s next-generation features. Whether you're a data engineer, Spark developer or big data enthusiast, bring your questions on architecture, real-world use cases and how these innovations can optimize your workflows. Don’t miss this chance to dive deep into the future of distributed computing with Spark!

PDF Document Ingestion Accelerator for GenAI Applications

2025-06-11 Watch
talk
Qian Yu (Databricks)

Databricks Financial Service customers in the GenAI space have a common use case: ingesting and processing unstructured documents (PDFs and images), then performing downstream GenAI tasks such as entity extraction and RAG-based knowledge Q&A. The pain points for customers in these use cases are:
The quality of the PDF/image documents varies, since many older physical documents were scanned into electronic form
The complexity of the PDF/image documents varies, and many contain tables (images with embedded information), which require slower Tesseract OCR
Customers would like to streamline postprocessing for downstream workloads
In this talk we will present an optimized structured streaming workflow for complex PDF ingestion. The key techniques include Apache Spark™ optimization, multi-threading, PDF object extraction, skew handling and auto-retry logic.

Breaking Barriers: Building Custom Spark 4.0 Data Connectors with Python

2025-06-11 Watch
talk
Sourav Gulati (Databricks) , Ashish Saraswat (Databricks)

Building a custom Spark data source connector once required Java or Scala expertise, making it complex and limiting. This left many proprietary data sources without public SDKs disconnected from Spark. Additionally, data sources with Python SDKs couldn't harness Spark’s distributed power. Spark 4.0 changes this with a new Python API for data source connectors, allowing developers to build fully functional connectors without Java or Scala. This unlocks new possibilities, from integrating proprietary systems to leveraging untapped data sources. Supporting both batch and streaming, this API makes data ingestion more flexible than ever. In this talk, we’ll demonstrate how to build a Spark connector for Excel using Python, showcasing schema inference, data reads/writes and streaming support. Whether you're a data engineer or Spark enthusiast, you’ll gain the knowledge to integrate Spark with any data source — entirely in Python.

Serverless as the New "Easy Button": How HP Inc. Used Serverless to Turbocharge Their Data Pipeline

2025-06-11 Watch
talk
Matthew Wright (Zahlen Solutions LLC) , Jason Hart (Zahlen Solutions)

How do you wrangle over 8TB of granular “hit-level” website analytics data with hundreds of columns, all while eliminating the overhead of cluster management, decreasing runtime and saving money? In this session, we’ll dive into how we helped HP Inc. use Databricks serverless compute and Lakeflow Declarative Pipelines to streamline Adobe Analytics data ingestion while making it faster, cheaper and easier to operate. We’ll walk you through our full migration story — from managing unwieldy custom-defined AWS-based Apache Spark™ clusters to spinning up Databricks serverless pipelines and workflows with on-demand scalability and near-zero overhead. If you want to simplify infrastructure, optimize performance and get more out of your Databricks workloads, this session is for you.