talk-data.com

Event

Data + AI Summit 2025

2025-06-09 – 2025-06-13 Databricks Summit

Activities tracked

66

Filtering by: Spark

Sessions & talks

Showing 26–50 of 66 · Newest first

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

2025-06-11
talk
DB Tsai (Databricks) , Xiao Li (Databricks)

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!

Unity Catalog Lakeguard: Secure and Efficient Compute for Your Enterprise

2025-06-11
talk
Scott Van Woudenberg (Databricks) , Jakob Mund (Databricks)

Modern data workloads span multiple sources: data lakes, databases, apps like Salesforce and services like cloud functions. But as teams scale, secure data access and governance across shared compute become critical. In this session, learn how to confidently integrate external data and services into your workloads using Spark and Unity Catalog on Databricks. We'll explore compute options like serverless, clusters, workflows and SQL warehouses, and show how Unity Catalog’s Lakeguard enforces fine-grained governance, even when multiple users share compute concurrently. Walk away ready to choose the right compute model for your team’s needs without sacrificing security or efficiency.

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

2025-06-11
lightning_talk
Allison Wang (Databricks) , LU QIU (LanceDB)

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format. Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large record-size data (e.g., images, tensors and embeddings), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics at the level of SQL. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.

Leveraging GenAI for Synthetic Data Generation to Improve Spark Testing and Performance in Big Data

2025-06-11
lightning_talk
Satej Kumar Sahu (Zalando SE)

Testing Spark jobs in local environments is often difficult due to the lack of suitable datasets, especially under tight timelines. This creates challenges when jobs work in development clusters but fail in production, or when they run locally but encounter issues in staging clusters due to inadequate documentation or checks. In this session, we’ll discuss how these challenges can be overcome by leveraging Generative AI to create custom synthetic datasets for local testing. By incorporating variations and sampling, a testing framework can be introduced to solve some of these challenges, allowing for the generation of realistic data to aid in performance and load testing. We’ll show how this approach helps identify performance bottlenecks early, optimize job performance and recognize scalability issues while keeping costs low. This methodology fosters better deployment practices and enhances the reliability of Spark jobs across environments.

Delta and Databricks as a Performant Exabyte-Scale Application Backend

2025-06-11
lightning_talk
Scott Schenkein (Capital One Financial)

The Delta Lake architecture promises to provide a single, highly functional, and high-scale copy of data that can be leveraged by a variety of tools to satisfy a broad range of use cases. To date, most use cases have focused on interactive data warehousing, ETL, model training, and streaming. Real-time access is generally delegated to costly and sometimes difficult-to-scale NoSQL, indexed storage, and domain-specific specialty solutions, which provide limited functionality compared to Spark on Delta Lake. In this session, we will explore the Delta data-skipping and optimization model and discuss how Capital One leveraged it along with Databricks Photon and Spark Connect to implement a real-time web application backend. We’ll share how we built a highly functional, performant and cost-effective security information and event management user experience (SIEM UX).
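
The data-skipping idea named in this abstract can be illustrated with a toy example. Delta Lake records per-file column statistics such as minimum and maximum values, and a reader can skip any file whose value range cannot satisfy a filter. The sketch below shows only that range-overlap check in plain Python. `FileStats`, `files_to_scan` and the file names are hypothetical and are not part of any Delta Lake or Databricks API, and this is not the implementation described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column (not a Delta API)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, lower, upper):
    """Return paths whose [min, max] range overlaps the filter [lower, upper].

    Files whose range lies entirely outside the filter cannot contain
    matching rows, so they are skipped without being read.
    """
    return [
        f.path
        for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


# Illustrative file-level stats for a numeric event-timestamp column.
stats = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

selected = files_to_scan(stats, lower=150, upper=220)
print(selected)  # ['part-1.parquet', 'part-2.parquet']
```

The check is most effective when similar values are stored together, because narrow per-file ranges let more files be ruled out for a given filter.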

Simplify Data Ingest and Egress with the New Python Data Source API

2025-06-11
talk
Craig Lukasik (Databricks)

Data engineering teams are frequently tasked with building bespoke ingest and/or egress solutions for myriad custom, proprietary, or industry-specific data sources or sinks. Many teams find this work cumbersome and time-consuming. Recognizing these challenges, Databricks interviewed numerous companies across different industries to better understand their diverse data integration needs. This comprehensive feedback led us to develop the Python Data Source API for Apache Spark™.

Highways and Hexagons: Processing Large Geospatial Datasets With H3

2025-06-10
talk
Olivia Ren (Databricks) , Petr Andreev (Mantel Group)

The problem of matching GPS locations to roads and local government areas (LGAs) involves handling large datasets and a number of geospatial operations. In this deep dive, we will outline the challenges of developing scalable solutions for these tasks. We will discuss our multi-step approach, first focusing on the use of H3 indexing to isolate matches with single candidates, then explaining the use of different geospatial computational techniques to accurately match points with multiple candidates. From a technical perspective, the talk will showcase the use of broadcasting and partitioning techniques and their effect on autoscaling, memory usage and effective data parallelization. This session is for anyone interested in geospatial data, Spark performance optimization and the real-world challenges of large-scale data engineering.
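
The two-step matching approach described above can be sketched in a few lines. To stay dependency-free, this illustration swaps H3 hexagons for a simple square grid and reduces each road to a single representative point, so it is a stand-in for the idea and not the speakers' method. `CELL_SIZE`, `cell_of`, `build_index`, `match_point` and the sample roads are all hypothetical.

```python
import math
from collections import defaultdict

CELL_SIZE = 1.0  # assumed grid resolution, standing in for an H3 resolution


def cell_of(x, y):
    """Map a coordinate to a square grid cell (a stand-in for an H3 cell id)."""
    return (math.floor(x / CELL_SIZE), math.floor(y / CELL_SIZE))


def build_index(roads):
    """Index each road under the cell containing its representative point."""
    index = defaultdict(list)
    for name, (x, y) in roads.items():
        index[cell_of(x, y)].append(name)
    return index


def match_point(point, roads, index):
    """Match a point to a road using the cell index first, geometry second."""
    candidates = index.get(cell_of(*point), [])
    if not candidates:
        return None  # no road indexed in this cell
    if len(candidates) == 1:
        return candidates[0]  # step 1: single candidate, no geometry needed
    # Step 2: several candidates, so fall back to a distance comparison.
    return min(candidates, key=lambda name: math.dist(point, roads[name]))


# Illustrative roads, each reduced to one representative point.
roads = {"A1": (0.5, 0.5), "B2": (2.2, 2.2), "C3": (2.8, 2.8)}
index = build_index(roads)

print(match_point((0.1, 0.9), roads, index))  # A1
print(match_point((2.7, 2.6), roads, index))  # C3
print(match_point((5.0, 5.0), roads, index))  # None
```

The cheap cell lookup resolves the single-candidate points, so the more expensive geometric comparison runs only for points whose cell holds several candidates.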

Real-Time Mode Technical Deep Dive: How We Built Sub-300 Millisecond Streaming Into Apache Spark™

2025-06-10
talk
Siying Dong (Databricks) , Jerry Peng (Databricks)

Real-time mode is a new low-latency execution mode for Apache Spark™ Structured Streaming. It can consistently provide p99 latencies less than 300 milliseconds for a broad set of stateless and stateful streaming queries. Our talk focuses on the technical aspects of making this possible in Spark. We’ll dive into the core architecture that enables these dramatic latency improvements, including a concurrent stage scheduler and a non-blocking shuffle. We’ll explore how we maintained Spark’s fault-tolerance guarantees, and we’ll also share specific optimizations we made to our streaming SQL operators. These architectural improvements have already enabled Databricks customers to build workloads with latencies up to 10x lower than before. Early adopters in our Private Preview have successfully implemented real-time enrichment pipelines and feature engineering for machine learning — use cases that were previously impossible at these latencies.

Saving Millions From Millions: Navigating Towards Cost-Efficiency in Pinterest's Spark Jobs

2025-06-10
talk
Nan Zhu (Pinterest)

While Spark offers powerful processing capabilities for massive data volumes, cost efficiency is a persistent challenge for users operating at large scale. At Pinterest, where we run millions of Spark jobs monthly, maintaining infrastructure cost efficiency is crucial to supporting our rapid business growth. To tackle this challenge, we have developed several strategies that have saved us tens of millions of dollars across numerous job instances. We will share our analytical methodology for identifying performance bottlenecks and the technical solutions we used to overcome various challenges. Our approach includes extracting insights from billions of collected metrics and leveraging remote shuffle services to address shuffle slowness, improve memory utilization and reduce costs while hosting hundreds of millions of pods. The presentation aims to trigger more discussion about the cost efficiency of Apache Spark and to help the community tackle this common challenge.

The Future of DSv2 in Apache Spark™

2025-06-10
talk
Anton Okolnychyi (Databricks)

DSv2, Spark's next-generation Catalog API, is gaining traction among data source developers. It shifts complexity to Apache Spark™, improves connector reliability and unlocks new functionality such as catalog federation, MERGE operations, storage-partitioned joins, aggregate pushdown, stored procedures and more. This session covers the design of DSv2, current strengths and gaps and its evolving roadmap. It's intended for Spark users and developers working with data sources, whether custom-built or off-the-shelf.

Data Intelligence for Cybersecurity Forum: Insights From SAP, Anvilogic, Capital One, and Wiz

2025-06-10
talk
Jiong Liu (Wiz) , Hemanth Varma Kusampudi (SAP) , Anil Chamarthy (Capital One) , Mackenzie Kyle (Anvilogic)

Join cybersecurity leaders from SAP, Anvilogic, Capital One, Wiz, and Databricks to explore how modern data intelligence is transforming security operations. Discover how SAP adopted a modular, AI-powered detection engineering lifecycle using Anvilogic on Databricks. Learn how Capital One built a detection and correlation engine leveraging Delta Lake, Apache Spark Streaming, and Databricks to process millions of cybersecurity events per second. Finally, see how Wiz and Databricks’ partnership enhances cloud security with seamless threat visibility. Through expert insights and live demos, gain strategies to build scalable, efficient cybersecurity powered by data and AI.

From Code to Insights: Leveraging Advanced Infrastructure and AI Capabilities.

2025-06-10
lightning_talk
Shweta Shetty (Insulet)

In this talk, we will explore how AI and advanced infrastructure are transforming Insulet's development and operations. We'll highlight how our innovations have reduced scrap part costs through manufacturing analytics, showcasing efficiency and cost savings. On the productivity side, Databricks AI solutions not only identify errors but also fix code and assist in writing complex queries. This goes beyond suggestions, providing actual solutions. On the infrastructure side, integrating Spark with Databricks simplifies setup and reduces costs. Additionally, Databricks Lakeflow Connect enables real-time updates and simplification without much coding as we integrate with Salesforce. We'll also discuss real-time processing of patient data, demonstrating how Databricks drives efficiency and productivity. Join us to learn how these innovations enhance efficiency, cost savings and performance.

Kernel, Catalog, Action! Reimagining our Delta-Spark Connector with DSv2

2025-06-10
lightning_talk
Scott Sandre (Databricks)

Delta Lake is redesigning its Spark connector through the combination of three key technologies: First, we're updating our Spark APIs to DSv2 to achieve deeper catalog integration and improved integration with the Spark optimizer. Second, we're fully integrating on top of Delta Kernel to take advantage of its simplified abstraction of Delta protocol complexities, accelerating feature adoption and improving maintainability. Third, we are transforming Delta to become a catalog-aware lakehouse format with Catalog Commits, enabling more efficient metadata management, governance and query performance. Join us to explore how we're advancing Delta Lake's architecture, pushing the boundaries of metadata management and creating a more intelligent, performant data lakehouse platform.

Empowering Progress: Building a Personalized Training Goal Ecosystem with Databricks

2025-06-10
talk

Tonal is the ultimate strength training system, giving you the expertise of a personal trainer and a full gym in your home. Through user interviews and social media feedback, we identified a consistent challenge: members found it difficult to measure their progress in their fitness journey. To address this, we developed the Training Goal (TG) ecosystem, a four-part solution that introduced new preference options to capture users' fitness aspirations, implemented weekly metrics that accumulate as members complete workouts, defined personalized weekly targets to guide progress, and enhanced workout details to show how each session contributes toward individual goals. We present how we leveraged Spark, MLflow, and Workflows within the Databricks ecosystem to compute TG metrics, manage model development, and orchestrate data pipelines. These tools allowed us to launch the TG system on schedule, supporting scalability, reliability, and a more meaningful, personalized way for members to track their progress.

From Days to Seconds — Reducing Query Times on Large Geospatial Datasets by 99%

2025-06-10
talk
Chris Crawford (Databricks) , Hobson Bryan (Global Water Security Center)

The Global Water Security Center translates environmental science into actionable insights for the U.S. Department of Defense. Prior to incorporating Databricks, responding to these requests required querying approximately five hundred thousand raster files representing over five hundred billion points. By leveraging lakehouse architecture, Databricks Auto Loader, Spark Streaming, Databricks Spatial SQL, H3 geospatial indexing and Databricks Liquid Clustering, we were able to drastically reduce our “time to analysis” from multiple business days to a matter of seconds. Now, our data scientists execute queries on pre-computed tables in Databricks, resulting in a “time to analysis” that is 99% faster, giving our teams more time for deeper analysis of the data. Additionally, we’ve incorporated Databricks Workflows, Databricks Asset Bundles, Git and Git Actions to support CI/CD across workspaces. We completed this work in close partnership with Databricks.

Genie for Engineering: Optimizing HVAC Design and Operational Insights With Data and AI

2025-06-10
talk

In this session, we will explore how Genie, an AI-driven platform, transformed HVAC operational insights by leveraging Databricks offerings like Apache Spark, Delta Lake and the Databricks Data Intelligence Platform. Key contributions: Real-time data processing: Lakeflow Declarative Pipelines and Apache Spark™ for efficient data ingestion and real-time analysis. Workflow orchestration: Databricks Data Intelligence Platform to orchestrate complex workflows and integrate various data sources and analytical tools. Field data integration: Incorporating real-time field data into design and algorithm development, enabling engineers to make informed adjustments and optimize performance. By analyzing real-time data from HVAC installations, Genie identified discrepancies between design specs and field performance, allowing engineers to optimize algorithms, reduce inefficiencies and improve customer satisfaction. Discover how Genie revolutionized HVAC management and how to apply these lessons to your own projects.

Automated Deployment with Databricks Asset Bundles

2025-06-10
talk

This course provides a comprehensive review of DevOps principles and their application to Databricks projects. It begins with an overview of core DevOps, DataOps, continuous integration (CI), continuous deployment (CD), and testing, and explores how these principles can be applied to data engineering pipelines. The course then focuses on continuous deployment within the CI/CD process, examining tools like the Databricks REST API, SDK, and CLI for project deployment. You will learn about Databricks Asset Bundles (DABs) and how they fit into the CI/CD process. You’ll dive into their key components, folder structure, and how they streamline deployment across various target environments in Databricks. You will also learn how to add variables, modify, validate, deploy, and execute Databricks Asset Bundles for multiple environments with different configurations using the Databricks CLI. Finally, the course introduces Visual Studio Code as an Integrated Development Environment (IDE) for building, testing, and deploying Databricks Asset Bundles locally, optimizing your development process. The course concludes with an introduction to automating deployment pipelines using GitHub Actions to enhance the CI/CD workflow with Databricks Asset Bundles. By the end of this course, you will be equipped to automate Databricks project deployments with Databricks Asset Bundles, improving efficiency through DevOps practices. Pre-requisites: Strong knowledge of the Databricks platform, including experience with Databricks Workspaces, Apache Spark, Delta Lake, the Medallion Architecture, Unity Catalog, Delta Live Tables, and Workflows. In particular, knowledge of leveraging Expectations with Lakeflow Declarative Pipelines. Labs: Yes. Certification Path: Databricks Certified Data Engineer Professional.

De-Risking Investment Decisions: QCG's Smarter Deal Evaluation Process Leveraging Databricks

2025-06-10
lightning_talk
Ian Brown (Quantum Capital Group)

Quantum Capital Group (QCG) screens hundreds of deals across the global Sustainable Energy Ecosystem, requiring deep technical due diligence. With over 1.5 billion records sourced from public, premium and proprietary datasets, their challenge was how to efficiently curate, analyze and share this data to drive smarter investment decisions. QCG partnered with Databricks & Tiger Analytics to modernize its data landscape. Using Delta tables, Spark SQL, and Unity Catalog, the team built a golden dataset that powers proprietary evaluation models and automates complex workflows. Data is now seamlessly curated, enriched and distributed — both internally and to external stakeholders — in a secure, governed and scalable way. This session explores how QCG’s investment in data intelligence has turned an overwhelming volume of information into a competitive advantage, transforming deal evaluation into a faster, more strategic process.

ViewShift: Dynamic Policy Enforcement With Spark and SQL Views

2025-06-10
lightning_talk
Khai Tran (LinkedIn) , Walaa Moustafa (LinkedIn)

Dynamic policy enforcement is increasingly critical in today's landscape, where data compliance is a top priority for companies, individuals, and regulators alike. In this talk, Walaa explores how LinkedIn has implemented a robust dynamic policy enforcement engine, ViewShift, and integrated it within its data lake. He will demystify LinkedIn's query engine stack by demonstrating how catalogs can automatically route table resolutions to compliance-enforcing SQL views. These SQL views possess several noteworthy properties. Auto-generated: They are created automatically from declarative data annotations. User-centric: They honor user-level consent and preferences. Context-aware: They apply different transformations tailored to specific use cases. Portable: Although the SQL logic is implemented in a single dialect, it remains accessible across all engines. Join this session to learn how ViewShift helps ensure that compliance is seamlessly integrated into data processing workflows.
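
The routing idea in this abstract, where a catalog resolves a table name to a compliance-enforcing view, can be pictured with a toy lookup. This is only a sketch under stated assumptions and is not LinkedIn's implementation: `VIEW_ROUTES`, `resolve_table`, and the table, view and use-case names are all hypothetical.

```python
# Hypothetical mapping from (table, use case) to a compliance-enforcing view.
VIEW_ROUTES = {
    ("members", "ads"): "members_ads_view",
    ("members", "analytics"): "members_analytics_view",
}


def resolve_table(table, use_case):
    """Return the relation a query should read for this table and use case.

    If a compliance view is registered for the pair, the query is routed to
    it; otherwise the base table name is returned unchanged.
    """
    return VIEW_ROUTES.get((table, use_case), table)


print(resolve_table("members", "ads"))        # members_ads_view
print(resolve_table("members", "analytics"))  # members_analytics_view
print(resolve_table("jobs", "ads"))           # jobs
```

Because the substitution happens at resolution time, a query that names `members` needs no changes. A production system would likely also need a policy for tables with no registered view, which this sketch simply passes through.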

Lakeflow Declarative Pipelines Integrations and Interoperability: Get Data From — and to — Anywhere

2025-06-10
talk
Ryan Nienhuis (Databricks)

This session is repeated. In this session, you will learn how to integrate Lakeflow Declarative Pipelines with external systems in order to ingest and send data virtually anywhere. Lakeflow Declarative Pipelines is most often used for ingestion and ETL into the Lakehouse. New Lakeflow Declarative Pipelines capabilities like the Lakeflow Declarative Pipelines Sinks API and added support for Python Data Source and ForEachBatch have opened up Lakeflow Declarative Pipelines to support almost any integration. This includes popular Apache Spark™ integrations like JDBC, Kafka, external and managed Delta tables, Azure Cosmos DB, MongoDB and more.

How Databricks Powers Real-Time Threat Detection at Barracuda XDR

2025-06-10
talk
Alex Dangel (Barracuda Networks) , Merium Khalid (Barracuda Networks)

As cybersecurity threats grow in volume and complexity, organizations must efficiently process security telemetry for best-in-class detection and mitigation. Barracuda’s XDR platform is redefining security operations by layering advanced detection methodologies over a broad range of supported technologies. Our vision is to deliver unparalleled protection through automation, machine learning and scalable detection frameworks, ensuring threats are identified and mitigated quickly. To achieve this, we have adopted Databricks as the foundation of our security analytics platform, providing greater control and flexibility while decoupling from traditional SIEM tools. By leveraging Lakeflow Declarative Pipelines, Spark Structured Streaming and detection-as-code CI/CD pipelines, we have built a real-time detection engine that enhances scalability, accuracy and cost efficiency. This session explores how Databricks is shaping the future of XDR through real-time analytics and cloud-native security.

Scaling XGBoost With Spark Connect ML on Grace Blackwell

2025-06-10
talk
Bobby Wang (NVIDIA) , Jiaming Yuan (NVIDIA Semiconductor Co., Ltd)

XGBoost is one of the off-the-shelf gradient boosting algorithms for analyzing tabular datasets. Unlike deep learning, gradient-boosting decision trees require the entire dataset to be in memory for efficient model training. To overcome this limitation, XGBoost features a distributed out-of-core implementation that fetches data in batches, which benefits significantly from the latest NVIDIA GPUs and the high bandwidth of NVLink-C2C. In this talk, we will share our work on optimizing XGBoost using the Grace Blackwell super chip. The fast chip-to-chip link between the CPU and the GPU enables XGBoost to scale up without compromising performance. Our work has effectively increased XGBoost’s training capacity to over 1.2TB on a single node. The approach is scalable to GPU clusters using Spark, enabling XGBoost to handle terabytes of data efficiently. We will demonstrate combining XGBoost out-of-core algorithms with the latest Spark Connect ML from Spark 4.0 for large model training workflows.

Spark 4.0 and Delta 4.0 For Streaming Data

2025-06-10
talk
Bryce Bartmann (Shell)

Real-time data is one of the most important datasets for any Data and AI Platform across any industry. Spark 4.0 and Delta 4.0 include new features that make ingestion and querying of real-time data better than ever before. These features include: Python custom data sources for simple ingestion of streaming and batch time series data sources using Spark; Variant types for managing variable data types and JSON payloads that are common in the real-time domain; and Delta liquid clustering for simple data clustering without the overhead or complexity of partitioning. In this presentation you will learn how data teams can leverage these latest features to build industry-leading, real-time data products using Spark and Delta. The presentation includes real-world examples and metrics of the improvements these features make in performance and processing of data in the real-time space.

Spark Connect: Flexible, Local Access to Apache Spark at Scale

2025-06-10
talk
James Malone (Databricks)

What if you could run Spark jobs without worrying about clusters, versions and upgrades? Did you know Spark has this functionality built in today? Join us to take a look at this functionality: Spark Connect. We'll dig into how Spark Connect works, abstracting Spark clusters away in favor of the DataFrame API and unresolved logical plans. You will learn some of the cool things Spark Connect unlocks, including: moving you from thinking about clusters to just thinking about jobs; making Spark code more portable and platform-agnostic; and enabling support for languages such as Go.
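
The phrase "DataFrame API and unresolved logical plans" can be made concrete with a toy client. In the sketch below, DataFrame-style calls only record a plan tree on the client, and nothing is validated or executed until a server receives it. This is an illustration of the concept only: `LazyFrame` and its methods are hypothetical, and JSON is used purely for readability. It is not the Spark Connect wire format, which is based on gRPC and protocol buffers.

```python
import json


class LazyFrame:
    """Toy client-side DataFrame that only records a plan (hypothetical)."""

    def __init__(self, plan):
        self.plan = plan

    @classmethod
    def table(cls, name):
        # The table name is not checked here; that is what "unresolved" means.
        return cls({"op": "read", "table": name})

    def filter(self, condition):
        # Wrap the existing plan in a filter node without evaluating anything.
        return LazyFrame({"op": "filter", "condition": condition, "child": self.plan})

    def select(self, *columns):
        # Wrap the existing plan in a projection node.
        return LazyFrame({"op": "project", "columns": list(columns), "child": self.plan})

    def to_message(self):
        """Serialize the plan; a server would resolve and execute it."""
        return json.dumps(self.plan, sort_keys=True)


df = LazyFrame.table("events").filter("status = 'ok'").select("id", "ts")
print(df.to_message())
```

Because the client holds only a description of the work, it needs no cluster libraries of its own, which is what makes thin clients in other languages plausible.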

From Spaghetti Bowl Pipeline to Lakeflow Declarative Pipelines Efficiency

2025-06-10
lightning_talk
Peter Jones (Intermountain Healthcare)

In today's data-driven world, the ability to efficiently manage and transform data is crucial for any organization. This presentation will explore the process of converting a complex and messy workflow into clean and simple Lakeflow Declarative Pipelines at a large integrated health system, Intermountain Health. Alteryx is a powerful tool for data preparation and blending, but as workflows grow in complexity, they can become difficult to manage and maintain. Lakeflow Declarative Pipelines, on the other hand, offers a more democratized, streamlined and scalable approach to data engineering, leveraging the power of Apache Spark and Delta Lake. We will begin by examining a typical legacy workflow, identifying common pain points such as tangled logic, performance bottlenecks and maintenance challenges. Next, we will demonstrate how to translate this workflow into Lakeflow Declarative Pipelines, highlighting key steps such as data transformation, validation and delivery.