talk-data.com

Event

Data + AI Summit 2025

2025-06-09 – 2025-06-13 · Databricks

Activities tracked

52

Filtering by: API

Sessions & talks

Showing 1–25 of 52 · Newest first

Advanced Governance and Auth With Databricks Apps

2025-06-12
talk
Andre Furlan Bueno (Databricks), Doug Judice (Addepar)

Explore advanced governance and authentication patterns for building secure, enterprise-grade apps with Databricks Apps. Learn how to configure complex permissions and manage access control using Unity Catalog. We’ll dive into “on-behalf-of-user” authentication — allowing agents to enforce user-specific access controls — and cover API-based authentication, including PATs and OAuth flows for external integrations. We’ll also highlight how Addepar uses these capabilities to securely build and scale applications that handle sensitive financial data. Whether you're building internal tools or customer-facing apps, this session will equip you with the patterns and tools to ensure robust, secure access in your Databricks apps.
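The API-based authentication patterns this abstract mentions (PATs and OAuth flows) both come down to attaching a bearer credential to each REST call. A minimal sketch using only the standard library; the workspace URL, endpoint, and token below are hypothetical placeholders, not values from this session:

```python
import urllib.request

def databricks_request(workspace_url: str, endpoint: str, token: str) -> urllib.request.Request:
    """Build an authenticated Databricks REST API request.

    The header is the same whether `token` is a personal access token
    (PAT) or an OAuth access token obtained from an OAuth flow: both
    are sent as a Bearer credential.
    """
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.1{endpoint}",
        headers={"Authorization": f"Bearer {token}"},
    )

# Hypothetical placeholders -- substitute your own workspace and token.
req = databricks_request(
    "https://example.cloud.databricks.com",
    "/unity-catalog/catalogs",
    "dapiXXXX",
)
print(req.get_header("Authorization"))  # Bearer dapiXXXX
```

The request object is built but not sent; in practice you would pass it to `urllib.request.urlopen` (or use an HTTP client of your choice) and handle the JSON response.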

Automating Taxonomy Generation With Compound AI on Databricks

2025-06-12
talk
Allistair Cota (Lovelytics), Sudhir Gajre (Lovelytics)

Taxonomy generation is a challenge across industries such as retail, manufacturing and e-commerce. Incomplete or inconsistent taxonomies can lead to fragmented data insights, missed monetization opportunities and stalled revenue growth. In this session, we will explore a modern approach to solving this problem by leveraging the Databricks platform to build a scalable compound AI architecture for automated taxonomy generation. The first half of the session will walk you through the business significance and implications of taxonomy, followed by a technical deep dive into building an architecture for taxonomy implementation on the Databricks platform using a compound AI approach. We will walk attendees through the anatomy of taxonomy generation, showcasing an innovative solution that combines multimodal and text-based LLMs, internal data sources and external API calls. This ensemble approach ensures more accurate, comprehensive and adaptable taxonomies that align with business needs.

Breaking Up With Spark Versions: Client APIs, AI-Powered Automatic Updates, and Dependency Management for Databricks Serverless

2025-06-12
talk
Justin Breese (Databricks)

This session explains how we've made Apache Spark™ versionless for end users by introducing a stable client API, environment versioning and automatic remediation. These capabilities have enabled the auto-upgrade of hundreds of millions of workloads with minimal disruption for Serverless Notebooks and Jobs. We'll also introduce a new approach to dependency management using environments. Admins will learn how to speed up package installation with Default Base Environments, and users will see how to manage custom environments for their own workloads.

Evaluation-Driven Development Workflows: Best Practices and Real-World Scenarios

2025-06-12
talk
Wenwen Xie (Databricks), Arthur Dooner (Databricks)

In enterprise AI, Evaluation-Driven Development (EDD) ensures reliable, efficient systems by embedding continuous assessment and improvement into the AI development lifecycle. High-quality evaluation datasets are created using techniques like document analysis, synthetic data generation via Mosaic AI’s synthetic data generation API, SME validation, and relevance filtering, reducing manual effort and accelerating workflows. EDD focuses on metrics such as context relevance, groundedness, and response accuracy to identify and address issues like retrieval errors or model limitations. Custom LLM judges, tailored to domain-specific needs like PII detection or tone assessment, enhance evaluations. By leveraging tools like the Mosaic AI Agent Framework, Agent Evaluation and MLflow, EDD automates data tracking, streamlines workflows, and quantifies improvements, transforming AI development to deliver scalable, high-performing systems that drive measurable organizational value.

Supercharging Sales Intelligence: Processing Billions of Events via Structured Streaming

2025-06-12
talk
Anurag Bharati (DigiCert), Nikita Raje (DigiCert)

DigiCert is a digital security company that provides digital certificates, encryption and authentication services and serves 88% of the Fortune 500, securing over 28 billion web connections daily. Our project aggregates and analyzes certificate transparency logs via public APIs to provide comprehensive market and competitive intelligence. Instead of relying on third-party providers with limited data, our project gives full control, deeper insights and automation. Databricks has helped us reliably poll public APIs in a scalable manner that fetches millions of events daily, then deduplicate the events and store them in our Delta tables. We specifically use Spark for parallel processing, Structured Streaming for real-time ingestion and deduplication, Delta tables for data reliability, and pools and jobs to keep our costs optimized. These technologies help us keep our data fresh, accurate and cost-effective. This data has helped our sales team with real-time intelligence, ensuring DigiCert's success.
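The dedup-on-ingest step described above (Structured Streaming dropping duplicates on an event key) can be illustrated with a plain-Python sketch, independent of Spark; the event shape and field names are hypothetical:

```python
def deduplicate(events):
    """Keep the first occurrence of each event id, mimicking the
    effect of Structured Streaming's dropDuplicates on a key column.
    (In Spark, a watermark bounds how long keys are held in state;
    this sketch simply keeps them all.)"""
    seen = set()
    unique = []
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            unique.append(event)
    return unique

# Hypothetical certificate-transparency events, with one duplicate delivery.
batch = [
    {"id": "ct-1", "domain": "a.example"},
    {"id": "ct-2", "domain": "b.example"},
    {"id": "ct-1", "domain": "a.example"},  # duplicate
]
print([e["id"] for e in deduplicate(batch)])  # ['ct-1', 'ct-2']
```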

Sponsored by: DataHub | Beyond the Lakehouse: Supercharging Databricks with Contextual Intelligence

2025-06-12
lightning_talk
Gabriel Lyons (DataHub)

While Databricks powers your data lakehouse, DataHub delivers the critical context layer connecting your entire ecosystem. We'll demonstrate how DataHub extends Unity Catalog to provide comprehensive metadata intelligence across platforms. With DataHub's real-time platform you can:
Cut AI model time-to-market with unified REST and GraphQL APIs that ensure models train on reliable, compliant data from across platforms, with complete lineage tracking
Decrease data incidents by 60% using an event-driven architecture that instantly propagates changes across systems
Transform data discovery from days to minutes with AI-powered search and natural language interfaces
Leaders use DataHub to transform Databricks data into integrated insights that drive business value. See our demo of syncback technology, which detects sensitive data and automatically enforces Databricks access controls, plus our AI assistant that enhances LLMs with cross-platform metadata.

Eliminate Hops in Your Streaming Architecture with Zerobus, Part of Lakeflow Connect

2025-06-12
talk
Victoria Bukta (Databricks), Nikola Obradovic (Databricks)

In this session, we’ll introduce the Zerobus Direct Write API, part of Lakeflow Connect, which enables you to push data directly to your lakehouse and simplify ingestion for IoT, clickstreams, telemetry, and more. We’ll start with an overview of the ingestion landscape to date. Then, we'll cover how you can “shift left” with Zerobus, embedding data ingestion into your operational systems to make analytics and AI a core component of the business, rather than an afterthought. The result is a significantly simpler architecture that scales your operations, using this new paradigm to skip unnecessary hops. We'll also highlight one of our early customers, Joby Aviation, and how they use Zerobus. Finally, we’ll provide a framework to help you understand when to use Zerobus versus other ingestion offerings—and we’ll wrap up with a live Q&A so that you can hit the ground running with your own use cases.

What’s New in Apache Spark™ 4.0?

2025-06-12
talk
Wenchen Fan (Databricks), Daniel Tenedorio (Databricks)

Join this session for a concise tour of Apache Spark™ 4.0’s most notable enhancements:
SQL features: ANSI by default, scripting, SQL pipe syntax, SQL UDFs, session variables, view schema evolution, etc.
Data types: VARIANT type, string collation
Python features: Python data source, plotting API, etc.
Streaming improvements: state store data source, state store checkpoint v2, arbitrary state v2, etc.
Spark Connect improvements: more API coverage, thin client, unified Scala interface, etc.
Infrastructure: better error messages, structured logging, new Java/Scala version support, etc.
Whether you’re a seasoned Spark user or new to the ecosystem, this talk will prepare you to leverage Spark 4.0’s latest innovations for modern data and AI pipelines.

Healthcare Interoperability: End-to-End Streaming FHIR Pipelines With Databricks & Redox

2025-06-12
talk
Tim Kessler (Redox, Inc.), Matthew Giglia (Databricks)

The direct integration between Redox and Databricks can streamline your interoperability workflows, from responding to preauthorization requests in record time to letting attending physicians know, in near real time from ADT feeds, about a change in a patient's risk of sepsis or readmission. Data engineers will learn how to create fully streaming ETL pipelines for ingesting, parsing and acting on insights from Redox FHIR bundles delivered directly to Unity Catalog volumes. Once the data is available in the lakehouse, AI/BI Dashboards and agentic frameworks help write FHIR messages back to Redox for direct pushdown to EMR systems. Parsing FHIR bundle resources has never been easier with SQL combined with the new VARIANT data type in Delta and streaming table creation against Serverless DBSQL Warehouses. We'll also use the Databricks accelerators dbignite and redoxwrite for writing and posting FHIR bundles back to Redox-integrated EMRs, and we'll extend AI/BI with Unity Catalog SQL UDFs and the Redox API for use in Genie.
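The VARIANT workflow this abstract describes amounts to path-based access into semi-structured FHIR JSON. The shape of that access can be sketched in plain Python; the bundle below is a hypothetical, heavily abbreviated stand-in for a real FHIR bundle, and `variant_get` here is a toy analogue of SQL path extraction, not the Databricks function itself:

```python
import json

def variant_get(doc, *path):
    """Toy analogue of SQL path access into a VARIANT column:
    walk a parsed JSON document by dict keys and list indexes."""
    for step in path:
        doc = doc[step]
    return doc

# Hypothetical, heavily abbreviated FHIR bundle.
bundle = json.loads("""
{"resourceType": "Bundle",
 "entry": [{"resource": {"resourceType": "Patient", "id": "p1"}}]}
""")

print(variant_get(bundle, "entry", 0, "resource", "resourceType"))  # Patient
```

In SQL against a VARIANT column, the equivalent extraction is a path expression over the stored bundle rather than an explicit loop.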

Introducing Simplified State Tracking in Apache Spark™ Structured Streaming

2025-06-12
lightning_talk
Craig Lukasik (Databricks)

This presentation will review the new change feed and snapshot capabilities in Apache Spark™ Structured Streaming’s State Reader API. The State Reader API enables users to access and analyze Structured Streaming's internal state data. Attendees will learn how to leverage the new features to debug, troubleshoot and analyze state changes efficiently, making streaming workloads easier to manage at scale.

Powering Secure and Scalable Data Governance at PepsiCo With Unity Catalog Open APIs

2025-06-12
talk
Dipankar Kushari (Databricks), Sudipta Das (PepsiCo)

PepsiCo, given its scale, has numerous teams leveraging different tools and engines to access data and perform analytics and AI. To streamline governance across this diverse ecosystem, PepsiCo unifies its data and AI assets under an open and enterprise-grade governance framework with Unity Catalog. In this session, we'll explore real-world examples of how PepsiCo extends Unity Catalog’s governance to all its data and AI assets, enabling secure collaboration even for teams outside Databricks. Learn how PepsiCo architects permissions using service principals and service accounts to authenticate with Unity Catalog, building a multi-engine architecture with seamless and open governance. Attendees will gain practical insights into designing a scalable, flexible data platform that unifies governance across all teams while embracing openness and interoperability.

Next-Gen Data Science: How Posit and Databricks Are Transforming Analytics at Scale

2025-06-11
lightning_talk
James Blair (Posit, PBC)

Modern data science teams face the challenge of navigating complex landscapes of languages, tools and infrastructure. Positron, Posit’s next-generation IDE, offers a powerful environment tailored for data science, seamlessly integrating with Databricks to empower teams working in Python and R. Now integrated within Posit Workbench, Positron enables data scientists to efficiently develop, iterate and analyze data with Databricks — all while maintaining their preferred workflows. In this session, we’ll explore how Python and R users can develop, deploy and scale their data science workflows by combining Posit tools with Databricks. We’ll showcase how Positron simplifies development for both Python and R and how Posit Connect enables seamless deployment of applications, reports and APIs powered by Databricks. Join us to see how Posit + Databricks create a frictionless, scalable and collaborative data science experience — so your teams can focus on insights, not infrastructure.

American Airlines Flies to New Heights with Data Intelligence

2025-06-11
talk
Saimahesh Chava (American Airlines), Yash Joshi (American Airlines)

American Airlines migrated from Hive Metastore to Unity Catalog using automated processes with Databricks APIs and GitHub Actions. This automation streamlined the migration for many applications within AA, ensuring consistency, efficiency and minimal disruption while enhancing data governance and disaster recovery capabilities.

Building Tool-Calling Agents With Databricks Agent Framework and MCP

2025-06-11
talk
Siddharth Murching (Databricks), Elise Gonzales (Databricks)

Want to create AI agents that can do more than just generate text? Join us to explore how combining Databricks' Mosaic AI Agent Framework with the Model Context Protocol (MCP) unlocks powerful tool-calling capabilities. We'll show you how MCP provides a standardized way for AI agents to interact with external tools, data and APIs, solving the headache of fragmented integration approaches. Learn to build agents that can retrieve both structured and unstructured data, execute custom code and tackle real enterprise challenges. Key takeaways:
Implementing MCP-enabled tool-calling in your AI agents
Prototyping in AI Playground and exporting for deployment
Integrating Unity Catalog functions as agent tools
Ensuring governance and security for enterprise deployments
Whether you're building customer service bots or data analysis assistants, you'll leave with practical know-how to create powerful, governed AI agents.

Extending the Lakehouse: Power Interoperable Compute With Unity Catalog Open APIs

2025-06-11
talk
Tathagata Das (Databricks), Michelle Leon (Databricks)

The lakehouse is built for storage flexibility, but what about compute? In this session, we’ll explore how Unity Catalog enables you to connect and govern multiple compute engines across your data ecosystem. With open APIs and support for the Iceberg REST Catalog, UC lets you extend access to engines like Trino, DuckDB, and Flink while maintaining centralized security, lineage, and interoperability. Through some exciting demos, we'll show how you can get started today with engines like Apache Spark and Starburst reading and writing to UC managed tables. Learn how to bring flexibility to your compute layer without compromising control.

Intuit's Privacy-Safe Lending Marketplace: Leveraging Databricks Clean Rooms

2025-06-11
talk
Anurag Malik (Intuit Inc.)

Intuit leverages Databricks Clean Rooms to create a secure, privacy-safe lending marketplace, enabling small business lending partners to perform analytics and deploy ML/AI workflows on sensitive data assets. This session explores the technical foundations of building isolated clean rooms across multiple partners and cloud providers, differentiating Databricks Clean Rooms from market alternatives. We'll demonstrate our automated approach to clean room lifecycle management using APIs, covering creation, collaborator onboarding, data asset sharing, workflow orchestration and activity auditing. The integration with Unity Catalog for managing clean room inputs and outputs will also be discussed. Attendees will gain insights into harnessing collaborative ML/AI potential, supporting various languages and workloads, and enabling complex computations without compromising sensitive information in Clean Rooms.

Apache Spark — Ask Us Anything

2025-06-11
lightning_talk
Allison Wang (Databricks), Jules Damji (Databricks), DB Tsai (Databricks)

Join us for an interactive Ask Me Anything (AMA) session on the latest advancements in Apache Spark 4, including Spark Connect — the new client-server architecture enabling seamless integration with IDEs, notebooks and custom applications. Learn about performance improvements, enhanced APIs and best practices for leveraging Spark’s next-generation features. Whether you're a data engineer, Spark developer or big data enthusiast, bring your questions on architecture, real-world use cases and how these innovations can optimize your workflows. Don’t miss this chance to dive deep into the future of distributed computing with Spark!

Sponsored by: Boomi, LP | From Pipelines to Agents: Manage Data and AI on One Platform for Maximum ROI

2025-06-11
talk

In the age of agentic AI, competitive advantage lies not only in AI models, but in the quality of the data agents reason on and the agility of the tools that feed them. To fully realize the ROI of agentic AI, organizations need a platform that enables high-quality data pipelines and provides scalable, enterprise-grade tools. In this session, discover how a unified platform for integration, data management, MCP server management, API management, and agent orchestration can help you to bring cohesion and control to how data and agents are used across your organization.

Sponsored by: EY | Unlocking Value Through AI at Takeda Pharmaceuticals

2025-06-11
lightning_talk
Naveed Afzal (Takeda)

In the rapidly evolving landscape of pharmaceuticals, the integration of AI and GenAI is transforming how organizations operate and deliver value. We will explore the profound impact of the AI program at Takeda Pharmaceuticals and the central role of Databricks. We will delve into eight pivotal AI/GenAI use cases that enhance operational efficiency across commercial, R&D, manufacturing, and back-office functions, including these capabilities:
Responsible AI Guardrails: scanners that validate and enforce responsible AI controls on GenAI solutions
Reusable Databricks Native Vectorization Pipeline: a scalable solution enhancing data processing with quality and governance
One-Click Deployable RAG Pattern: simplifying deployment for AI applications, enabling rapid experimentation and innovation
AI Asset Registry: a repository for foundational models, vector stores, and APIs, promoting reuse and collaboration

Breaking Barriers: Building Custom Spark 4.0 Data Connectors with Python

2025-06-11
talk
Sourav Gulati (Databricks), Ashish Saraswat (Databricks)

Building a custom Spark data source connector once required Java or Scala expertise, making it complex and limiting. This left many proprietary data sources without public SDKs disconnected from Spark. Additionally, data sources with Python SDKs couldn't harness Spark’s distributed power. Spark 4.0 changes this with a new Python API for data source connectors, allowing developers to build fully functional connectors without Java or Scala. This unlocks new possibilities, from integrating proprietary systems to leveraging untapped data sources. Supporting both batch and streaming, this API makes data ingestion more flexible than ever. In this talk, we’ll demonstrate how to build a Spark connector for Excel using Python, showcasing schema inference, data reads/writes and streaming support. Whether you're a data engineer or Spark enthusiast, you’ll gain the knowledge to integrate Spark with any data source — entirely in Python.
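The connector model described here follows a partition-then-read shape: the source reports its partitions, and Spark invokes a reader per partition. A plain-Python sketch of that shape, independent of Spark; the class and workbook below are hypothetical illustrations that mirror the pattern of the Python Data Source API's `partitions()`/`read(partition)` contract, not its exact signatures:

```python
class ExcelSheetReader:
    """Toy stand-in for a per-source reader: each sheet becomes one
    partition, and read(partition) yields rows as tuples, the same
    general shape the Spark 4.0 Python Data Source API expects from
    a reader's partitions() and read(partition) methods."""

    def __init__(self, workbook):
        self.workbook = workbook  # {sheet_name: [row, ...]}

    def partitions(self):
        # One partition per sheet, so sheets can be read in parallel.
        return list(self.workbook)

    def read(self, partition):
        # Yield the rows of a single partition (one sheet).
        for row in self.workbook[partition]:
            yield tuple(row)

# Hypothetical two-sheet workbook.
wb = {"Q1": [["jan", 100], ["feb", 120]], "Q2": [["apr", 90]]}
reader = ExcelSheetReader(wb)
rows = [r for p in reader.partitions() for r in reader.read(p)]
print(rows)  # [('jan', 100), ('feb', 120), ('apr', 90)]
```

In a real connector, a `DataSource` subclass would declare the schema and hand Spark a reader like this one, and Spark would distribute the partition reads across executors.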

Sponsored by: Google Cloud | Building Powerful Agentic Ecosystems with Google Cloud's A2A

2025-06-11
talk
Sivakumar Nagapandi (SAP), Naveen Punjabi (Google Cloud), Matt Kixmoeller (Glean), Sean Falconer (Confluent)

This session unveils Google Cloud's Agent2Agent (A2A) protocol, ushering in a new era of AI interoperability where diverse agents collaborate seamlessly to solve complex enterprise challenges. Join our panel of experts to discover how A2A empowers you to deeply integrate these collaborative AI systems with your existing enterprise data, custom APIs, and critical workflows. Ultimately, learn to build more powerful, versatile, and securely managed agentic ecosystems by combining specialized Google-built agents with your own custom solutions (Vertex AI or no-code). Extend this ecosystem further by serving these agents with Databricks Model Serving and governing them with Unity Catalog for consistent security and management across your enterprise.

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

2025-06-11
talk
DB Tsai (Databricks), Xiao Li (Databricks)

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

2025-06-11
lightning_talk
Allison Wang (Databricks), Lu Qiu (LanceDB)

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format. Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large record-size data (e.g., images, tensors, embeddings, etc), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics to the level of SQL. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.

Sponsored by: Fivetran | Scalable Data Ingestion: Building custom pipelines with the Fivetran Connector SDK and Databricks

2025-06-11
talk
Kelly Kohlleffel (Fivetran), CL Abeel (Fivetran)

Organizations have hundreds of data sources, some of which are very niche or difficult to access. Incorporating this data into your lakehouse requires significant time and resources, hindering your ability to work on more value-add projects. Enter the Fivetran Connector SDK: a powerful new tool that enables your team to create custom pipelines for niche systems, custom APIs, and sources with specific data filtering requirements, seamlessly integrating with Databricks. During this session, Fivetran will demonstrate how to (1) leverage the Connector SDK to build scalable connectors, enabling the ingestion of diverse data into Databricks, (2) gain flexibility and control over historical and incremental syncs, delete capture, state management, multithreaded data extraction, and custom schemas, and (3) use practical examples, code snippets, and architectural considerations to overcome data integration challenges and unlock the full potential of your Databricks environment.

What’s New in PySpark: TVFs, Subqueries, Plots, and Profilers

2025-06-11
talk
Takuya Ueshin (Databricks), Xinrong Meng (Databricks)

PySpark’s DataFrame API is evolving to support more expressive and modular workflows. In this session, we’ll introduce two powerful additions: table-valued functions (TVFs) and the new subquery API. You’ll learn how to define custom TVFs using Python User-Defined Table Functions (UDTFs), including support for polymorphism, and how subqueries can simplify complex logic. We’ll also explore how lateral joins connect these features, followed by practical tools for the PySpark developer experience—such as plotting, profiling, and a preview of upcoming capabilities like UDF logging and a Python-native data source API. Whether you're building production pipelines or extending PySpark itself, this talk will help you take full advantage of the latest features in the PySpark ecosystem.
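A table-valued function, as described above, is simply a function that returns rows rather than a single value; PySpark's Python UDTFs express this as a class whose `eval` method yields tuples. A plain-Python sketch of that `eval` contract, with no Spark session involved; the class is a hypothetical, simplified mirror of the pattern, not decorated or registered as a real UDTF:

```python
class SplitWords:
    """Mimics the eval() contract of a PySpark Python UDTF:
    for each input row, yield zero or more output rows as tuples."""

    def eval(self, text: str):
        for position, word in enumerate(text.split()):
            yield (word, position)

# Apply the "UDTF" to each input row and flatten the results,
# which is roughly what a lateral join against a TVF produces.
inputs = ["hello world", "spark"]
rows = [row for text in inputs for row in SplitWords().eval(text)]
print(rows)  # [('hello', 0), ('world', 1), ('spark', 0)]
```

In PySpark, the same class (plus a declared output schema) would be registered with the UDTF machinery and invoked from the DataFrame API or SQL, with Spark handling the per-row expansion.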