talk-data.com

Event

Data + AI Summit 2025

2025-06-09 – 2025-06-13 · Databricks

Activities tracked

52

Filtering by: API

Sessions & talks

Showing 1–25 of 52 · Newest first

Advanced Governance and Auth With Databricks Apps

2025-06-12
talk
Andre Furlan Bueno (Databricks), Doug Judice (Addepar)

Explore advanced governance and authentication patterns for building secure, enterprise-grade apps with Databricks Apps. Learn how to configure complex permissions and manage access control using Unity Catalog. We’ll dive into “on-behalf-of-user” authentication — allowing agents to enforce user-specific access controls — and cover API-based authentication, including PATs and OAuth flows for external integrations. We’ll also highlight how Addepar uses these capabilities to securely build and scale applications that handle sensitive financial data. Whether you're building internal tools or customer-facing apps, this session will equip you with the patterns and tools to ensure robust, secure access in your Databricks apps.
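The API-based authentication patterns this abstract mentions (PATs and OAuth flows) both come down to attaching a bearer credential to each REST call. A minimal sketch using only the standard library; the workspace URL, endpoint, and token below are hypothetical placeholders, not values from this session:

```python
import urllib.request

def databricks_request(workspace_url: str, endpoint: str, token: str) -> urllib.request.Request:
    """Build an authenticated Databricks REST API request.

    The header is the same whether `token` is a personal access token
    (PAT) or an OAuth access token obtained from an OAuth flow: both
    are sent as a Bearer credential.
    """
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.1{endpoint}",
        headers={"Authorization": f"Bearer {token}"},
    )

# Hypothetical placeholders -- substitute your own workspace and token.
req = databricks_request(
    "https://example.cloud.databricks.com",
    "/unity-catalog/catalogs",
    "dapiXXXX",
)
print(req.get_header("Authorization"))  # Bearer dapiXXXX
```

The request object is built but not sent; in practice you would pass it to `urllib.request.urlopen` (or use an HTTP client of your choice) and handle the JSON response.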

Automating Taxonomy Generation With Compound AI on Databricks

2025-06-12
talk
Allistair Cota (Lovelytics), Sudhir Gajre (Lovelytics)

Taxonomy generation is a challenge across industries such as retail, manufacturing and e-commerce. Incomplete or inconsistent taxonomies can lead to fragmented data insights, missed monetization opportunities and stalled revenue growth. In this session, we will explore a modern approach to solving this problem by leveraging the Databricks platform to build a scalable compound AI architecture for automated taxonomy generation. The first half of the session will walk you through the business significance and implications of taxonomy, followed by a technical deep dive into building an architecture for taxonomy implementation on the Databricks platform using a compound AI approach. We will walk attendees through the anatomy of taxonomy generation, showcasing an innovative solution that combines multimodal and text-based LLMs, internal data sources and external API calls. This ensemble approach ensures more accurate, comprehensive and adaptable taxonomies that align with business needs.

Breaking Up With Spark Versions: Client APIs, AI-Powered Automatic Updates, and Dependency Management for Databricks Serverless

2025-06-12
talk
Justin Breese (Databricks)

This session explains how we've made Apache Spark™ versionless for end users by introducing a stable client API, environment versioning and automatic remediation. These capabilities have enabled the auto-upgrade of hundreds of millions of workloads with minimal disruption for Serverless Notebooks and Jobs. We'll also introduce a new approach to dependency management using environments. Admins will learn how to speed up package installation with Default Base Environments, and users will see how to manage custom environments for their own workloads.

Evaluation-Driven Development Workflows: Best Practices and Real-World Scenarios

2025-06-12
talk
Wenwen Xie (Databricks), Arthur Dooner (Databricks)

In enterprise AI, Evaluation-Driven Development (EDD) ensures reliable, efficient systems by embedding continuous assessment and improvement into the AI development lifecycle. High-quality evaluation datasets are created using techniques like document analysis, synthetic data generation via Mosaic AI’s synthetic data generation API, SME validation, and relevance filtering, reducing manual effort and accelerating workflows. EDD focuses on metrics such as context relevance, groundedness, and response accuracy to identify and address issues like retrieval errors or model limitations. Custom LLM judges, tailored to domain-specific needs like PII detection or tone assessment, enhance evaluations. By leveraging tools like the Mosaic AI Agent Framework, Agent Evaluation and MLflow, EDD automates data tracking, streamlines workflows, and quantifies improvements, transforming AI development to deliver scalable, high-performing systems that drive measurable organizational value.

Supercharging Sales Intelligence: Processing Billions of Events via Structured Streaming

2025-06-12
talk
Anurag Bharati (DigiCert), Nikita Raje (DigiCert)

DigiCert is a digital security company that provides digital certificates, encryption and authentication services and serves 88% of the Fortune 500, securing over 28 billion web connections daily. Our project aggregates and analyzes certificate transparency logs via public APIs to provide comprehensive market and competitive intelligence. Instead of relying on third-party providers with limited data, our project gives full control, deeper insights and automation. Databricks has helped us reliably poll public APIs in a scalable manner that fetches millions of events daily, then deduplicate the events and store them in our Delta tables. We specifically use Spark for parallel processing, Structured Streaming for real-time ingestion and deduplication, Delta tables for data reliability, and pools and jobs to keep our costs optimized. These technologies help us keep our data fresh, accurate and cost-effective. This data has helped our sales team with real-time intelligence, ensuring DigiCert's success.
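The dedup-on-ingest step described above (Structured Streaming dropping duplicates on an event key) can be illustrated with a plain-Python sketch, independent of Spark; the event shape and field names are hypothetical:

```python
def deduplicate(events):
    """Keep the first occurrence of each event id, mimicking the
    effect of Structured Streaming's dropDuplicates on a key column.
    (In Spark, a watermark bounds how long keys are held in state;
    this sketch simply keeps them all.)"""
    seen = set()
    unique = []
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            unique.append(event)
    return unique

# Hypothetical certificate-transparency events, with one duplicate delivery.
batch = [
    {"id": "ct-1", "domain": "a.example"},
    {"id": "ct-2", "domain": "b.example"},
    {"id": "ct-1", "domain": "a.example"},  # duplicate
]
print([e["id"] for e in deduplicate(batch)])  # ['ct-1', 'ct-2']
```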

Sponsored by: DataHub | Beyond the Lakehouse: Supercharging Databricks with Contextual Intelligence

2025-06-12
lightning_talk
Gabriel Lyons (DataHub)

While Databricks powers your data lakehouse, DataHub delivers the critical context layer connecting your entire ecosystem. We'll demonstrate how DataHub extends Unity Catalog to provide comprehensive metadata intelligence across platforms. With DataHub's real-time platform you can:
Cut AI model time-to-market with unified REST and GraphQL APIs that ensure models train on reliable, compliant data from across platforms, with complete lineage tracking
Decrease data incidents by 60% using an event-driven architecture that instantly propagates changes across systems
Transform data discovery from days to minutes with AI-powered search and natural language interfaces
Leaders use DataHub to transform Databricks data into integrated insights that drive business value. See our demo of syncback technology, which detects sensitive data and automatically enforces Databricks access controls, plus our AI assistant that enhances LLMs with cross-platform metadata.

Eliminate Hops in Your Streaming Architecture with Zerobus, Part of Lakeflow Connect

2025-06-12
talk
Victoria Bukta (Databricks), Nikola Obradovic (Databricks)

In this session, we’ll introduce the Zerobus Direct Write API, part of Lakeflow Connect, which enables you to push data directly to your lakehouse and simplify ingestion for IoT, clickstreams, telemetry, and more. We’ll start with an overview of the ingestion landscape to date. Then, we'll cover how you can “shift left” with Zerobus, embedding data ingestion into your operational systems to make analytics and AI a core component of the business, rather than an afterthought. The result is a significantly simpler architecture that scales your operations, using this new paradigm to skip unnecessary hops. We'll also highlight one of our early customers, Joby Aviation, and how they use Zerobus. Finally, we’ll provide a framework to help you understand when to use Zerobus versus other ingestion offerings—and we’ll wrap up with a live Q&A so that you can hit the ground running with your own use cases.

What’s New in Apache Spark™ 4.0?

2025-06-12
talk
Wenchen Fan (Databricks), Daniel Tenedorio (Databricks)

Join this session for a concise tour of Apache Spark™ 4.0’s most notable enhancements:
SQL features: ANSI by default, scripting, SQL pipe syntax, SQL UDFs, session variables, view schema evolution, etc.
Data types: VARIANT type, string collation
Python features: Python data source, plotting API, etc.
Streaming improvements: state store data source, state store checkpoint v2, arbitrary state v2, etc.
Spark Connect improvements: more API coverage, thin client, unified Scala interface, etc.
Infrastructure: better error messages, structured logging, new Java/Scala version support, etc.
Whether you’re a seasoned Spark user or new to the ecosystem, this talk will prepare you to leverage Spark 4.0’s latest innovations for modern data and AI pipelines.

Healthcare Interoperability: End-to-End Streaming FHIR Pipelines With Databricks & Redox

2025-06-12
talk
Tim Kessler (Redox, Inc.), Matthew Giglia (Databricks)

The direct integration between Redox and Databricks can streamline your interoperability workflows, from responding to preauthorization requests in record time to letting attending physicians know, in near real time from ADT feeds, about a change in a patient's risk of sepsis or readmission. Data engineers will learn how to create fully streaming ETL pipelines for ingesting, parsing and acting on insights from Redox FHIR bundles delivered directly to Unity Catalog volumes. Once the data is available in the lakehouse, AI/BI Dashboards and agentic frameworks help write FHIR messages back to Redox for direct pushdown to EMR systems. Parsing FHIR bundle resources has never been easier with SQL combined with the new VARIANT data type in Delta and streaming table creation against Serverless DBSQL Warehouses. We'll also use the Databricks accelerators dbignite and redoxwrite for writing and posting FHIR bundles back to Redox-integrated EMRs, and we'll extend AI/BI with Unity Catalog SQL UDFs and the Redox API for use in Genie.
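The VARIANT workflow this abstract describes amounts to path-based access into semi-structured FHIR JSON. The shape of that access can be sketched in plain Python; the bundle below is a hypothetical, heavily abbreviated stand-in for a real FHIR bundle, and `variant_get` here is a toy analogue of SQL path extraction, not the Databricks function itself:

```python
import json

def variant_get(doc, *path):
    """Toy analogue of SQL path access into a VARIANT column:
    walk a parsed JSON document by dict keys and list indexes."""
    for step in path:
        doc = doc[step]
    return doc

# Hypothetical, heavily abbreviated FHIR bundle.
bundle = json.loads("""
{"resourceType": "Bundle",
 "entry": [{"resource": {"resourceType": "Patient", "id": "p1"}}]}
""")

print(variant_get(bundle, "entry", 0, "resource", "resourceType"))  # Patient
```

In SQL against a VARIANT column, the equivalent extraction is a path expression over the stored bundle rather than an explicit loop.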

Introducing Simplified State Tracking in Apache Spark™ Structured Streaming

2025-06-12
lightning_talk
Craig Lukasik (Databricks)

This presentation will review the new change feed and snapshot capabilities in Apache Spark™ Structured Streaming’s State Reader API. The State Reader API enables users to access and analyze Structured Streaming's internal state data. Attendees will learn how to leverage the new features to debug, troubleshoot and analyze state changes efficiently, making streaming workloads easier to manage at scale.

Powering Secure and Scalable Data Governance at PepsiCo With Unity Catalog Open APIs

2025-06-12
talk
Dipankar Kushari (Databricks), Sudipta Das (PepsiCo)

PepsiCo, given its scale, has numerous teams leveraging different tools and engines to access data and perform analytics and AI. To streamline governance across this diverse ecosystem, PepsiCo unifies its data and AI assets under an open and enterprise-grade governance framework with Unity Catalog. In this session, we'll explore real-world examples of how PepsiCo extends Unity Catalog’s governance to all its data and AI assets, enabling secure collaboration even for teams outside Databricks. Learn how PepsiCo architects permissions using service principals and service accounts to authenticate with Unity Catalog, building a multi-engine architecture with seamless and open governance. Attendees will gain practical insights into designing a scalable, flexible data platform that unifies governance across all teams while embracing openness and interoperability.

Next-Gen Data Science: How Posit and Databricks Are Transforming Analytics at Scale

2025-06-11
lightning_talk
James Blair (Posit, PBC)

Modern data science teams face the challenge of navigating complex landscapes of languages, tools and infrastructure. Positron, Posit’s next-generation IDE, offers a powerful environment tailored for data science, seamlessly integrating with Databricks to empower teams working in Python and R. Now integrated within Posit Workbench, Positron enables data scientists to efficiently develop, iterate and analyze data with Databricks — all while maintaining their preferred workflows. In this session, we’ll explore how Python and R users can develop, deploy and scale their data science workflows by combining Posit tools with Databricks. We’ll showcase how Positron simplifies development for both Python and R and how Posit Connect enables seamless deployment of applications, reports and APIs powered by Databricks. Join us to see how Posit + Databricks create a frictionless, scalable and collaborative data science experience — so your teams can focus on insights, not infrastructure.

American Airlines Flies to New Heights with Data Intelligence

2025-06-11
talk
Saimahesh Chava (American Airlines), Yash Joshi (American Airlines)

American Airlines migrated from Hive Metastore to Unity Catalog using automated processes with Databricks APIs and GitHub Actions. This automation streamlined the migration for many applications within AA, ensuring consistency, efficiency and minimal disruption while enhancing data governance and disaster recovery capabilities.

Building Tool-Calling Agents With Databricks Agent Framework and MCP

2025-06-11
talk
Siddharth Murching (Databricks), Elise Gonzales (Databricks)

Want to create AI agents that can do more than just generate text? Join us to explore how combining Databricks' Mosaic AI Agent Framework with the Model Context Protocol (MCP) unlocks powerful tool-calling capabilities. We'll show you how MCP provides a standardized way for AI agents to interact with external tools, data and APIs, solving the headache of fragmented integration approaches. Learn to build agents that can retrieve both structured and unstructured data, execute custom code and tackle real enterprise challenges. Key takeaways:
Implementing MCP-enabled tool-calling in your AI agents
Prototyping in AI Playground and exporting for deployment
Integrating Unity Catalog functions as agent tools
Ensuring governance and security for enterprise deployments
Whether you're building customer service bots or data analysis assistants, you'll leave with practical know-how to create powerful, governed AI agents.

Extending the Lakehouse: Power Interoperable Compute With Unity Catalog Open APIs

2025-06-11
talk
Tathagata Das (Databricks), Michelle Leon (Databricks)

The lakehouse is built for storage flexibility, but what about compute? In this session, we’ll explore how Unity Catalog enables you to connect and govern multiple compute engines across your data ecosystem. With open APIs and support for the Iceberg REST Catalog, UC lets you extend access to engines like Trino, DuckDB, and Flink while maintaining centralized security, lineage, and interoperability. Through some exciting demos, we'll show how you can get started today with engines like Apache Spark and Starburst reading and writing to UC managed tables. Learn how to bring flexibility to your compute layer without compromising control.

Intuit's Privacy-Safe Lending Marketplace: Leveraging Databricks Clean Rooms

2025-06-11
talk
Anurag Malik (Intuit Inc.)

Intuit leverages Databricks Clean Rooms to create a secure, privacy-safe lending marketplace, enabling small business lending partners to perform analytics and deploy ML/AI workflows on sensitive data assets. This session explores the technical foundations of building isolated clean rooms across multiple partners and cloud providers, differentiating Databricks Clean Rooms from market alternatives. We'll demonstrate our automated approach to clean room lifecycle management using APIs, covering creation, collaborator onboarding, data asset sharing, workflow orchestration and activity auditing. The integration with Unity Catalog for managing clean room inputs and outputs will also be discussed. Attendees will gain insights into harnessing collaborative ML/AI potential, supporting various languages and workloads, and enabling complex computations without compromising sensitive information in Clean Rooms.

Apache Spark — Ask Us Anything

2025-06-11
lightning_talk
Allison Wang (Databricks), Jules Damji (Databricks), DB Tsai (Databricks)

Join us for an interactive Ask Me Anything (AMA) session on the latest advancements in Apache Spark 4, including Spark Connect — the new client-server architecture enabling seamless integration with IDEs, notebooks and custom applications. Learn about performance improvements, enhanced APIs and best practices for leveraging Spark’s next-generation features. Whether you're a data engineer, Spark developer or big data enthusiast, bring your questions on architecture, real-world use cases and how these innovations can optimize your workflows. Don’t miss this chance to dive deep into the future of distributed computing with Spark!

Sponsored by: Boomi, LP | From Pipelines to Agents: Manage Data and AI on One Platform for Maximum ROI

2025-06-11
talk

In the age of agentic AI, competitive advantage lies not only in AI models, but in the quality of the data agents reason on and the agility of the tools that feed them. To fully realize the ROI of agentic AI, organizations need a platform that enables high-quality data pipelines and provides scalable, enterprise-grade tools. In this session, discover how a unified platform for integration, data management, MCP server management, API management, and agent orchestration can help you to bring cohesion and control to how data and agents are used across your organization.

Sponsored by: EY | Unlocking Value Through AI at Takeda Pharmaceuticals

2025-06-11
lightning_talk
Naveed Afzal (Takeda)

In the rapidly evolving landscape of pharmaceuticals, the integration of AI and GenAI is transforming how organizations operate and deliver value. We will explore the profound impact of the AI program at Takeda Pharmaceuticals and the central role of Databricks. We will delve into eight pivotal AI/GenAI use cases that enhance operational efficiency across commercial, R&D, manufacturing, and back-office functions, including these capabilities:
Responsible AI Guardrails: scanners that validate and enforce responsible AI controls on GenAI solutions
Reusable Databricks Native Vectorization Pipeline: a scalable solution enhancing data processing with quality and governance
One-Click Deployable RAG Pattern: simplifying deployment for AI applications, enabling rapid experimentation and innovation
AI Asset Registry: a repository for foundational models, vector stores, and APIs, promoting reuse and collaboration

Breaking Barriers: Building Custom Spark 4.0 Data Connectors with Python

2025-06-11
talk
Sourav Gulati (Databricks), Ashish Saraswat (Databricks)

Building a custom Spark data source connector once required Java or Scala expertise, making it complex and limiting. This left many proprietary data sources without public SDKs disconnected from Spark. Additionally, data sources with Python SDKs couldn't harness Spark’s distributed power. Spark 4.0 changes this with a new Python API for data source connectors, allowing developers to build fully functional connectors without Java or Scala. This unlocks new possibilities, from integrating proprietary systems to leveraging untapped data sources. Supporting both batch and streaming, this API makes data ingestion more flexible than ever. In this talk, we’ll demonstrate how to build a Spark connector for Excel using Python, showcasing schema inference, data reads/writes and streaming support. Whether you're a data engineer or Spark enthusiast, you’ll gain the knowledge to integrate Spark with any data source — entirely in Python.
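The connector model described here follows a partition-then-read shape: the source reports its partitions, and Spark invokes a reader per partition. A plain-Python sketch of that shape, independent of Spark; the class and workbook below are hypothetical illustrations that mirror the pattern of the Python Data Source API's `partitions()`/`read(partition)` contract, not its exact signatures:

```python
class ExcelSheetReader:
    """Toy stand-in for a per-source reader: each sheet becomes one
    partition, and read(partition) yields rows as tuples, the same
    general shape the Spark 4.0 Python Data Source API expects from
    a reader's partitions() and read(partition) methods."""

    def __init__(self, workbook):
        self.workbook = workbook  # {sheet_name: [row, ...]}

    def partitions(self):
        # One partition per sheet, so sheets can be read in parallel.
        return list(self.workbook)

    def read(self, partition):
        # Yield the rows of a single partition (one sheet).
        for row in self.workbook[partition]:
            yield tuple(row)

# Hypothetical two-sheet workbook.
wb = {"Q1": [["jan", 100], ["feb", 120]], "Q2": [["apr", 90]]}
reader = ExcelSheetReader(wb)
rows = [r for p in reader.partitions() for r in reader.read(p)]
print(rows)  # [('jan', 100), ('feb', 120), ('apr', 90)]
```

In a real connector, a `DataSource` subclass would declare the schema and hand Spark a reader like this one, and Spark would distribute the partition reads across executors.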

Sponsored by: Google Cloud | Building Powerful Agentic Ecosystems with Google Cloud's A2A

2025-06-11
talk
Sivakumar Nagapandi (SAP), Naveen Punjabi (Google Cloud), Matt Kixmoeller (Glean), Sean Falconer (Confluent)

This session unveils Google Cloud's Agent2Agent (A2A) protocol, ushering in a new era of AI interoperability where diverse agents collaborate seamlessly to solve complex enterprise challenges. Join our panel of experts to discover how A2A empowers you to deeply integrate these collaborative AI systems with your existing enterprise data, custom APIs, and critical workflows. Ultimately, learn to build more powerful, versatile, and securely managed agentic ecosystems by combining specialized Google-built agents with your own custom solutions (Vertex AI or no-code). Extend this ecosystem further by serving these agents with Databricks Model Serving and governing them with Unity Catalog for consistent security and management across your enterprise.

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

2025-06-11
talk
DB Tsai (Databricks), Xiao Li (Databricks)

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

2025-06-11
lightning_talk
Allison Wang (Databricks), Lu Qiu (LanceDB)

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format. Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large record-size data (e.g., images, tensors, embeddings, etc), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics to the level of SQL. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.

Sponsored by: Fivetran | Scalable Data Ingestion: Building custom pipelines with the Fivetran Connector SDK and Databricks

2025-06-11
talk
Kelly Kohlleffel (Fivetran), CL Abeel (Fivetran)

Organizations have hundreds of data sources, some of which are very niche or difficult to access. Incorporating this data into your lakehouse requires significant time and resources, hindering your ability to work on more value-add projects. Enter the Fivetran Connector SDK: a powerful new tool that enables your team to create custom pipelines for niche systems, custom APIs, and sources with specific data filtering requirements, seamlessly integrating with Databricks. During this session, Fivetran will demonstrate how to (1) leverage the Connector SDK to build scalable connectors, enabling the ingestion of diverse data into Databricks, (2) gain flexibility and control over historical and incremental syncs, delete capture, state management, multithreaded data extraction, and custom schemas, and (3) use practical examples, code snippets, and architectural considerations to overcome data integration challenges and unlock the full potential of your Databricks environment.

What’s New in PySpark: TVFs, Subqueries, Plots, and Profilers

2025-06-11
talk
Takuya Ueshin (Databricks), Xinrong Meng (Databricks)

PySpark’s DataFrame API is evolving to support more expressive and modular workflows. In this session, we’ll introduce two powerful additions: table-valued functions (TVFs) and the new subquery API. You’ll learn how to define custom TVFs using Python User-Defined Table Functions (UDTFs), including support for polymorphism, and how subqueries can simplify complex logic. We’ll also explore how lateral joins connect these features, followed by practical tools for the PySpark developer experience—such as plotting, profiling, and a preview of upcoming capabilities like UDF logging and a Python-native data source API. Whether you're building production pipelines or extending PySpark itself, this talk will help you take full advantage of the latest features in the PySpark ecosystem.
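A table-valued function, as described above, is simply a function that returns rows rather than a single value; PySpark's Python UDTFs express this as a class whose `eval` method yields tuples. A plain-Python sketch of that `eval` contract, with no Spark session involved; the class is a hypothetical, simplified mirror of the pattern, not decorated or registered as a real UDTF:

```python
class SplitWords:
    """Mimics the eval() contract of a PySpark Python UDTF:
    for each input row, yield zero or more output rows as tuples."""

    def eval(self, text: str):
        for position, word in enumerate(text.split()):
            yield (word, position)

# Apply the "UDTF" to each input row and flatten the results,
# which is roughly what a lateral join against a TVF produces.
inputs = ["hello world", "spark"]
rows = [row for text in inputs for row in SplitWords().eval(text)]
print(rows)  # [('hello', 0), ('world', 1), ('spark', 0)]
```

In PySpark, the same class (plus a declared output schema) would be registered with the UDTF machinery and invoked from the DataFrame API or SQL, with Spark handling the per-row expansion.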