API

Sponsored by: Google Cloud | Building Powerful Agentic Ecosystems with Google Cloud's A2A

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

2025-06-11 · Data + AI Summit 2025 Watch

talk

by DB Tsai (Databricks) , Xiao Li (Databricks)

Analytics Data Quality ETL/ELT PySpark Python Spark

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by LU QIU (LanceDB) , Allison Wang (Databricks)

AI/ML Analytics Big Data Data Analytics Lance PySpark Python Spark SQL

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format. Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large record-size data (e.g., images, tensors, embeddings, etc), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics to the level of SQL. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.

Sponsored by: Fivetran | Scalable Data Ingestion: Building custom pipelines with the Fivetran Connector SDK and Databricks

What’s New in PySpark: TVFs, Subqueries, Plots, and Profilers

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Takuya Ueshin (Databricks) , Xinrong Meng (Databricks)

PySpark Python

PySpark’s DataFrame API is evolving to support more expressive and modular workflows. In this session, we’ll introduce two powerful additions: table-valued functions (TVFs) and the new subquery API. You’ll learn how to define custom TVFs using Python User-Defined Table Functions (UDTFs), including support for polymorphism, and how subqueries can simplify complex logic. We’ll also explore how lateral joins connect these features, followed by practical tools for the PySpark developer experience—such as plotting, profiling, and a preview of upcoming capabilities like UDF logging and a Python-native data source API. Whether you're building production pipelines or extending PySpark itself, this talk will help you take full advantage of the latest features in the PySpark ecosystem.

From Imperative to Declarative Paradigm: Rebuilding a CI/CD Infrastructure Using Hatch and DABs

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Luigi Di Tacchio (FreeWheel, a Comcast Company)

CI/CD Databricks PySpark

Building and deploying Pyspark pipelines to Databricks should be effortless. However, our team at FreeWheel has, for the longest time, struggled with a convoluted and hard-to-maintain CI/CD infrastructure. It followed an imperative paradigm, demanding that every project implement custom scripts to build artifacts and deploy resources, and resulting in redundant boilerplate code and awkward interactions with the Databricks REST API. We set our mind on rebuilding it from scratch, following a declarative paradigm instead. We will share how we were able to eliminate thousands of lines of code from our repository, create a fully configuration-driven infrastructure where projects can be easily onboarded, and improve the quality of our codebase using Hatch and Databricks Asset Bundles as our tools of choice. In particular, DAB has made deploying across our 3 environments a breeze, and has allowed us to quickly adopt new features as soon as they are released by Databricks.

Real-Time Analytics Pipeline for IoT Device Monitoring and Reporting

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Nayan Sharma (CKDelta) , Padraic Kirrane (CK Delta)

Analytics Cloud Computing Cloud Storage Data Quality Databricks IoT Data Streaming

This session will show how we implemented a solution to support high-frequency data ingestion from smart meters. We implemented a robust API endpoint that interfaces directly with IoT devices. This API processes messages in real time from millions of distributed IoT devices and meters across the network. The architecture leverages cloud storage as a landing zone for the raw data, followed by a streaming pipeline built on Lakeflow Declarative Pipelines. This pipeline implements a multi-layer medallion architecture to progressively clean, transform and enrich the data. The pipeline operates continuously to maintain near real-time data freshness in our gold layer tables. These datasets connect directly to Databricks Dashboards, providing stakeholders with immediate insights into their operational metrics. This solution demonstrates how modern data architecture can handle high-volume IoT data streams while maintaining data quality and providing accessible real-time analytics for business users.

Simplify Data Ingest and Egress with the New Python Data Source API

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Craig Lukasik (Databricks)

Data Engineering Databricks Python Spark

Data engineering teams are frequently tasked with building bespoke ingest and/or egress solutions for myriad custom, proprietary, or industry-specific data sources or sinks. Many teams find this work cumbersome and time-consuming. Recognizing these challenges, Databricks interviewed numerous companies across different industries to better understand their diverse data integration needs. This comprehensive feedback led us to develop the Python Data Source API for Apache Spark™.

Turn Genie Into an Agent Using Conversation APIs

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Prithvi Kannan (Databricks) , Hanlin Sun (Databricks)

AI/ML Analytics BI SQL

Transform your AI/BI Genie into a text-to-SQL powerhouse using the Genie Conversation APIs. This session explores how Genie functions as an intelligent agent, translating natural language queries into SQL to accelerate insights and enhance self-service analytics. You'll learn practical techniques for configuring agents, optimizing queries and handling errors — ensuring Genie delivers accurate, relevant responses in real time. A must-attend for teams looking to level up their AI/BI capabilities and deliver smarter analytics experiences.

The Future of DSv2 in Apache Spark™

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Anton Okolnychyi (Databricks)

Spark

DSv2, Spark's next-generation Catalog API, is gaining traction among data source developers. It shifts complexity to Apache Spark™, improves connector reliability and unlocks new functionality such as catalog federation, MERGE operations, storage-partitioned joins, aggregate pushdown, stored procedures and more. This session covers the design of DSv2, current strengths and gaps and its evolving roadmap. It's intended for Spark users and developers working with data sources, whether custom-built or off-the-shelf.

Unified Advanced Analytics: Integrating Power BI and Databricks Genie for Real-time Insights

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Justin Ward (TurnPoint Services) , Edelweiss Kammermann (IT Convergence)

Analytics Azure BI Dashboard Databricks Power BI

In today’s data-driven landscape, business users expect seamless, interactive analytics without having to switch between different environments. This presentation explores our web application that unifies a Power BI dashboard with Databricks Genie, allowing users to query and visualize insights from the same dataset within a single, cohesive interface. We will compare two integration strategies: one that leverages a traditional webpage enhanced by an Azure bot to incorporate Genie’s capabilities, and another that utilizes Databricks Apps to deliver a smoother, native experience. We use the Genie API to build this solution. Attendees will learn the architecture behind these solutions, key design considerations and challenges encountered during implementation. Join us to see live demos of both approaches, and discover best practices for delivering an all-in-one, interactive analytics experience.

Kernel, Catalog, Action! Reimagining our Delta-Spark Connector with DSv2

2025-06-10 · Data + AI Summit 2025 Watch

lightning_talk

by Scott Sandre (Databricks)

Data Lakehouse Delta Spark

Delta Lake is redesigning its Spark connector through the combination of three key technologies: First, we're updating our Spark APIs to DSv2 to achieve deeper catalog integration and improved integration with the Spark optimizer. Second, we're fully integrating on top of Delta Kernel to take advantage of its simplified abstraction of Delta protocol complexities, accelerating feature adoption and improving maintainability. Third, we are transforming Delta to become a catalog-aware lakehouse format with Catalog Commits, enabling more efficient metadata management, governance and query performance. Join us to explore how we're advancing Delta Lake's architecture, pushing the boundaries of metadata management and creating a more intelligent, performant data lakehouse platform.

Cloud-to-Cloud Data Sharing by Walmart: Direct Access to Omni-Channel Sales Data With Delta Sharing

2025-06-10 · Data + AI Summit 2025

talk

by Roberto Robles Nacif (Walmart Data Ventures) , Ajay Bhonsule (Walmart Inc.)

Analytics Cloud Computing Databricks Delta Omni

As first-party data becomes increasingly invaluable to organizations, Walmart Data Ventures is dedicated to bringing to life new applications of Walmart’s first-party data to better serve its customers. Through Scintilla, its integrated insights ecosystem, Walmart Data Ventures continues to expand its offerings to deliver insights and analytics that drive collaboration between our merchants, suppliers, and operators.Scintilla users can now access Walmart data using Cloud Feeds, based on Databricks Delta Sharing technologies. In the past, Walmart used API-based data sharing models, which required users to possess certain skills and technical attributes that weren’t always available. Now, with Cloud Feeds, Scintilla users can more easily access data without a dedicated technical team behind the scenes making it happen. Attendees will gain valuable insights into how Walmart has built its robust data sharing architecture and strategies to design scalable and collaborative data sharing architectures in their own organizations.

Delta Kernel for Rust and Java

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Nick Lanham (Databricks)

AI/ML C#/.NET ClickHouse Delta DuckDB Java Rust

Delta Kernel makes it easy for engines and connectors to read and write Delta tables. It supports many Delta features and robust connectors, including DuckDB, Clickhouse, Spice AI and delta-dotnet. In this session, we'll cover lessons learned about how to build a high-performance library that lets engines integrate the way they want, while not having to worry about the details of the Delta protocol. We'll talk through how we streamlined the API as well as its changes and underlying motivations. We'll discuss some new highlight features like write support, and the ability to do CDF scans. Finally we'll cover the future roadmap for the Kernel project and what you can expect from the project over the coming year.

Gaining Insight From Image Data in Databricks Using Multi-Modal Foundation Model API

2025-06-10 · Data + AI Summit 2025 Watch

lightning_talk

by Ankit Mathur (Databricks)

Data Lakehouse Databricks LLM

Unlock the hidden potential in your image data without specialized computer vision expertise! This session explores how to leverage Databricks' multi-modal Foundation Model APIs to analyze, classify and extract insights from visual content. Learn how Databricks provides a unified API to understand images using powerful foundation models within your data workflows. Key takeaways: Implementing efficient workflows for image data processing within your Databricks lakehouse Understanding multi-modal foundation models for image understanding Integrating image analysis with other data types for business insights Using OpenAI-compatible APIs to query multi-modal models Building end-to-end pipelines from image ingestion to model deployment Whether analyzing product images, processing visual documents or building content moderation systems, you'll discover how to extract valuable insights from your image data within the Databricks ecosystem.

Automated Deployment with Databricks Asset Bundles

2025-06-10 · Data + AI Summit 2025

talk

CI/CD Data Engineering Databricks DataOps Delta DevOps GitHub Spark

This course provides a comprehensive review of DevOps principles and their application to Databricks projects. It begins with an overview of core DevOps, DataOps, continuous integration (CI), continuous deployment (CD), and testing, and explores how these principles can be applied to data engineering pipelines. The course then focuses on continuous deployment within the CI/CD process, examining tools like the Databricks REST API, SDK, and CLI for project deployment. You will learn about Databricks Asset Bundles (DABs) and how they fit into the CI/CD process. You’ll dive into their key components, folder structure, and how they streamline deployment across various target environments in Databricks. You will also learn how to add variables, modify, validate, deploy, and execute Databricks Asset Bundles for multiple environments with different configurations using the Databricks CLI. Finally, the course introduces Visual Studio Code as an Interactive Development Environment (IDE) for building, testing, and deploying Databricks Asset Bundles locally, optimizing your development process. The course concludes with an introduction to automating deployment pipelines using GitHub Actions to enhance the CI/CD workflow with Databricks Asset Bundles. By the end of this course, you will be equipped to automate Databricks project deployments with Databricks Asset Bundles, improving efficiency through DevOps practices. Pre-requisites: Strong knowledge of the Databricks platform, including experience with Databricks Workspaces, Apache Spark, Delta Lake, the Medallion Architecture, Unity Catalog, Delta Live Tables, and Workflows. In particular, knowledge of leveraging Expectations with Lakeflow Declarative Pipelines. Labs : Yes Certification Path: Databricks Certified Data Engineer Professional

Lakeflow Declarative Pipelines Integrations and Interoperability: Get Data From — and to — Anywhere

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Ryan Nienhuis (Databricks)

Azure Cosmos Data Lakehouse Delta ETL/ELT Kafka MongoDB Python Spark

This session is repeated.In this session, you will learn how to integrate Lakeflow Declarative Pipelines with external systems in order to ingest and send data virtually anywhere. Lakeflow Declarative Pipelines is most often used in ingestion and ETL into the Lakehouse. New Lakeflow Declarative Pipelines capabilities like the Lakeflow Declarative Pipelines Sinks API and added support for Python Data Source and ForEachBatch have opened up Lakeflow Declarative Pipelines to support almost any integration. This includes popular Apache Spark™ integrations like JDBC, Kafka, External and managed Delta tables, Azure CosmosDB, MongoDB and more.

Spark Connect: Flexible, Local Access to Apache Spark at Scale

2025-06-10 · Data + AI Summit 2025 Watch

talk

by James Malone (Databricks)

Spark

What if you could run Spark jobs without worrying about clusters, versions and upgrades? Did you know Spark has this functionality built-in today? Join us to take a look at this functionality — Spark Connect. Join us to dig into how Spark Connect works — abstracting away Spark clusters away in favor of the DataFrame API and unresolved logical plans. You will learn some of the cool things Spark Connect unlocks, including: Moving you from thinking about clusters to just thinking about jobs Making Spark code more portable and platform agnostic Enabling support for languages such as Go

Chaos to Clarity: Secure, Scalable, and Governed SaaS Ingestion through Lakeflow Connect and more

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Krishna Bhupatiraju (Databricks) , Prashant Gupta (Databricks)

Databricks SaaS

Ingesting data from SaaS systems sounds straightforward—until you hit API limits, miss SLAs, or accidentally ingest PII. Sound familiar? In this talk, we’ll share how Databricks evolved from scrappy ingestion scripts to a unified, secure, and scalable ingestion platform. Along the way, we’ll highlight the hard lessons, the surprising pitfalls, and the tools that helped us level up. Whether you’re just starting to wrangle third-party data or looking to scale while handling governance and compliance, this session will help you think beyond pipelines and toward platform thinking.

Crafting Business Brilliance: Leveraging Databricks SQL for Next-Gen Applications

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Mohammad Shalchi (Haleon) , Wasim Ahmad (Databricks)

Data Lakehouse Databricks GenAI SAP SQL

At Haleon, we've leveraged Databricks APIs and serverless compute to develop customer-facing applications for our business. This innovative solution enables us to efficiently deliver SAP invoice and order management data through front-end applications developed and served via our API Gateway. The Databricks lakehouse architecture has been instrumental in eliminating the friction associated with directly accessing SAP data from operational systems, while enhancing our performance capabilities. Our system acheived response times of less than 3 seconds from API call, with ongoing efforts to optimise this performance. This architecture not only streamlines our data and application ecosystem but also paves the way for integrating GenAI capabilities with robust governance measures for our future infrastructure. The implementation of this solution has yielded significant benefits, including a 15% reduction in customer service costs and a 28% increase in productivity for our customer support team.

talk-data.com

Activity Trend

Top Events

Top Speakers

Sponsored by: Google Cloud | Building Powerful Agentic Ecosystems with Google Cloud's A2A

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

Sponsored by: Fivetran | Scalable Data Ingestion: Building custom pipelines with the Fivetran Connector SDK and Databricks

What’s New in PySpark: TVFs, Subqueries, Plots, and Profilers

From Imperative to Declarative Paradigm: Rebuilding a CI/CD Infrastructure Using Hatch and DABs

Real-Time Analytics Pipeline for IoT Device Monitoring and Reporting

Simplify Data Ingest and Egress with the New Python Data Source API

Turn Genie Into an Agent Using Conversation APIs

The Future of DSv2 in Apache Spark™

Unified Advanced Analytics: Integrating Power BI and Databricks Genie for Real-time Insights

Kernel, Catalog, Action! Reimagining our Delta-Spark Connector with DSv2

Cloud-to-Cloud Data Sharing by Walmart: Direct Access to Omni-Channel Sales Data With Delta Sharing

Delta Kernel for Rust and Java

Gaining Insight From Image Data in Databricks Using Multi-Modal Foundation Model API

Automated Deployment with Databricks Asset Bundles

Lakeflow Declarative Pipelines Integrations and Interoperability: Get Data From — and to — Anywhere

Spark Connect: Flexible, Local Access to Apache Spark at Scale

Chaos to Clarity: Secure, Scalable, and Governed SaaS Ingestion through Lakeflow Connect and more

Crafting Business Brilliance: Leveraging Databricks SQL for Next-Gen Applications