Python

Sponsored by: Redpanda | IoT for Fun & Prophet: Scaling IoT and predicting the future with Redpanda, Iceberg & Prophet

What’s New in Apache Spark™ 4.0?

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Daniel Tenedorio (Databricks), Wenchen Fan (Databricks)

AI/ML API Java Scala Spark SQL Data Streaming

Join this session for a concise tour of Apache Spark™ 4.0’s most notable enhancements:
- SQL features: ANSI by default, scripting, SQL pipe syntax, SQL UDF, session variable, view schema evolution, etc.
- Data type: VARIANT type, string collation
- Python features: Python data source, plotting API, etc.
- Streaming improvements: State store data source, state store checkpoint v2, arbitrary state v2, etc.
- Spark Connect improvements: More API coverage, thin client, unified Scala interface, etc.
- Infrastructure: Better error message, structured logging, new Java/Scala version support, etc.
Whether you’re a seasoned Spark user or new to the ecosystem, this talk will prepare you to leverage Spark 4.0’s latest innovations for modern data and AI pipelines.

Using Delta-rs and Delta-Kernel-rs to Serve CDC Feeds

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Stephen Carman (Databricks), Oussama Saoudi (Databricks)

Databricks Delta Rust Spark

Change Data Feeds (CDFs) are a common tool for synchronizing changes between tables and performing data processing in a scalable fashion. Serverless architectures offer a compelling solution for organizations looking to avoid the complexity of managing infrastructure. But how can you bring CDFs into a serverless environment? In this session, we'll explore how to integrate Change Data Feeds into serverless architectures using Delta-rs and Delta-kernel-rs, open-source projects that allow you to read Delta tables and their change data feeds in Rust or Python. We’ll demonstrate how to use these tools with Lakestore’s serverless platform to easily stream and process changes. You’ll learn how to:
- Leverage Delta tables and CDFs in serverless environments
- Utilize Databricks and Unity Catalog without needing Apache Spark

Next-Gen Data Science: How Posit and Databricks Are Transforming Analytics at Scale

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by James Blair (Posit, PBC)

Analytics API Data Science Databricks

Modern data science teams face the challenge of navigating complex landscapes of languages, tools and infrastructure. Positron, Posit’s next-generation IDE, offers a powerful environment tailored for data science, seamlessly integrating with Databricks to empower teams working in Python and R. Now integrated within Posit Workbench, Positron enables data scientists to efficiently develop, iterate and analyze data with Databricks — all while maintaining their preferred workflows. In this session, we’ll explore how Python and R users can develop, deploy and scale their data science workflows by combining Posit tools with Databricks. We’ll showcase how Positron simplifies development for both Python and R and how Posit Connect enables seamless deployment of applications, reports and APIs powered by Databricks. Join us to see how Posit + Databricks create a frictionless, scalable and collaborative data science experience — so your teams can focus on insights, not infrastructure.

Hands-On Learning: AI Agents Workshop: Create, Evaluate, and Deploy using Mosaic AI

2025-06-11 · Data + AI Summit 2025

workshop

by Amber Roberts (Databricks), Nicolas Pelaez (Databricks)

AI/ML Databricks SQL

Looking for a practical workshop on building an AI Agent on Databricks? Well, we have just the thing for you. This hands-on workshop takes you through the process of creating intelligent agents that can reason their way to useful outcomes. You'll start by building your own toolkit of SQL and Python functions that give your agent practical capabilities. Then we'll explore how to select the right foundation model for your needs, connect your custom tools, and watch as your agent tackles complex challenges through visible reasoning paths. The workshop doesn't just stop at building: you'll dive into evaluation techniques using evaluation datasets to identify where your agent shines and where it needs improvement. After implementing and measuring your changes, we'll explore deployment strategies, including a feedback collection interface that enables continuous improvement and governance mechanisms to ensure responsible AI usage in production environments.

The Full Stack of Innovation: Building Data and AI Products With Databricks Apps

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Giran Moodley (Databricks), Ivan Trusov (Databricks)

AI/ML CI/CD Databricks JavaScript

In this deep-dive technical session, Ivan Trusov (Sr. SSA @ Databricks) and Giran Moodley (SA @ Databricks) will explore the full-stack development of Databricks Apps, covering everything from frameworks to deployment. We’ll walk through essential topics, including:
- Frameworks & tooling: Pythonic (Dash, Streamlit, Gradio) vs. JS + Python stack
- Development lifecycle: debugging, issue resolution and best practices
- Testing: unit, integration and load testing strategies
- CI/CD & deployment: automating with Databricks Asset Bundles
- Monitoring & observability: OpenTelemetry, metrics collection and analysis
Expect a highly practical session with several live demos, showcasing the development loop, testing workflows and CI/CD automation. Whether you’re building internal tools or AI-powered products, this talk will equip you with the knowledge to ship robust, scalable Databricks Apps.

Breaking Barriers: Building Custom Spark 4.0 Data Connectors with Python

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Sourav Gulati (Databricks), Ashish Saraswat (Databricks)

API Java Scala Spark Data Streaming

Building a custom Spark data source connector once required Java or Scala expertise, making it complex and limiting. This left many proprietary data sources without public SDKs disconnected from Spark. Additionally, data sources with Python SDKs couldn't harness Spark’s distributed power. Spark 4.0 changes this with a new Python API for data source connectors, allowing developers to build fully functional connectors without Java or Scala. This unlocks new possibilities, from integrating proprietary systems to leveraging untapped data sources. Supporting both batch and streaming, this API makes data ingestion more flexible than ever. In this talk, we’ll demonstrate how to build a Spark connector for Excel using Python, showcasing schema inference, data reads/writes and streaming support. Whether you're a data engineer or Spark enthusiast, you’ll gain the knowledge to integrate Spark with any data source — entirely in Python.
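To give a feel for the connector shape this abstract describes, here is a minimal sketch of a batch-only Python data source. It is not the Excel connector from the talk: the `fake_sensor` format name, both class names and the `numRows` option are hypothetical, chosen purely for illustration. The import is guarded so the row-producing logic can be exercised even where PySpark is not installed.

```python
# Illustrative sketch only: "fake_sensor", FakeSensorReader, FakeSensorDataSource
# and the "numRows" option are hypothetical names, not part of Spark or the talk.
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Fallback so the read logic below can run without PySpark installed.
    DataSource = DataSourceReader = object


class FakeSensorReader(DataSourceReader):
    """Produces rows for a made-up in-memory source."""

    def __init__(self, options):
        # Spark passes options as strings, so convert explicitly.
        self.num_rows = int(options.get("numRows", 3))

    def read(self, partition):
        # Yield one tuple per row, in the column order of the schema below.
        for i in range(self.num_rows):
            yield (i, f"sensor-{i}")


class FakeSensorDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...).
        return "fake_sensor"

    def schema(self):
        # Schema expressed as a DDL string.
        return "id int, label string"

    def reader(self, schema):
        # self.options is populated by the PySpark DataSource base class.
        return FakeSensorReader(self.options)
```

With a Spark 4.0 session available, calling `spark.dataSource.register(FakeSensorDataSource)` and then `spark.read.format("fake_sensor").load()` should expose these rows as a DataFrame. Streaming and write support need additional reader and writer classes that this sketch leaves out.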

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

2025-06-11 · Data + AI Summit 2025 Watch

talk

by DB Tsai (Databricks), Xiao Li (Databricks)

Analytics API Data Quality ETL/ELT PySpark Spark

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by LU QIU (LanceDB), Allison Wang (Databricks)

AI/ML Analytics API Big Data Data Analytics Lance PySpark Spark SQL

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multimodal Lance format. Lance offers zero-copy schema evolution and robust support for large records (e.g., images, tensors and embeddings), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics at the level of SQL. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.

Using Clean Rooms for Privacy-Centric Data Collaboration

2025-06-11 · Data + AI Summit 2025 Watch

talk

by DJ Sharkey (Databricks), Nikhil Gaekwad (Databricks)

AI/ML Analytics Databricks Delta SQL

Databricks Clean Rooms make privacy-safe collaboration possible for data, analytics, and AI across clouds and platforms. Built on Delta Sharing, Clean Rooms enable organizations to securely share and analyze data together in a governed, isolated environment, without ever exposing raw data. In this session, you’ll learn how to get started with Databricks Clean Rooms and unlock advanced use cases including:
- Cross-platform collaboration and joint analytics
- Training machine learning and AI models
- Enforcing custom privacy policies
- Analyzing unstructured data
- Incorporating proprietary libraries in Python and SQL notebooks
- Auditing clean room activity for compliance
Whether you're a data scientist, engineer or data leader, this session will equip you to drive high-value collaboration while maintaining full control over data privacy and governance.

What’s New in PySpark: TVFs, Subqueries, Plots, and Profilers

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Takuya Ueshin (Databricks), Xinrong Meng (Databricks)

API PySpark

PySpark’s DataFrame API is evolving to support more expressive and modular workflows. In this session, we’ll introduce two powerful additions: table-valued functions (TVFs) and the new subquery API. You’ll learn how to define custom TVFs using Python User-Defined Table Functions (UDTFs), including support for polymorphism, and how subqueries can simplify complex logic. We’ll also explore how lateral joins connect these features, followed by practical tools for the PySpark developer experience—such as plotting, profiling, and a preview of upcoming capabilities like UDF logging and a Python-native data source API. Whether you're building production pipelines or extending PySpark itself, this talk will help you take full advantage of the latest features in the PySpark ecosystem.
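As a rough illustration of the Python UDTF idea mentioned above, the sketch below defines a handler class whose `eval` method yields one output row per word. The class name `SplitWords`, the helper `as_spark_udtf` and the output column names are hypothetical. The PySpark import lives inside the helper so the handler itself can be run as plain Python.

```python
class SplitWords:
    """Hypothetical table-function handler: one string in, one row per word out."""

    def eval(self, text: str):
        # Each yielded tuple becomes one output row: (word, length).
        for word in text.split():
            yield (word, len(word))


def as_spark_udtf():
    """Wrap the handler as a PySpark UDTF (requires PySpark 3.5 or later)."""
    from pyspark.sql.functions import udtf

    # The return schema is given as a DDL string matching the yielded tuples.
    return udtf(SplitWords, returnType="word: string, length: int")
```

Assuming an active session, `spark.udtf.register("split_words", as_spark_udtf())` should make the function callable from SQL, for example `SELECT * FROM split_words('hello big world')`. The polymorphism and lateral-join usage covered in the talk are not shown here.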

Simplify Data Ingest and Egress with the New Python Data Source API

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Craig Lukasik (Databricks)

API Data Engineering Databricks Spark

Data engineering teams are frequently tasked with building bespoke ingest and/or egress solutions for myriad custom, proprietary, or industry-specific data sources or sinks. Many teams find this work cumbersome and time-consuming. Recognizing these challenges, Databricks interviewed numerous companies across different industries to better understand their diverse data integration needs. This comprehensive feedback led us to develop the Python Data Source API for Apache Spark™.

Delta-rs Turning Five: Growing Pains and Life Lessons

2025-06-10 · Data + AI Summit 2025 Watch

lightning_talk

by Robert Pack (Databricks)

Delta Rust

Five years ago, the delta-rs project embarked on a journey to bring Delta Lake's robust capabilities to the Rust & Python ecosystem. In this talk, we'll delve into the triumphs, tribulations and lessons learned along the way. We'll explore how delta-rs has matured alongside the thriving Rust data ecosystem, adapting to its evolving landscape and overcoming the challenges of maintaining a complex data project. Join us as we share insights into the project's evolution, the symbiotic relationship between delta-rs and the Rust community, and the current hurdles and future directions that lie ahead. Audio for this session is delivered in the conference mobile app; you must bring your own headphones to listen.

Machine Learning Operations

2025-06-10 · Data + AI Summit 2025

talk

AI/ML Data Lakehouse Databricks Delta MLOps

This course will guide participants through a comprehensive exploration of machine learning model operations, focusing on MLOps and model lifecycle management. The initial segment covers essential MLOps components and best practices, providing participants with a strong foundation for effectively operationalizing machine learning models. In the latter part of the course, we will delve into the basics of the model lifecycle, demonstrating how to navigate it seamlessly using the Model Registry in conjunction with the Unity Catalog for efficient model management. By the course's conclusion, participants will have gained practical insights and a well-rounded understanding of MLOps principles, equipped with the skills needed to navigate the intricate landscape of machine learning model operations.
Pre-requisites: Familiarity with Databricks workspace and notebooks, familiarity with Delta Lake and Lakehouse, intermediate level knowledge of Python (e.g. understanding of basic MLOps concepts and practices as well as infrastructure and importance of monitoring MLOps solutions)
Labs: Yes
Certification Path: Databricks Certified Machine Learning Associate

Pushing the Limits of What Your Warehouse Can Do Using Python and Databricks

2025-06-10 · Data + AI Summit 2025 Watch

lightning_talk

by Jakob Mund (Databricks)

Cloud Computing Databricks SQL

SQL warehouses in Databricks can run more than just SQL. Join this session to learn how to get more out of your SQL warehouses, and any tools built on top of them, by leveraging Python. After attending this session, you will be familiar with Python user-defined functions and know how to bring in custom dependencies from PyPI or as a custom wheel, and even how to securely invoke cloud services with performance at scale.

Lakeflow Declarative Pipelines Integrations and Interoperability: Get Data From — and to — Anywhere

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Ryan Nienhuis (Databricks)

API Azure Cosmos Data Lakehouse Delta ETL/ELT Kafka MongoDB Spark

This session is repeated. In this session, you will learn how to integrate Lakeflow Declarative Pipelines with external systems in order to ingest and send data virtually anywhere. Lakeflow Declarative Pipelines is most often used for ingestion and ETL into the Lakehouse. New capabilities like the Lakeflow Declarative Pipelines Sinks API and added support for Python Data Source and ForEachBatch have opened up Lakeflow Declarative Pipelines to support almost any integration. This includes popular Apache Spark™ integrations like JDBC, Kafka, external and managed Delta tables, Azure Cosmos DB, MongoDB and more.

Spark 4.0 and Delta 4.0 For Streaming Data

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Bryce Bartmann (Shell)

AI/ML Delta JSON Spark Data Streaming

Real-time data is one of the most important datasets for any Data and AI Platform across any industry. Spark 4.0 and Delta 4.0 include new features that make ingestion and querying of real-time data better than ever before, such as:
- Python custom data sources for simple ingestion of streaming and batch time series data sources using Spark
- Variant types for managing variable data types and JSON payloads that are common in the real-time domain
- Delta liquid clustering for simple data clustering without the overhead or complexity of partitioning
In this presentation you will learn how data teams can leverage these latest features to build industry-leading, real-time data products using Spark and Delta. The presentation includes real-world examples and metrics of the improvements these features make in performance and processing of data in the real-time space.

No-Code Change in Your Python UDF for Arrow Optimization

2025-06-10 · Data + AI Summit 2025 Watch

lightning_talk

by Hyukjin Kwon (Databricks)

API Arrow Pandas Spark

Apache Spark™ has introduced Arrow-optimized APIs such as Pandas UDFs and the Pandas Functions API, providing high performance for Python workloads. Yet, many users continue to rely on regular Python UDFs due to their simple interface, especially when advanced Python expertise is not readily available. This talk introduces a powerful new feature in Apache Spark that brings Arrow optimization to regular Python UDFs. With this enhancement, users can leverage performance gains without modifying their existing UDFs — simply by enabling a configuration setting or toggling a UDF-level parameter. Additionally, we will dive into practical tips and features for using Arrow-optimized Python UDFs effectively, exploring their strengths and limitations. Whether you’re a Spark beginner or an experienced user, this session will allow you to achieve the best of both simplicity and performance in your workflows with regular Python UDFs.
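The two opt-in routes the abstract mentions (a configuration setting and a UDF-level parameter) might look roughly like the sketch below. The function `shout` and the helper `as_arrow_udf` are hypothetical names. The point is that the function body stays an ordinary row-at-a-time Python function, and only the way it is wrapped changes.

```python
from typing import Optional


def shout(text: Optional[str]) -> Optional[str]:
    """An ordinary row-at-a-time Python function; nothing Arrow-specific here."""
    if text is None:
        return None
    return text.upper() + "!"


def as_arrow_udf():
    """Wrap `shout` as an Arrow-optimized Python UDF (requires PySpark 3.5+)."""
    from pyspark.sql.functions import udf

    # useArrow=True opts this single UDF into Arrow-based serialization.
    return udf(shout, returnType="string", useArrow=True)
```

To enable the optimization for every regular Python UDF in a session instead, the setting `spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")` can be used, assuming an active `spark` session. Type coercion can differ between the pickled and Arrow paths, so results are worth comparing on a sample before switching a production job.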

CI/CD for Databricks: Advanced Asset Bundles and GitHub Actions

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Dustin Vannoy (Databricks)

CI/CD Databricks GitHub

This session is repeated. Databricks Asset Bundles (DABs) provide a way to use the command line to deploy and run a set of Databricks assets, such as notebooks, Python code, Lakeflow Declarative Pipelines and workflows. To automate deployments, you create a deployment pipeline that uses the power of DABs along with other validation steps to ensure high-quality deployments. In this session you will learn how to automate CI/CD processes for Databricks while following best practices to keep deployments easy to scale and maintain. After a brief explanation of why Databricks Asset Bundles are a good option for CI/CD, we will walk through a working project including advanced variables, target-specific overrides, linting, integration testing and automatic deployment upon code review approval. You will leave the session clear on how to build your first GitHub Action using DABs.

talk-data.com
