talk-data.com

Topic: API (Application Programming Interface)
Tags: integration, software_development, data_exchange
856 activities tagged

Activity Trend: peak of 65 activities per quarter, 2020-Q1 to 2026-Q1

Activities

856 activities · Newest first

Sponsored by: Boomi, LP | From Pipelines to Agents: Manage Data and AI on One Platform for Maximum ROI

In the age of agentic AI, competitive advantage lies not only in AI models, but in the quality of the data agents reason on and the agility of the tools that feed them. To fully realize the ROI of agentic AI, organizations need a platform that enables high-quality data pipelines and provides scalable, enterprise-grade tools. In this session, discover how a unified platform for integration, data management, MCP server management, API management, and agent orchestration can help you to bring cohesion and control to how data and agents are used across your organization.

Sponsored by: EY | Unlocking Value Through AI at Takeda Pharmaceuticals

In the rapidly evolving landscape of pharmaceuticals, the integration of AI and GenAI is transforming how organizations operate and deliver value. We will explore the profound impact of the AI program at Takeda Pharmaceuticals and the central role of Databricks. We will delve into eight pivotal AI/GenAI use cases that enhance operational efficiency across commercial, R&D, manufacturing, and back-office functions, including these capabilities:
Responsible AI Guardrails: Scanners that validate and enforce responsible AI controls on GenAI solutions
Reusable Databricks Native Vectorization Pipeline: A scalable solution enhancing data processing with quality and governance
One-Click Deployable RAG Pattern: Simplifying deployment for AI applications, enabling rapid experimentation and innovation
AI Asset Registry: A repository for foundational models, vector stores, and APIs, promoting reuse and collaboration

Breaking Barriers: Building Custom Spark 4.0 Data Connectors with Python

Building a custom Spark data source connector once required Java or Scala expertise, making it complex and limiting. This left many proprietary data sources without public SDKs disconnected from Spark. Additionally, data sources with Python SDKs couldn't harness Spark’s distributed power. Spark 4.0 changes this with a new Python API for data source connectors, allowing developers to build fully functional connectors without Java or Scala. This unlocks new possibilities, from integrating proprietary systems to leveraging untapped data sources. Supporting both batch and streaming, this API makes data ingestion more flexible than ever. In this talk, we’ll demonstrate how to build a Spark connector for Excel using Python, showcasing schema inference, data reads/writes and streaming support. Whether you're a data engineer or Spark enthusiast, you’ll gain the knowledge to integrate Spark with any data source — entirely in Python.
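To make the API shape concrete, here is a minimal sketch of a batch reader built on the Spark 4.0 Python Data Source API (pyspark.sql.datasource). It is not the talk's Excel connector: it yields hard-coded rows, and the format name and columns are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource, DataSourceReader


class GreetingDataSource(DataSource):
    """Toy batch source: two hard-coded rows instead of a real Excel file."""

    @classmethod
    def name(cls):
        return "greeting"  # used as the format name in spark.read.format(...)

    def schema(self):
        # A real connector could infer this from the source (e.g., sheet headers).
        return "name string, age int"

    def reader(self, schema):
        return GreetingReader(self.options)


class GreetingReader(DataSourceReader):
    def __init__(self, options):
        self.options = options

    def read(self, partition):
        # Yield plain tuples matching the declared schema.
        yield ("Alice", 30)
        yield ("Bob", 25)


spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(GreetingDataSource)
spark.read.format("greeting").load().show()
```

An Excel connector like the one in the talk would follow the same skeleton, with read() delegating to a Python Excel library and schema() inferring column types from the sheet.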

Sponsored by: Google Cloud | Building Powerful Agentic Ecosystems with Google Cloud's A2A

This session unveils Google Cloud's Agent2Agent (A2A) protocol, ushering in a new era of AI interoperability where diverse agents collaborate seamlessly to solve complex enterprise challenges. Join our panel of experts to discover how A2A empowers you to deeply integrate these collaborative AI systems with your existing enterprise data, custom APIs, and critical workflows. Ultimately, learn to build more powerful, versatile, and securely managed agentic ecosystems by combining specialized Google-built agents with your own custom solutions (Vertex AI or no-code). Extend this ecosystem further by serving these agents with Databricks Model Serving and governing them with Unity Catalog for consistent security and management across your enterprise.

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format. Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large records (e.g., images, tensors, embeddings), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics on par with SQL workloads. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.
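As a rough illustration of pairing the two (not code from the talk), the sketch below hands a small PySpark DataFrame to Lance via Arrow and reads it back. It assumes the lance and pyarrow packages are installed; the path and column names are invented.

```python
import lance
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A tiny DataFrame with an embedding column, standing in for multimodal data.
df = spark.createDataFrame(
    [(1, "cat", [0.1, 0.2, 0.3]), (2, "dog", [0.9, 0.8, 0.7])],
    "id INT, label STRING, embedding ARRAY<FLOAT>",
)

# Convert through pandas/Arrow for simplicity; a production pipeline would
# write partitions in parallel (e.g., via the Python Data Source API).
arrow_table = pa.Table.from_pandas(df.toPandas())
lance.write_dataset(arrow_table, "/tmp/demo.lance", mode="overwrite")

# Read it back; Lance can also build vector and full-text indexes on top of this.
ds = lance.dataset("/tmp/demo.lance")
print(ds.to_table().to_pandas())
```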

Sponsored by: Fivetran | Scalable Data Ingestion: Building custom pipelines with the Fivetran Connector SDK and Databricks

Organizations have hundreds of data sources, some of which are very niche or difficult to access. Incorporating this data into your lakehouse requires significant time and resources, hindering your ability to work on more value-add projects. Enter the Fivetran Connector SDK: a powerful new tool that enables your team to create custom pipelines for niche systems, custom APIs, and sources with specific data filtering requirements, seamlessly integrating with Databricks. During this session, Fivetran will demonstrate how to:
(1) Leverage the Connector SDK to build scalable connectors, enabling the ingestion of diverse data into Databricks
(2) Gain flexibility and control over historical and incremental syncs, delete capture, state management, multithreaded data extraction, and custom schemas
(3) Use practical examples, code snippets, and architectural considerations to overcome data integration challenges and unlock the full potential of your Databricks environment
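For orientation, the sketch below follows the published Connector SDK "hello world" pattern; the table, fields, and cursor logic are invented, and the exact method signatures should be checked against Fivetran's current SDK documentation.

```python
from fivetran_connector_sdk import Connector
from fivetran_connector_sdk import Operations as op


def schema(configuration: dict):
    # Declare destination tables and their primary keys.
    return [{"table": "orders", "primary_key": ["order_id"]}]


def update(configuration: dict, state: dict):
    # Incremental sync: resume from the last checkpointed cursor.
    cursor = state.get("cursor", 0)
    # A real connector would page through a niche or custom API here;
    # hard-coded rows keep the example self-contained.
    for record in [{"order_id": cursor + 1, "amount": 9.99},
                   {"order_id": cursor + 2, "amount": 4.50}]:
        yield op.upsert(table="orders", data=record)
        cursor = record["order_id"]
    yield op.checkpoint(state={"cursor": cursor})


connector = Connector(update=update, schema=schema)

if __name__ == "__main__":
    connector.debug()  # local test run before deploying to Fivetran
```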

What’s New in PySpark: TVFs, Subqueries, Plots, and Profilers

PySpark’s DataFrame API is evolving to support more expressive and modular workflows. In this session, we’ll introduce two powerful additions: table-valued functions (TVFs) and the new subquery API. You’ll learn how to define custom TVFs using Python User-Defined Table Functions (UDTFs), including support for polymorphism, and how subqueries can simplify complex logic. We’ll also explore how lateral joins connect these features, followed by practical tools for the PySpark developer experience—such as plotting, profiling, and a preview of upcoming capabilities like UDF logging and a Python-native data source API. Whether you're building production pipelines or extending PySpark itself, this talk will help you take full advantage of the latest features in the PySpark ecosystem.
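As a small, self-contained illustration of the UDTF piece (not the session's code), the sketch below defines a Python UDTF, calls it directly, and registers it for SQL; the function and column names are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, udtf


@udtf(returnType="word: string, length: int")
class SplitWords:
    def eval(self, text: str):
        # A UDTF can emit many output rows per input row.
        for word in text.split(" "):
            yield word, len(word)


spark = SparkSession.builder.getOrCreate()

# Call it directly as a relation-producing function...
SplitWords(lit("table valued functions in pyspark")).show()

# ...or register it and use it from SQL (including in lateral joins).
spark.udtf.register("split_words", SplitWords)
spark.sql("SELECT * FROM split_words('hello world')").show()
```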

Summary: In this episode of the Data Engineering Podcast, Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
This is a pharmaceutical Ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.
This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms.
Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
Your host is Tobias Macey, and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads.
Interview:
Introduction
How did you get involved in the area of data management?
Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?
Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
What are the foundational architectural modifications that you had to make to enable those capabilities?
For the vector storage and indexing, what modifications did you have to make to Iceberg?
What was your reasoning for not using a format like Lance?
For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
When is Starburst/lakehouse the wrong choice for a given AI use case?
What do you have planned for the future of AI on Starburst?
Contact Info: LinkedIn
Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements: Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it! Email [email protected] with your story.
Links: Starburst, Podcast Episode, AWS Athena, MCP == Model Context Protocol, LLM Tool Use, Vector Embeddings, RAG == Retrieval Augmented Generation, AI Engineering Podcast Episode, Starburst Data Products, Lance, LanceDB, Parquet, ORC, pgvector, Starburst Icehouse
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

From Imperative to Declarative Paradigm: Rebuilding a CI/CD Infrastructure Using Hatch and DABs

Building and deploying PySpark pipelines to Databricks should be effortless. However, our team at FreeWheel has long struggled with a convoluted and hard-to-maintain CI/CD infrastructure. It followed an imperative paradigm, demanding that every project implement custom scripts to build artifacts and deploy resources, and resulting in redundant boilerplate code and awkward interactions with the Databricks REST API. We set out to rebuild it from scratch, following a declarative paradigm instead. We will share how we were able to eliminate thousands of lines of code from our repository, create a fully configuration-driven infrastructure where projects can be easily onboarded, and improve the quality of our codebase using Hatch and Databricks Asset Bundles as our tools of choice. In particular, DAB has made deploying across our 3 environments a breeze, and has allowed us to quickly adopt new features as soon as they are released by Databricks.

Real-Time Analytics Pipeline for IoT Device Monitoring and Reporting

This session will show how we implemented a solution to support high-frequency data ingestion from smart meters. We implemented a robust API endpoint that interfaces directly with IoT devices. This API processes messages in real time from millions of distributed IoT devices and meters across the network. The architecture leverages cloud storage as a landing zone for the raw data, followed by a streaming pipeline built on Lakeflow Declarative Pipelines. This pipeline implements a multi-layer medallion architecture to progressively clean, transform and enrich the data. The pipeline operates continuously to maintain near real-time data freshness in our gold layer tables. These datasets connect directly to Databricks Dashboards, providing stakeholders with immediate insights into their operational metrics. This solution demonstrates how modern data architecture can handle high-volume IoT data streams while maintaining data quality and providing accessible real-time analytics for business users.
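To ground the description, here is a hedged sketch of the bronze and silver steps in the DLT-style Python API that Lakeflow Declarative Pipelines builds on; the paths, column names, and the quality expectation are placeholders, and the exact syntax may differ in your runtime.

```python
import dlt  # available inside a Lakeflow/DLT pipeline; `spark` is provided by the runtime
from pyspark.sql import functions as F


@dlt.table(comment="Raw smart-meter messages landed as JSON files")
def meter_readings_bronze():
    # Auto Loader incrementally picks up new files from the cloud-storage landing zone.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/iot/landing/meter_readings/")
    )


@dlt.table(comment="Cleaned readings with basic quality checks")
@dlt.expect_or_drop("non_negative_reading", "reading_kwh >= 0")
def meter_readings_silver():
    return (
        dlt.read_stream("meter_readings_bronze")
        .withColumn("event_ts", F.to_timestamp("event_time"))
        .dropDuplicates(["device_id", "event_ts"])
    )
```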

Simplify Data Ingest and Egress with the New Python Data Source API

Data engineering teams are frequently tasked with building bespoke ingest and/or egress solutions for myriad custom, proprietary, or industry-specific data sources or sinks. Many teams find this work cumbersome and time-consuming. Recognizing these challenges, Databricks interviewed numerous companies across different industries to better understand their diverse data integration needs. This comprehensive feedback led us to develop the Python Data Source API for Apache Spark™.
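The same API covers egress as well as ingest. The sketch below, loosely modeled on the documented writer pattern, sends each partition of a DataFrame to its own JSON-lines file; the format name, path option, and file layout are invented for illustration.

```python
import json
import os
from dataclasses import dataclass
from typing import Iterator

from pyspark import TaskContext
from pyspark.sql import Row, SparkSession
from pyspark.sql.datasource import DataSource, DataSourceWriter, WriterCommitMessage


@dataclass
class PartitionCommit(WriterCommitMessage):
    row_count: int


class JsonDirDataSource(DataSource):
    """Toy egress sink: each partition writes its rows to one JSON-lines file."""

    @classmethod
    def name(cls):
        return "json_dir"

    def writer(self, schema, overwrite: bool):
        return JsonDirWriter(self.options["path"])


class JsonDirWriter(DataSourceWriter):
    def __init__(self, path: str):
        self.path = path

    def write(self, rows: Iterator[Row]) -> WriterCommitMessage:
        # Runs once per partition on the executors (local disk when running locally).
        part_id = TaskContext.get().partitionId()
        os.makedirs(self.path, exist_ok=True)
        count = 0
        with open(os.path.join(self.path, f"part-{part_id}.json"), "w") as out:
            for row in rows:
                out.write(json.dumps(row.asDict()) + "\n")
                count += 1
        return PartitionCommit(row_count=count)


spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(JsonDirDataSource)
(spark.createDataFrame([(1, "a"), (2, "b")], "id INT, v STRING")
 .write.format("json_dir").option("path", "/tmp/json_dir_demo").mode("append").save())
```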

Turn Genie Into an Agent Using Conversation APIs

Transform your AI/BI Genie into a text-to-SQL powerhouse using the Genie Conversation APIs. This session explores how Genie functions as an intelligent agent, translating natural language queries into SQL to accelerate insights and enhance self-service analytics. You'll learn practical techniques for configuring agents, optimizing queries and handling errors — ensuring Genie delivers accurate, relevant responses in real time. A must-attend for teams looking to level up their AI/BI capabilities and deliver smarter analytics experiences.
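For orientation, here is a hedged sketch of driving Genie programmatically over REST. The endpoint paths and response fields are assumptions based on the Conversation API's documented shape and should be verified against the current Databricks REST reference; the space ID and question are placeholders.

```python
import os
import time

import requests

HOST = os.environ["DATABRICKS_HOST"]      # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]
SPACE_ID = os.environ["GENIE_SPACE_ID"]   # the Genie space acting as the "agent"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. Ask a natural-language question in a new conversation (path is an assumption).
start = requests.post(
    f"{HOST}/api/2.0/genie/spaces/{SPACE_ID}/start-conversation",
    headers=HEADERS,
    json={"content": "What were the top 5 products by revenue last quarter?"},
)
start.raise_for_status()
body = start.json()
conversation_id = body["conversation_id"]  # field names assumed; inspect the real payload
message_id = body["message_id"]

# 2. Poll the message until Genie has generated (and executed) the SQL.
while True:
    msg = requests.get(
        f"{HOST}/api/2.0/genie/spaces/{SPACE_ID}"
        f"/conversations/{conversation_id}/messages/{message_id}",
        headers=HEADERS,
    ).json()
    if msg.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# 3. The completed message carries the generated SQL and query results, which an
#    agent wrapper can hand back to the caller or to an orchestrating LLM.
print(msg)
```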

The Future of DSv2 in Apache Spark™

DSv2, Spark's next-generation Catalog API, is gaining traction among data source developers. It shifts complexity to Apache Spark™, improves connector reliability and unlocks new functionality such as catalog federation, MERGE operations, storage-partitioned joins, aggregate pushdown, stored procedures and more. This session covers the design of DSv2, its current strengths and gaps, and its evolving roadmap. It's intended for Spark users and developers working with data sources, whether custom-built or off-the-shelf.

Unified Advanced Analytics: Integrating Power BI and Databricks Genie for Real-time Insights

In today’s data-driven landscape, business users expect seamless, interactive analytics without having to switch between different environments. This presentation explores our web application that unifies a Power BI dashboard with Databricks Genie, allowing users to query and visualize insights from the same dataset within a single, cohesive interface. We will compare two integration strategies: one that leverages a traditional webpage enhanced by an Azure bot to incorporate Genie’s capabilities, and another that utilizes Databricks Apps to deliver a smoother, native experience. We use the Genie API to build this solution. Attendees will learn the architecture behind these solutions, key design considerations and challenges encountered during implementation. Join us to see live demos of both approaches, and discover best practices for delivering an all-in-one, interactive analytics experience.

Kernel, Catalog, Action! Reimagining our Delta-Spark Connector with DSv2

Delta Lake is redesigning its Spark connector through the combination of three key technologies: First, we're updating our Spark APIs to DSv2 to achieve deeper catalog integration and improved integration with the Spark optimizer. Second, we're fully integrating on top of Delta Kernel to take advantage of its simplified abstraction of Delta protocol complexities, accelerating feature adoption and improving maintainability. Third, we are transforming Delta to become a catalog-aware lakehouse format with Catalog Commits, enabling more efficient metadata management, governance and query performance. Join us to explore how we're advancing Delta Lake's architecture, pushing the boundaries of metadata management and creating a more intelligent, performant data lakehouse platform.

As first-party data becomes increasingly valuable to organizations, Walmart Data Ventures is dedicated to bringing to life new applications of Walmart’s first-party data to better serve its customers. Through Scintilla, its integrated insights ecosystem, Walmart Data Ventures continues to expand its offerings to deliver insights and analytics that drive collaboration between our merchants, suppliers, and operators. Scintilla users can now access Walmart data using Cloud Feeds, based on Databricks Delta Sharing technologies. In the past, Walmart used API-based data sharing models, which required users to possess certain skills and technical attributes that weren’t always available. Now, with Cloud Feeds, Scintilla users can more easily access data without a dedicated technical team behind the scenes making it happen. Attendees will gain valuable insights into how Walmart has built its robust data sharing architecture and strategies to design scalable and collaborative data sharing architectures in their own organizations.
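Cloud Feeds are built on Delta Sharing, and the open-protocol side of that can be sketched generically (this is not Walmart's or Scintilla's actual interface): the delta-sharing Python connector reads a provider-issued profile file and loads shared tables. The share, schema, and table names below are invented.

```python
import delta_sharing

# The data provider issues a profile file containing the sharing server
# endpoint and a bearer token for the recipient.
profile = "config.share"

# Discover what the provider has shared with you.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table into pandas (or use load_as_spark on a cluster).
df = delta_sharing.load_as_pandas(f"{profile}#retail_share.pos.store_sales")
print(df.head())
```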

Delta Kernel for Rust and Java

Delta Kernel makes it easy for engines and connectors to read and write Delta tables. It supports many Delta features and underpins robust connectors, including DuckDB, ClickHouse, Spice AI and delta-dotnet. In this session, we'll cover lessons learned about how to build a high-performance library that lets engines integrate the way they want, while not having to worry about the details of the Delta protocol. We'll talk through how we streamlined the API, as well as its changes and underlying motivations. We'll discuss new highlight features such as write support and the ability to do CDF scans. Finally, we'll cover the future roadmap for the Kernel project and what you can expect from it over the coming year.

Gaining Insight From Image Data in Databricks Using Multi-Modal Foundation Model API

Unlock the hidden potential in your image data without specialized computer vision expertise! This session explores how to leverage Databricks' multi-modal Foundation Model APIs to analyze, classify and extract insights from visual content. Learn how Databricks provides a unified API to understand images using powerful foundation models within your data workflows. Key takeaways:
Implementing efficient workflows for image data processing within your Databricks lakehouse
Understanding multi-modal foundation models for image understanding
Integrating image analysis with other data types for business insights
Using OpenAI-compatible APIs to query multi-modal models (see the sketch below)
Building end-to-end pipelines from image ingestion to model deployment
Whether analyzing product images, processing visual documents or building content moderation systems, you'll discover how to extract valuable insights from your image data within the Databricks ecosystem.
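As a sketch of the OpenAI-compatible pattern mentioned in the takeaways (not the session's code), the snippet below sends a base64-encoded image to a multi-modal model behind a Databricks serving endpoint; the endpoint/model name and file path are placeholders, so check your workspace for the models actually available.

```python
import base64
import os

from openai import OpenAI

# Databricks serving endpoints expose an OpenAI-compatible surface, so the
# standard OpenAI client works once pointed at the workspace.
client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

with open("product_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="my-multimodal-endpoint",  # placeholder endpoint name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the defects visible in this product photo."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```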