talk-data.com

Event

Data + AI Summit 2025

2025-06-09 – 2025-06-13 Databricks Summit

Activities tracked

715

Sessions & talks

Showing 551–575 of 715 · Newest first

Power BI and Databricks: Practical Best Practices

2025-06-10 Watch
talk
Hobbs Hobbs (Databricks)

This session is repeated. Power BI has long been the dominant BI tool in the market. In this session, we'll discuss how to get the most out of Power BI and Databricks, beginning with high-level architecture and moving down into detailed how-to guides for troubleshooting common failure points. At the end, you'll receive a cheat sheet that summarizes those best practices in an easy-to-reference format.

Scaling Data Governance: How Unity Catalog is Empowering Picpay's Data Governance Strategy

2025-06-10 Watch
talk
Lucas Morelato (PicPay) , Gustavo Tadao Okida (PicPay)

With massive data volume and complexity, scaling data governance became a significant challenge. Centralizing metadata management, ensuring regulatory compliance and controlling data access across multiple platforms turned out to be critical to maintaining efficiency and trust.

Scaling Demand Forecasting at Nikon: Automating Camera Accessories Sales Planning with Databricks

2025-06-10 Watch
talk
Heya Ouyang (Nikon Corporation)

At Nikon, camera accessories are essential in meeting the diverse needs of professional photographers worldwide, making their timely availability a priority. Forecasting accessories, however, presents unique challenges including dependencies on parent products, sparse demand patterns, and managing predictions for thousands of items across global subsidiaries. To address this, we leveraged Databricks' unified data and AI platform to develop and deploy an automated, scalable solution for accessory sales planning. Our solution employs a hybrid approach that auto-selects the best algorithm from a suite of ML and time-series models, incorporating anomaly detection and methods to handle sparse and low-demand scenarios. MLflow is utilized to automate model logging and versioning, enabling efficient management and scalable deployment. The framework includes data preparation, model selection and training, performance tracking, prediction generation, and output processing for downstream systems.

Scaling XGBoost With Spark Connect ML on Grace Blackwell

2025-06-10 Watch
talk
Bobby Wang (NVIDIA) , Jiaming Yuan (NVIDIA Semiconductor Co., Ltd)

XGBoost is one of the leading off-the-shelf gradient-boosting algorithms for analyzing tabular datasets. Unlike deep learning, gradient-boosted decision trees require the entire dataset to be in memory for efficient model training. To overcome this limitation, XGBoost features a distributed out-of-core implementation that fetches data in batches, which benefits significantly from the latest NVIDIA GPUs and NVLink-C2C's ultra-high bandwidth. In this talk, we will share our work on optimizing XGBoost using the Grace Blackwell superchip. The fast chip-to-chip link between the CPU and the GPU enables XGBoost to scale up without compromising performance. Our work has effectively increased XGBoost's training capacity to over 1.2TB on a single node. The approach is scalable to GPU clusters using Spark, enabling XGBoost to handle terabytes of data efficiently. We will demonstrate combining XGBoost's out-of-core algorithms with the new Spark Connect ML in Spark 4.0 for large model training workflows.
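The core of the out-of-core approach is that training statistics are accumulated batch by batch, so the full dataset never needs to be resident in memory. A minimal plain-Python sketch of that access pattern (an illustration of the idea only, not XGBoost's actual external-memory DataIter API):

```python
# Sketch of the out-of-core pattern: statistics are accumulated batch by
# batch, so only one chunk of the dataset is resident in memory at a time.
# This illustrates the idea; it is not XGBoost's actual DataIter API.

def batched(rows, batch_size):
    """Yield fixed-size chunks, standing in for fetching batches from disk."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def histogram_sums(batches, n_bins=4, lo=0.0, hi=1.0):
    """Accumulate per-bin counts incrementally, one batch at a time,
    the same access pattern histogram-based tree building relies on."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for batch in batches:
        for x in batch:
            b = min(int((x - lo) / width), n_bins - 1)
            counts[b] += 1
    return counts

data = [i / 100 for i in range(100)]           # pretend this lives on disk
out_of_core = histogram_sums(batched(data, batch_size=16))
in_memory = histogram_sums([data])             # single full-dataset "batch"
assert out_of_core == in_memory                # identical statistics
```

Because the per-batch accumulation is exact, the histograms (and hence the trees built from them) match the in-memory result; the batch size only controls peak memory.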

Spark 4.0 and Delta 4.0 For Streaming Data

2025-06-10 Watch
talk
Bryce Bartmann (Shell)

Real-time data is among the most important datasets for any data and AI platform, across any industry. Spark 4.0 and Delta 4.0 include new features that make ingestion and querying of real-time data better than ever before, such as:
- Python custom data sources, for simple ingestion of streaming and batch time-series data sources
- Spark Variant types, for managing the variable data types and JSON payloads that are common in the real-time domain
- Delta liquid clustering, for simple data clustering without the overhead or complexity of partitioning
In this presentation you will learn how data teams can leverage these latest features to build industry-leading, real-time data products using Spark and Delta, with real-world examples and metrics of the improvements they bring to performance and processing of real-time data.

Spark Connect: Flexible, Local Access to Apache Spark at Scale

2025-06-10 Watch
talk
James Malone (Databricks)

What if you could run Spark jobs without worrying about clusters, versions and upgrades? Did you know Spark has this functionality built in today? Join us to dig into this functionality, Spark Connect, and how it works: abstracting Spark clusters away in favor of the DataFrame API and unresolved logical plans. You will learn some of the cool things Spark Connect unlocks, including:
- Moving from thinking about clusters to just thinking about jobs
- Making Spark code more portable and platform-agnostic
- Enabling support for languages such as Go

Unlocking the Future of Dairy Farming: Leveraging Data Marketplaces at Lely

2025-06-10 Watch
talk
Simon Krejci (Lely) , Bulut Ficici (Lely)

Lely, a Dutch company specializing in dairy farming robotics, helps farmers with advanced solutions for milking, feeding and cleaning. This session explores Lely’s implementation of an Internal Data Marketplace, built around Databricks' Private Exchange Marketplace. The marketplace serves as a central hub for data teams and business users, offering seamless access to data, analytics and dashboards. Powered by Delta Sharing, it enables secure, private listing of data products across business domains, including notebooks, views, models and functions. This session covers the pros and cons of this approach, best practices for setting up a data marketplace and its impact on Lely’s operations. Real-world examples and insights will showcase the potential of integrating data-driven solutions into dairy farming. Join us to discover how data innovation drives the future of dairy farming through Lely’s experience.

AI Agents in Action: Structuring Unstructured Data on Demand With Databricks and Unstructured

2025-06-10 Watch
lightning_talk
Christopher Maddock (Unstructured)

LLM agents aren’t just answering questions — they’re running entire workflows. In this talk, we’ll show how agents can autonomously ingest, process and structure unstructured data using Unstructured, with outputs flowing directly into Databricks. Powered by the Model Context Protocol (MCP), agents can interface with Unstructured’s full suite of capabilities — discovering documents across sources, building ephemeral workflows and exporting structured insights into Delta tables. We’ll walk through a demo where an agent responds to a natural language request, dynamically pulls relevant documents, transforms them into usable data and surfaces insights — fast. Join us for a sneak peek into the future of AI-native data workflows, where LLMs don’t just assist — they operate.

Breaking Silos: Cigna’s Journey to Seamless Data Sharing with Delta Sharing

2025-06-10 Watch
lightning_talk
Jay Ehlen (Evernorth Health Services) , Nick De Young (The Cigna Group)

As data ecosystems grow increasingly complex, the ability to share data securely, seamlessly, and in real time has become a strategic differentiator. In this session, Cigna will showcase how Delta Sharing on Databricks has enabled them to modernize data delivery, reduce operational overhead, and unlock new market opportunities. Learn how Cigna achieved significant savings by streamlining operations, compute, and platform overhead for just one use case. Explore how decentralizing data ownership—transitioning from hyper-centralized teams to empowered product owners—has simplified delivery and accelerated innovation. Most importantly, see how this modern open data-sharing framework has positioned Cigna to win contracts they previously couldn’t, by enabling real-time, cross-organizational data collaboration with external partners. Join us to hear how Cigna is using Delta Sharing not just as a technical enabler, but as a business catalyst.

Cracking Complex Documents with Databricks Mosaic AI

2025-06-10 Watch
lightning_talk
Gavi Regunath (Advancing Analytics)

In this session, we will share how we are transforming the way organizations process unstructured and non-standard documents using Mosaic AI and agentic patterns within the Databricks ecosystem. We have developed a scalable pipeline that turns complex legal and regulatory content into structured, tabular data. We will walk through the full architecture, which includes Unity Catalog for secure and governed data access, Databricks Vector Search for intelligent indexing and retrieval, and Databricks Apps to deliver clear insights to business users. The solution supports multiple languages and formats, making it suitable for teams working across different regions. We will also discuss some of the key technical challenges we addressed, including handling parsing inconsistencies, grounding model responses and ensuring traceability across the entire process. If you are exploring how to apply GenAI and large language models, this session is for you. Audio for this session is delivered in the conference mobile app; you must bring your own headphones to listen.

From Spaghetti Bowl Pipeline to Lakeflow Declarative Pipelines Efficiency

2025-06-10
lightning_talk
Peter Jones (Intermountain Healthcare)

In today's data-driven world, the ability to efficiently manage and transform data is crucial for any organization. This presentation will explore the process of converting a complex and messy workflow into a clean and simple Lakeflow Declarative Pipeline at a large integrated health system, Intermountain Health. Alteryx is a powerful tool for data preparation and blending, but as workflows grow in complexity, they can become difficult to manage and maintain. Lakeflow Declarative Pipelines, on the other hand, offers a more democratized, streamlined and scalable approach to data engineering, leveraging the power of Apache Spark and Delta Lake. We will begin by examining a typical legacy workflow, identifying common pain points such as tangled logic, performance bottlenecks and maintenance challenges. Next, we will demonstrate how to translate this workflow into a Lakeflow Declarative Pipeline, highlighting key steps such as data transformation, validation and delivery.

Learn to Program Not Write Prompts with DSPy

2025-06-10 Watch
lightning_talk
Austin Choi (Databricks)

Writing prompts for our GenAI applications is tedious, and the results are unmaintainable. A proper software development lifecycle requires proper testing and maintenance, something incredibly difficult to do on a block of text. Our current prompt engineering best practices have largely been manual trial and error, testing which of our prompts work well in certain situations. This process worsens as our prompts become more complex, adding multiple tasks and functionality within one long, singular prompt. Enter DSPy, your PROGRAMMATIC way of building GenAI applications. Learn how DSPy allows you to modularize your prompt into modules and enforce typing through signatures. Then, utilize state-of-the-art algorithms to optimize the prompts and weights against your evaluation datasets, just like machine learning! We will compare DSPy to a restaurant to help illustrate and demo DSPy’s capabilities. It's time to start programming, rather than prompting, again!
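DSPy's actual API differs, but its central move, replacing one long prompt string with typed signatures and composable modules, can be illustrated in plain Python. The Signature and Module classes below are invented for illustration, not DSPy's real classes:

```python
from dataclasses import dataclass

# Illustration of DSPy's core idea in plain Python: a typed "signature"
# declares a task's inputs and outputs, and a "module" owns one task's
# prompt, so prompts become small testable units instead of one long
# string. These classes are invented; DSPy's real API differs.

@dataclass
class Signature:
    inputs: tuple
    outputs: tuple

class Module:
    def __init__(self, signature, template):
        self.signature = signature
        self.template = template  # the optimizable "weight" of the module

    def render(self, **kwargs):
        missing = set(self.signature.inputs) - set(kwargs)
        if missing:  # enforce the declared inputs at the boundary
            raise ValueError(f"missing inputs: {missing}")
        return self.template.format(**kwargs)

summarize = Module(
    Signature(inputs=("document",), outputs=("summary",)),
    "Summarize in one sentence: {document}",
)
prompt = summarize.render(document="DSPy modularizes prompts.")
```

An optimizer in this picture would rewrite each module's `template` against an evaluation set, which is only tractable because each module does exactly one task.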

Petabyte-Scale On-Chain Insights: Real-Time Intelligence for the Next-Gen Financial Backbone

2025-06-10 Watch
lightning_talk
Leo Liang (CipherOwl Inc)

We’ll explore how CipherOwl Inc. constructed a near real-time, multi-chain data lakehouse to power anti-money laundering (AML) monitoring at petabyte scale. We will walk through the end-to-end architecture, which integrates cutting-edge open-source technologies and AI-driven analytics to handle massive on-chain data volumes seamlessly. Off-chain intelligence complements this to meet rigorous AML requirements. At the core of our solution is ChainStorage, an open-source project started by Coinbase that provides robust blockchain data ingestion and block-level serving. We enhanced it with Apache Spark™ and Apache Arrow™ for high-throughput processing and efficient data serialization, backed by Delta Lake and Kafka. For the serving layer, we employ StarRocks to deliver lightning-fast SQL analytics over vast datasets. Finally, our system incorporates machine learning and AI agents for continuous data curation and near real-time insights, which are crucial for tackling on-chain AML challenges.

Sponsored by: Deloitte | Accelerating Biopharmaceutical Breakthroughs with an Innovative Enterprise Data Strategy

2025-06-10 Watch
lightning_talk
Shri Chary (Deloitte)

In the rapidly evolving life sciences and healthcare industry, leveraging data-as-a-product is crucial for driving innovation and achieving business objectives. Join us to explore how Deloitte is revolutionizing data strategy solutions by overcoming challenges such as data silos, poor data quality, and lack of real-time insights with the Databricks Data Intelligence Platform. Learn how effective data governance, seamless data integration, and scalable architectures support personalized medicine, regulatory compliance, and operational efficiency. This session will highlight how these strategies enable biopharma companies to transform data into actionable insights, accelerate breakthroughs and enhance life sciences outcomes.

Sponsored by: Neo4j | Get Your Data AI-Ready: Knowledge Graphs & GraphRAG for GenAI Success

2025-06-10 Watch
lightning_talk
Pramod Borkar (Neo4j)

Enterprise-grade GenAI needs a unified data strategy for accurate, reliable results. Learn how knowledge graphs make structured and unstructured data AI-ready while enabling governance and transparency. See how GraphRAG (retrieval-augmented generation with knowledge graphs) drives real success: companies like Klarna have deployed GenAI chatbots grounded in knowledge graphs, improving productivity and trust, while a major gaming company achieved 10x faster insights. We’ll share real examples and practical steps for successful GenAI deployment.
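As a toy sketch of the GraphRAG retrieval step (the graph contents below are invented for illustration), grounding a response means pulling an entity's neighborhood of facts from the knowledge graph before generation, so every claim traces back to an explicit edge:

```python
# Toy GraphRAG retrieval step: before generating an answer, fetch the
# entity's neighborhood of (subject, relation, object) facts from a
# knowledge graph so the LLM is grounded in explicit, traceable
# relationships. Graph contents are invented for illustration.

graph = {
    "Klarna": [("industry", "fintech"), ("deployed", "GenAI chatbot")],
    "GenAI chatbot": [("grounded_in", "knowledge graph")],
}

def retrieve_neighborhood(entity, depth=1):
    """Collect triples reachable within `depth` hops of the entity."""
    triples, frontier = [], [entity]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for rel, obj in graph.get(node, []):
                triples.append((node, rel, obj))
                next_frontier.append(obj)
        frontier = next_frontier
    return triples

# Triples like these would be serialized into the prompt as context.
context = retrieve_neighborhood("Klarna", depth=2)
```

In a production system the graph lives in a database such as Neo4j and retrieval is a graph query rather than a Python walk, but the grounding step has this shape.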

Sponsored by: SAP | SAP and Databricks Open a Bold New Era of Data and AI

2025-06-10 Watch
lightning_talk
H Nair (SAP)

SAP and Databricks have formed a landmark partnership that brings together SAP's deep expertise in mission-critical business processes and semantically rich data with Databricks' industry-leading capabilities in AI, machine learning, and advanced data engineering. From curated, SAP-managed data products to zero-copy Delta Sharing integration, discover how SAP Business Data Cloud empowers data and AI professionals to build AI solutions that unlock unparalleled business insights using trusted business data.

Sponsored by: Sigma | Flogistix by Flowco, and the Role of Data in Responsible Energy Production

2025-06-10 Watch
lightning_talk
Ali Sylvester (Flogistix) , Danny Burrows (Flowco)

As global energy demands continue to rise, organizations must boost efficiency while staying environmentally responsible. Flogistix uses Sigma and Databricks to build a unified data architecture for real-time, data-driven decisions in vapor recovery systems. With Sigma on the Databricks Data Intelligence Platform, Flogistix gains precise operational insights and identifies optimization opportunities that reduce emissions, streamline workflows, and meet industry regulations. This empowers everyone, from executives to field mechanics, to drive sustainable resource production. Discover how advanced analytics are transforming energy practices for a more responsible future.

AI and Genie: Analyzing Healthcare Improvement Opportunities

2025-06-10 Watch
talk
Jay Sharma (Premier Inc) , Tim Riddle (Premier Inc)

This session is repeated. Improving healthcare impacts us all. We highlight how Premier Inc. took risk-adjusted patient data from more than 1,300 member hospitals across America, applying a natural language interface using AI/BI Genie to allow our users to discover new insights. The stakes are high: each new insight surfaced represents potential care improvement and lives positively impacted. Using Genie and our AI-ready data in Unity Catalog, our team was able to stand up a Genie instance in three short days, bypassing the cost and time of custom modeling and application development. Additionally, Genie allowed our internal teams to generate complex SQL as much as 10 times faster than writing it by hand. As Genie and lakehouse apps continue to advance rapidly, we are excited to leverage these features by introducing Genie to as many as 20,000 users across hundreds of hospitals. This will support our members’ ongoing mission to enhance the care they provide to the communities they serve.

Apache Iceberg with Unity Catalog at HelloFresh

2025-06-10 Watch
talk
Max Schultze (HelloFresh) , Adam Komisarek (HelloFresh)

Table formats like Delta Lake and Iceberg have been game changers for pushing lakehouse architecture into modern enterprises. The acquisition of Tabular added Iceberg to the Databricks ecosystem, an open format that was already well supported by processing engines across the industry. At HelloFresh we are building a lakehouse architecture that integrates many touchpoints and technologies across the organization, so we chose Iceberg as the table format to bridge the gaps in our decentrally managed tech landscape. We are leveraging Unity Catalog as our Iceberg REST catalog of choice for storing metadata and managing tables. In this talk we will outline our architectural setup between Databricks, Spark, Flink and Snowflake and explain the native Unity Iceberg REST catalog, as well as catalog federation towards connected engines. We will highlight the impact on our business and discuss the advantages and lessons learned from our early-adopter experience.
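For readers curious what such a setup looks like, the sketch below shows the general shape of the Spark configuration an external engine would use to read Unity Catalog tables over its Iceberg REST interface. The property names, endpoint path and auth mechanism here are assumptions to verify against current Databricks and Apache Iceberg documentation; the workspace URL and token are placeholders:

```python
# Sketch of Spark catalog properties for Unity Catalog's Iceberg REST
# interface. Property names, the endpoint path, and the auth mechanism
# are assumptions to verify against current Databricks and Iceberg
# docs; <workspace-url> and the token are placeholders.

iceberg_rest_conf = {
    "spark.sql.catalog.unity": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.unity.type": "rest",
    "spark.sql.catalog.unity.uri":
        "https://<workspace-url>/api/2.1/unity-catalog/iceberg",
    "spark.sql.catalog.unity.token": "<personal-access-token>",
}
# Each pair would be passed via spark-submit --conf or SparkSession config.
```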

AT&T AutoClassify: Unified Multi-Head Binary Classification From Unlabeled Text

2025-06-10 Watch
talk
Hien Lam (AT&T) , Colton Peltier (Databricks)

We present AT&T AutoClassify, built jointly by AT&T's Chief Data Office (CDO) and Databricks Professional Services: a novel end-to-end system for automatic multi-head binary classification of unlabeled text data. Our approach automates the challenge of creating labeled datasets and training multi-head binary classifiers with minimal human intervention. Starting only from a corpus of unlabeled text and a list of desired labels, AT&T AutoClassify leverages advanced natural language processing techniques to automatically mine relevant examples from raw text, fine-tune embedding models and train individual classifier heads for multiple true/false labels. This approach can reduce LLM classification costs by 1,000x, making it highly efficient in operational cost. The end result is a highly optimized, low-cost model, servable in Databricks, capable of taking raw text and producing multiple binary classifications. An example use case using call transcripts will be examined.
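The multi-head shape of such a system, one shared text representation feeding several independent binary heads, can be sketched in plain Python. The seed keywords and labels below are invented, and the real system mines examples and fine-tunes embedding models rather than matching keywords:

```python
# Sketch of the multi-head pattern: one shared text representation feeds
# several independent binary heads, one per label. Seed keywords and
# labels are invented; the real system mines training examples and
# fine-tunes embeddings rather than matching keywords.

def embed(text):
    """Stand-in shared representation: a lowercase bag of words."""
    return set(text.lower().split())

class KeywordHead:
    """One binary head, weakly supervised by seed keywords for its label."""
    def __init__(self, seed_keywords):
        self.keywords = set(seed_keywords)

    def predict(self, features):
        return bool(features & self.keywords)

heads = {                      # multiple true/false labels, one head each
    "billing": KeywordHead({"invoice", "charge", "refund"}),
    "outage":  KeywordHead({"down", "outage", "disconnected"}),
}

def classify(text):
    features = embed(text)     # computed once, shared across all heads
    return {label: head.predict(features) for label, head in heads.items()}

result = classify("Customer asked for a refund after the outage")
```

The cost advantage in the abstract comes from this structure: the expensive representation is computed once per document, and each extra label adds only a tiny head rather than another LLM call.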

Bayada’s Snowflake-to-Databricks Migration: Transforming Data for Speed & Efficiency

2025-06-10 Watch
talk
Venkatesh Guruprasad (BAYADA Home Health Care) , PradeepKumar jain Vimalraj (Tredence Inc) , Elaine O'Neill (BAYADA Home Health Care)

Bayada is transforming its data ecosystem by consolidating Matillion+Snowflake and SSIS+SQL Server into a unified Enterprise Data Platform powered by Databricks. Using Databricks' Medallion architecture, this platform enables seamless data integration, advanced analytics and machine learning across critical domains like general ledger, recruitment and activity-based costing. Databricks was selected for its scalability, real-time analytics and ability to handle both structured and unstructured data, positioning Bayada for future growth. The migration aims to reduce data processing times by 35%, improve reporting accuracy and cut reconciliation efforts by 40%. Operational costs are projected to decrease by 20%, while real-time analytics is expected to boost efficiency by 15%. Join this session to learn how Bayada is leveraging Databricks to build a high-performance data platform that accelerates insights, drives efficiency and fosters innovation organization-wide.

Beyond Simple RAG: Unlocking Quality, Scale and Cost-Efficient Retrieval With Mosaic AI Vector Search

2025-06-10 Watch
talk
Ankit Vij (Databricks) , Adam Gurary (Databricks)

This session is repeated. Mosaic AI Vector Search is powering high-accuracy retrieval systems in production across a wide range of use cases — including RAG applications, entity resolution, recommendation systems and search. Fully integrated with the Databricks Data Intelligence Platform, it eliminates pipeline maintenance by automatically syncing data from source to index. Over the past year, customers have asked for greater scale, better quality out-of-the-box and cost-efficient performance. This session delivers on those needs — showcasing best practices for implementing high-quality retrieval systems and revealing major product advancements that improve scalability, efficiency and relevance. What you’ll learn:
- How to optimize Vector Search with hybrid retrieval and reranking for better out-of-the-box results
- Best practices for managing vector indexes with minimal operational overhead
- Real-world examples of how organizations have scaled and improved their search and recommendation systems
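Hybrid retrieval in this sense blends a lexical relevance score with a vector-similarity score and then reranks by the combination. A minimal plain-Python sketch, where both scorers are deliberately crude stand-ins for BM25 and a learned embedding model, and the corpus is invented:

```python
import math

# Minimal hybrid-retrieval sketch: a crude lexical score and a crude
# bag-of-words "embedding" cosine are blended, then documents are
# reranked by the combined score. Both scorers are stand-ins for BM25
# and a learned embedding model; the corpus is invented.

docs = [
    "delta sharing enables secure data collaboration",
    "vector search powers retrieval for rag applications",
    "unity catalog governs tables and models",
]

def lexical_score(query, doc):
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q)            # fraction of query terms matched

def embedding(text):
    vec = {}
    for word in text.split():             # bag-of-words stand-in embedding
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def hybrid_search(query, k=2, alpha=0.5):
    """Blend lexical and vector scores, then rerank by the combination."""
    scored = [
        (alpha * lexical_score(query, d)
         + (1 - alpha) * cosine(embedding(query), embedding(d)), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, reverse=True)[:k]]

top = hybrid_search("retrieval for rag")
```

The `alpha` weight is the key tuning knob: it trades exact-term precision against semantic recall, which is why out-of-the-box quality improves when the platform handles that blending for you.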

Building a Seamless Multi-Cloud Platform for Secure Portable Workloads

2025-06-10 Watch
talk
James Burns (Databricks) , Scott Reynolds (Databricks)

There are many challenges in making a data platform actually a platform: something that hides complexity. Data engineers and scientists are looking for a simple and intuitive abstraction so they can focus on their work, not on where it runs to maintain compliance, what credentials it uses to access data, or how it generates operational telemetry. At Databricks we’ve developed a data-centric approach to workload development and deployment that enables data workers to stop doing migrations and instead develop with confidence. Attend this session to learn how to run simple, secure and compliant global multi-cloud workloads at scale on Databricks.

Chaos to Clarity: Secure, Scalable, and Governed SaaS Ingestion through Lakeflow Connect and more

2025-06-10 Watch
talk
Krishna Bhupatiraju (Databricks) , Prashant Gupta (Databricks)

Ingesting data from SaaS systems sounds straightforward—until you hit API limits, miss SLAs, or accidentally ingest PII. Sound familiar? In this talk, we’ll share how Databricks evolved from scrappy ingestion scripts to a unified, secure, and scalable ingestion platform. Along the way, we’ll highlight the hard lessons, the surprising pitfalls, and the tools that helped us level up. Whether you’re just starting to wrangle third-party data or looking to scale while handling governance and compliance, this session will help you think beyond pipelines and toward platform thinking.

Crafting Business Brilliance: Leveraging Databricks SQL for Next-Gen Applications

2025-06-10 Watch
talk
Mohammad Shalchi (Haleon) , Wasim Ahmad (Databricks)

At Haleon, we've leveraged Databricks APIs and serverless compute to develop customer-facing applications for our business. This innovative solution enables us to efficiently deliver SAP invoice and order management data through front-end applications developed and served via our API Gateway. The Databricks lakehouse architecture has been instrumental in eliminating the friction associated with directly accessing SAP data from operational systems, while enhancing our performance capabilities. Our system achieved response times of less than 3 seconds from the API call, with ongoing efforts to optimise this performance. This architecture not only streamlines our data and application ecosystem but also paves the way for integrating GenAI capabilities with robust governance measures for our future infrastructure. The implementation of this solution has yielded significant benefits, including a 15% reduction in customer service costs and a 28% increase in productivity for our customer support team.