talk-data.com

Event

Data + AI Summit 2025

2025-06-09 – 2025-06-13 Databricks Summit

Activities tracked

425

Filtering by: AI/ML

Sessions & talks

Showing 401–425 of 425 · Newest first

Getting Started With Lakeflow Connect

2025-06-10
talk
Peter Pogorski (Databricks), Giselle Goicochea (Databricks)

Hundreds of customers are already ingesting data with Lakeflow Connect from SQL Server, Salesforce, ServiceNow, Google Analytics, SharePoint, PostgreSQL and more to unlock the full power of their data. Lakeflow Connect introduces built-in, no-code ingestion connectors from SaaS applications, databases and file sources to help unlock data intelligence. In this demo-packed session, you’ll learn how to ingest ready-to-use data for analytics and AI with a few clicks in the UI or a few lines of code. We’ll also demonstrate how Lakeflow Connect is fully integrated with the Databricks Data Intelligence Platform for built-in governance, observability, CI/CD, automated pipeline maintenance and more. Finally, we’ll explain how to use Lakeflow Connect in combination with downstream analytics and AI tools to tackle common business challenges and drive business impact.

Improve AI Training With the First Synthetic Personas Dataset Aligned to Real-World Distributions

2025-06-10
talk
Dane Corneil (NVIDIA), Yev Meyer (NVIDIA)

A big challenge in LLM development and synthetic data generation is ensuring data quality and diversity. While data incorporating varied perspectives and reasoning traces consistently improves model performance, procuring such data remains impossible for most enterprises. Human-annotated data struggles to scale, while purely LLM-based generation often suffers from distribution clipping and low entropy. In a novel compound AI approach, we combine LLMs with probabilistic graphical models and other tools to generate synthetic personas grounded in real demographic statistics. The approach allows us to address major limitations in bias, licensing, and persona skew of existing methods. We release the first open-source dataset aligned with real-world distributions and show how enterprises can leverage it with Gretel Data Designer (now part of NVIDIA) to bring diversity and quality to model training on the Databricks platform, all while addressing model collapse and data provenance concerns head-on.
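The distribution-alignment idea in this abstract can be sketched in a few lines of Python. This is a toy illustration only: the attribute, its distribution values, and the sampling approach are hypothetical stand-ins, not the NVIDIA/Gretel method, which combines LLMs with probabilistic graphical models.

```python
# Toy sketch: sample synthetic persona attributes so their frequencies match
# a target real-world distribution, rather than whatever an LLM happens to
# emit. The distribution below is made up for illustration.
import random

random.seed(0)  # deterministic for the example

age_band_dist = {"18-34": 0.3, "35-54": 0.4, "55+": 0.3}

# Draw persona attributes in proportion to the target distribution.
personas = random.choices(
    population=list(age_band_dist),
    weights=list(age_band_dist.values()),
    k=10_000,
)

share_35_54 = personas.count("35-54") / len(personas)  # close to 0.4
```

With enough samples, the empirical shares converge to the target distribution, which is the property "aligned to real-world distributions" refers to.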

Machine Learning Model Deployment

2025-06-10
talk

This course is designed to introduce three primary machine learning deployment strategies and illustrate the implementation of each strategy on Databricks. Following an exploration of the fundamentals of model deployment, the course delves into batch inference, offering hands-on demonstrations and labs for utilizing a model in batch inference scenarios, along with considerations for performance optimization. The second part of the course comprehensively covers pipeline deployment, while the final segment focuses on real-time deployment. Participants will engage in hands-on demonstrations and labs, deploying models with Model Serving and utilizing the serving endpoint for real-time inference. By mastering deployment strategies for a variety of use cases, learners will gain the practical skills needed to move machine learning models from experimentation to production. This course shows you how to operationalize AI solutions efficiently, whether it's automating decisions in real-time or integrating intelligent insights into data pipelines. Pre-requisites: Familiarity with Databricks workspace and notebooks, familiarity with Delta Lake and Lakehouse, intermediate level knowledge of Python (e.g. common Python libraries for DS/ML like Scikit-Learn, awareness of model deployment strategies) Labs: Yes Certification Path: Databricks Certified Machine Learning Associate
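The batch-versus-real-time distinction this course covers can be sketched in plain Python. This is a conceptual toy, not Databricks Model Serving code: the "model" and its scoring rule are hypothetical.

```python
# Toy "model": in practice this would be a model loaded from the registry.
def predict(features: dict) -> float:
    # Hypothetical scoring rule, for illustration only.
    return 0.8 if features["spend"] > 100 else 0.2

# Batch inference: score a whole table of records at once, on a schedule.
def batch_score(records: list[dict]) -> list[float]:
    return [predict(r) for r in records]

# Real-time inference: score one request as it arrives at a serving endpoint.
def serve_request(request: dict) -> dict:
    return {"prediction": predict(request)}

scores = batch_score([{"spend": 150}, {"spend": 40}])
response = serve_request({"spend": 150})
```

The trade-off the course explores is visible even here: batch amortizes cost over many rows, while serving optimizes latency for a single request.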

Open Source Unity Catalog: Getting Started, Best Practices and Governance at Scale

2025-06-10
talk
Tathagata Das (Databricks), Ben Wilson (Databricks)

Learn how to use open source Unity Catalog (UC OSS), see what features are available, and get an introduction to the ecosystem. We'll dive into the latest release and get hands-on with demos for working with your UC data and AI assets — including tables, volumes, models and AI functions.

Scaling Sales Excellence: How Databricks Uses Its Own Tech to Train GTM Teams

2025-06-10
talk

In this session, discover how Databricks leverages the power of Gen AI, MosaicML, Model Serving and Databricks Apps to revolutionize sales enablement. We’ll showcase how we built an advanced chatbot that equips our go-to-market team with the tools and knowledge needed to excel in customer-facing interactions. This AI-driven solution not only trains our salespeople but also enhances their confidence and effectiveness in demonstrating the transformative potential of Databricks to future customers. Attendees will gain insights into the architecture, development process and practical applications of this innovative approach. The session will conclude with an interactive demo, offering a firsthand look at the chatbot in action. Join us to explore how Databricks is using its own platform to drive sales excellence through cutting-edge AI solutions.

Self-Improving Agents and Agent Evaluation With Arize & Databricks MLflow

2025-06-10
talk

As autonomous agents become increasingly sophisticated and widely deployed, the ability for these agents to evaluate their own performance and continuously self-improve is essential. However, the growing complexity of these agents amplifies potential risks, including exposure to malicious inputs and generation of undesirable outputs. In this talk, we'll explore how to build resilient, self-improving agents. To drive self-improvement effectively, both the agent and the evaluation techniques must simultaneously improve with a continuously iterating feedback loop. Drawing from extensive real-world experiences across numerous productionized use cases, we will demonstrate practical strategies for combining tools from Arize, Databricks MLflow and Mosaic AI to evaluate and improve high-performing agents.
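The "continuously iterating feedback loop" described above can be sketched as a toy in plain Python. Everything here is hypothetical for illustration: the evaluator, the agent's "detail" knob, and the quality threshold stand in for real evaluation metrics (e.g. LLM-as-judge scores) and real agent updates.

```python
# Toy self-improvement loop: an "agent" with a quality knob, an "evaluator"
# that scores outputs, and a feedback step that nudges the agent.

def evaluate(output: str) -> float:
    # Stand-in for an LLM-as-judge or evaluation metric; longer answers
    # score higher in this toy, capped at 1.0.
    return min(1.0, len(output) / 20)

def improve(agent_state: dict, score: float) -> dict:
    # Feedback step: expand the answer until the evaluator is satisfied.
    if score < 0.9:
        agent_state["detail"] += 1
    return agent_state

agent = {"detail": 1}
for _ in range(10):
    output = "fact " * agent["detail"]  # agent produces an answer
    score = evaluate(output)
    if score >= 0.9:
        break
    agent = improve(agent, score)
```

The structural point matches the talk: the agent and the evaluation must iterate together, with each round's score feeding the next round's update.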

Sponsored by: Qlik | Turning Data into Business Impact: How to Build AI-Ready, Trusted Data Products on Databricks

2025-06-10
talk
Sharad Kumar (Qlik)

Explore how to build use case-specific data products designed to power everything from traditional BI dashboards to machine learning and LLM-enabled applications. Gain an understanding of what data products are and why they are essential for delivering AI-ready data that is integrated, timely, high-quality, secure, contextual, and easily consumable. Discover strategies for unlocking business data from source systems to enable analytics and AI use cases, with a deep dive into the three-tiered data product architecture: the Data Product Engineering Plane (where data engineers ingest, integrate, and transform data), the Data Product Management Plane (where teams manage the full lifecycle of data products), and the Data Product Marketplace Plane (where consumers search for and use data products). Discover how a flexible, composable data architecture can support organizations at any stage of their data journey and drive impactful business outcomes.

The Future of Real Time Insights with Databricks and SAP

2025-06-10
talk
Alejandro Saucedo (Zalando SE), Jon Levine (JPL) (Databricks), Olaf Melchior (Zalando SE)

Tired of waiting on SAP data? Join this session to see how Databricks and SAP make it easy to query business-ready data—no ETL. With Databricks SQL, you’ll get instant scale, automatic optimizations, and built-in governance across all your enterprise analytics data. Fast and AI-powered insights from SAP data are finally possible—and this is how.

The Hitchhiker's Guide to Delta Lake Streaming in an Agentic Universe

2025-06-10
talk
Scott Haines (Nike)

As data engineering continues to evolve, the shift from batch-oriented to streaming-first has become standard across the enterprise. The reality is these changes have been taking shape for the past decade — we just now also happen to be standing on the precipice of true disruption through automation, the likes of which we could only dream about before. Yes, AI Agents and LLMs are already a large part of our daily lives, but we (as data engineers) are ultimately on the frontlines ensuring that the future of AI is powered by consistent, just-in-time data — and Delta Lake is critical to help us get there. This session will provide you with best practices learned the hard way by one of the authors of The Delta Lake Definitive Guide, including:
- A guide to writing generic applications as components
- Workflow automation tips and tricks
- Tips and tricks for Delta clustering (liquid, z-order, and classic)
- Future facing: leveraging metadata for agentic pipelines and workflow automation
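The "generic applications as components" practice mentioned above can be sketched in plain Python: compose a pipeline from small stage functions with a uniform shape instead of one monolithic job. The stage names and data are hypothetical; in a real job each stage would read from and write to Delta tables.

```python
# Composable pipeline components: every stage takes rows and returns rows,
# so stages can be reordered, reused, or tested in isolation.
from functools import reduce

def ingest(_):
    # Hypothetical source; a real component would read a stream or table.
    return [{"id": 1, "raw": " A "}, {"id": 2, "raw": "b"}]

def clean(rows):
    return [{**r, "raw": r["raw"].strip().lower()} for r in rows]

def enrich(rows):
    return [{**r, "length": len(r["raw"])} for r in rows]

def run_pipeline(stages, source=None):
    # Because each component has the same signature, composition is trivial.
    return reduce(lambda data, stage: stage(data), stages, source)

result = run_pipeline([ingest, clean, enrich])
```

The payoff of this shape is that `run_pipeline` stays generic while the component list becomes configuration — the same idea scales up to declarative pipeline definitions.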

ThredUp’s Journey with Databricks: Modernizing Our Data Infrastructure

2025-06-10
talk
Aniket Mane (ThredUp Inc.), Chintan Patel (Thredup)

Building an AI-ready data platform requires strong governance, performance optimization, and seamless adoption of new technologies. At ThredUp, our Databricks journey began with a need for better data management and evolved into a full-scale transformation powering analytics, machine learning, and real-time decision-making. In this session, we’ll cover:
- Key inflection points: moving from legacy systems to a modernized Delta Lake foundation
- Unity Catalog’s impact: improving governance, access control, and data discovery
- Best practices for onboarding: ensuring smooth adoption for engineering and analytics teams
- What’s next? Serverless SQL and conversational analytics with Genie
Whether you’re new to Databricks or scaling an existing platform, you’ll gain practical insights on navigating the transition, avoiding pitfalls, and maximizing AI and data intelligence.

Transforming Government With Data and AI: Singapore GovTech's Journey With Databricks

2025-06-10
talk
Sachin Tonk (GovTech)

GovTech is an agency in the Singapore Government focused on tech for good. The GovTech Chief Data Office (CDO) has built the GovTech Data Platform with Databricks at the core. As the government tech agency, we safeguard national-level government and citizen data. A comprehensive data strategy is essential to uplifting data maturity. GovTech has adopted the service model approach where data services are offered to stakeholders based on their data maturity. Their maturity is uplifted through partnership, readying them for more advanced data analytics. CDO offers a plethora of data assets in a “data restaurant” ranging from raw data to data products, all delivered via Databricks and enabled through fine-grained access control, underpinned by data management best practices such as data quality, security and governance. Within our first year on Databricks, CDO was able to save 8,000 man-hours, democratize data across 50% of the agency and achieve six-figure savings through BI consolidation.

Ursa: Augment Your Lakehouse With Kafka-Compatible Data Streaming Capabilities

2025-06-10
talk
Gaurav Saxena (Automotive Industry), Sijie Guo (StreamNative)

As data architectures evolve to meet the demands of real-time GenAI applications, organizations increasingly need systems that unify streaming and batch processing while maintaining compatibility with existing tools. The Ursa Engine offers a Kafka-API-compatible data streaming engine built on Lakehouse (Iceberg and Delta Lake). Designed to seamlessly integrate with data lakehouse architectures, Ursa extends your lakehouse capabilities by enabling streaming ingestion, transformation and processing — using a Kafka-compatible interface. In this session, we will explore how Ursa Engine augments your existing lakehouses with Kafka-compatible capabilities. Attendees will gain insights into Ursa Engine architecture and real-world use cases of Ursa Engine. Whether you're modernizing legacy systems or building cutting-edge AI-driven applications, discover how Ursa can help you unlock the full potential of your data.

Validating Clinical Trial Platforms on Databricks

2025-06-10
talk
Kamesh Raghavendra (Purgo AI), Neha Pande (Databricks)

Clinical Trial Data is undergoing a renaissance with new insights and data sources being added daily. The speed of new innovations and modalities that are found within trials poses an existential dilemma for 21CFR Part 11 compliance. In these validated environments, new components and methods need to be tested for reproducibility and restricted data access. In classical systems, this validation process would often have taken three months or more due to the manual validation process via validation scripts like Installation Qualification (IQ) and Operational Qualification (OQ) scripts. In conjunction with Databricks, Purgo AI has developed a new technology leveraging generative AI to automate the execution of IQ and OQ scripts and has drastically reduced the amount of time for validating Databricks from three months to less than a day. This drastic speedup of validation will enable the continuous flow of new ideas and implementations for clinical trials.

Advanced Machine Learning Operations

2025-06-09
talk

The course is designed to cover advanced concepts and workflows in machine learning operations. It starts by introducing participants to continuous integration (CI) and continuous delivery (CD) workflows within machine learning projects, guiding them through the deployment of a sample CI/CD workflow using Databricks in the first section. Moving on to the second part, participants delve into data and model testing, where they actively create tests and automate CI/CD workflows. Finally, the course concludes with an exploration of model monitoring concepts, demonstrating the use of Lakehouse Monitoring to oversee machine learning models in production settings. Pre-requisites: Familiarity with Databricks workspace and notebooks; knowledge of machine learning model development and deployment with MLflow (e.g. intermediate-level knowledge of traditional ML concepts, development with CI/CD, the use of Python and Git for ML projects with popular platforms like GitHub) Labs: Yes Certification Path: Databricks Certified Machine Learning Professional
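The data and model testing the course describes can be sketched as the kind of checks a CI/CD workflow might run before promoting a model. The column names and the accuracy threshold below are hypothetical, for illustration only.

```python
# Sketch of CI-style gates: a data test (schema / null check) and a model
# test (metric threshold). A real pipeline would run these against Delta
# tables and a registered model rather than in-memory lists.

def data_test(rows: list[dict], required_cols: list[str]) -> bool:
    # Fail the build if any required column is missing or null.
    for row in rows:
        assert all(col in row and row[col] is not None for col in required_cols)
    return True

def model_test(y_true: list[int], y_pred: list[int], min_accuracy: float = 0.8) -> bool:
    # Gate promotion on a minimum accuracy; threshold is hypothetical.
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy >= min_accuracy

rows = [{"feature": 1.0, "label": 0}, {"feature": 2.5, "label": 1}]
data_ok = data_test(rows, ["feature", "label"])
model_ok = model_test([0, 1, 1, 0, 1], [0, 1, 1, 0, 0])  # 4/5 correct
```

In an automated workflow, both functions would run as pipeline steps, and a failed assertion or a `False` return would block the deployment.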

AI/BI for Data Analysts

2025-06-09
talk

In this course, you’ll learn how to use the features Databricks provides for business intelligence needs: AI/BI Dashboards and AI/BI Genie. As a Databricks Data Analyst, you will be tasked with creating AI/BI Dashboards and AI/BI Genie Spaces within the platform, managing the access to these assets by stakeholders and necessary parties, and maintaining these assets as they are edited, refreshed, or decommissioned over the course of their lifespan. This course intends to instruct participants on how to design dashboards for business insights, share those with collaborators and stakeholders, and maintain those assets within the platform. Participants will also learn how to utilize AI/BI Genie Spaces to support self-service analytics through the creation and maintenance of these environments powered by the Databricks Data Intelligence Engine. Pre-requisites: The content was developed for participants with these skills/knowledge/abilities: A basic understanding of SQL for querying existing data tables in Databricks. Prior experience or basic familiarity with the Databricks Workspace UI. A basic understanding of the purpose and use of statistical analysis results. Familiarity with the concepts around dashboards used for business intelligence. Labs: Yes

Build Data Pipelines with Lakeflow Declarative Pipelines

2025-06-09
talk

In this course, you’ll learn how to define and schedule data pipelines that incrementally ingest and process data through multiple tables on the Data Intelligence Platform, using Lakeflow Declarative Pipelines in Spark SQL and Python. We’ll cover topics like how to get started with Lakeflow Declarative Pipelines, how Lakeflow Declarative Pipelines tracks data dependencies in data pipelines, how to configure and run data pipelines using the Lakeflow Declarative Pipelines UI, how to use Python or Spark SQL to define data pipelines that ingest and process data through multiple tables on the Data Intelligence Platform, using Auto Loader and Lakeflow Declarative Pipelines, how to use APPLY CHANGES INTO syntax to process Change Data Capture feeds, and how to review event logs and data artifacts created by pipelines and troubleshoot syntax. By streamlining and automating reliable data ingestion and transformation workflows, this course equips you with the foundational data engineering skills needed to help kickstart AI use cases. Whether you're preparing high-quality training data or enabling real-time AI-driven insights, this course is a key step in advancing your AI journey. Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks), cloud computing concepts (virtual machines, object storage, etc.), production experience working with data warehouses and data lakes, intermediate experience with basic SQL concepts (select, filter, groupby, join, etc.), beginner programming experience with Python (syntax, conditions, loops, functions), beginner programming experience with the Spark DataFrame API (configure DataFrameReader and DataFrameWriter to read and write data, express query transformations using DataFrame methods and Column expressions, etc.) Labs: No Certification Path: Databricks Certified Data Engineer Associate
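The Change Data Capture semantics behind APPLY CHANGES INTO can be sketched in plain Python: apply a feed of keyed change events (upserts and deletes) to a target table, honoring a sequence column so out-of-order events don't clobber newer state. This illustrates the semantics only; it is not Lakeflow code, and the field names are hypothetical.

```python
# Toy CDC apply: events carry an operation, a key, and a sequence number.
def apply_changes(target: dict, changes: list[dict]) -> dict:
    # Process events in sequence order so late-arriving old events
    # cannot overwrite newer data.
    for event in sorted(changes, key=lambda e: e["seq"]):
        key = event["id"]
        if event["op"] == "delete":
            target.pop(key, None)
        else:  # upsert
            target[key] = {"name": event["name"], "seq": event["seq"]}
    return target

table: dict = {}
feed = [
    {"op": "upsert", "id": 1, "name": "ada", "seq": 1},
    {"op": "upsert", "id": 1, "name": "ada lovelace", "seq": 3},
    {"op": "upsert", "id": 2, "name": "alan", "seq": 2},
    {"op": "delete", "id": 2, "name": None, "seq": 4},
]
table = apply_changes(table, feed)
```

After the feed is applied, key 1 holds its latest value and key 2 is gone — the same end state APPLY CHANGES INTO produces for an SCD type 1 target.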

Gen AI Application Development

2025-06-09
talk

This course provides participants with information and practical experience in building advanced LLM (Large Language Model) applications using multi-stage reasoning LLM chains and agents. In the initial section, participants will learn how to decompose a problem into its components and select the most suitable model for each step to enhance business use cases. Following this, participants will construct a multi-stage reasoning chain utilizing LangChain and HuggingFace transformers. Finally, participants will be introduced to agents and will design an autonomous agent using generative models on Databricks. Pre-requisites: Solid understanding of natural language processing (NLP) concepts, familiarity with prompt engineering and prompt engineering best practices, experience with the Databricks Data Intelligence Platform, experience with retrieval-augmented generation (RAG) techniques including data preparation, building RAG architectures, and concepts like embeddings, vectors, and vector databases Labs: Yes Certification Path: Databricks Certified Generative AI Engineer Associate

Machine Learning Model Development

2025-06-09
talk

In this course, you’ll learn how to develop traditional machine learning models on Databricks. We’ll cover topics like using popular ML libraries, executing common tasks efficiently with AutoML and MLflow, harnessing Databricks' capabilities to track model training, leveraging feature stores for model development, and implementing hyperparameter tuning. Additionally, the course covers AutoML for rapid and low-code model training, ensuring that participants gain practical, real-world skills for streamlined and effective machine learning model development in the Databricks environment. Pre-requisites: Familiarity with Databricks workspace and notebooks, familiarity with Delta Lake and Lakehouse, intermediate level knowledge of Python (e.g. common Python libraries for DS/ML like Scikit-Learn, fundamental ML algorithms like regression and classification, model evaluation with common metrics) Labs: Yes Certification Path: Databricks Certified Machine Learning Associate
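The hyperparameter tuning the course covers can be shown in miniature as a grid search over a toy objective. On Databricks this would typically use a tuning library with MLflow tracking; the objective function and parameter grid here are hypothetical.

```python
# Grid search in miniature: enumerate all parameter combinations and keep
# the one that minimizes a (pretend) validation error.
import itertools

def objective(params: dict) -> float:
    # Hypothetical validation error, minimized at depth=3, lr=0.1.
    return abs(params["depth"] - 3) + abs(params["lr"] - 0.1)

grid = {"depth": [1, 3, 5], "lr": [0.01, 0.1, 1.0]}
candidates = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
best = min(candidates, key=objective)
```

Real tuning replaces the exhaustive product with smarter search (random, Bayesian) and replaces the toy objective with actual model training and evaluation.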

AI/BI for Self-Service Analytics

2025-06-09
talk

In this course, you will learn how to self-serve business insights from your company’s Databricks Data Intelligence Platform using AI/BI. After a tour of the fundamental components of the platform, you’ll learn how to interact with pre-created AI/BI Dashboards to explore your company’s data through existing charts and visualizations. You’ll also learn how to use AI/BI Genie to go beyond dashboards by asking follow-up questions in natural language to self-serve new insights, create visualizations, and share them with your colleagues. Pre-requisites: A working understanding of your organization’s business and key performance indicators. Labs: No Certification Path: N/A

Data Ingestion with Lakeflow Connect

2025-06-09
talk

In this course, you’ll learn how to perform efficient data ingestion with Lakeflow Connect and manage that data. Topics include ingestion with built-in connectors for SaaS applications, databases and file sources, as well as ingestion from cloud object storage, and batch and streaming ingestion. We'll cover the new connector components, setting up the pipeline, validating the source and mapping to the destination for each type of connector. We'll also cover how to ingest data with batch to streaming ingestion into Delta tables, using the UI with Auto Loader, automating ETL with Lakeflow Declarative Pipelines or using the API. This will prepare you to deliver the high-quality, timely data required for AI-driven applications by enabling scalable, reliable, and real-time data ingestion pipelines. Whether you're supporting ML model training or powering real-time AI insights, these ingestion workflows form a critical foundation for successful AI implementation. Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks), cloud computing concepts (virtual machines, object storage, etc.), production experience working with data warehouses and data lakes, intermediate experience with basic SQL concepts (select, filter, groupby, join, etc.), beginner programming experience with Python (syntax, conditions, loops, functions), beginner programming experience with the Spark DataFrame API (configure DataFrameReader and DataFrameWriter to read and write data, express query transformations using DataFrame methods and Column expressions, etc.) Labs: No Certification Path: Databricks Certified Data Engineer Associate
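The incremental-ingestion idea behind Auto Loader can be sketched without Spark: track which files have already been processed so each batch picks up only new arrivals. The file names below are hypothetical, and real Auto Loader persists this state in a checkpoint rather than an in-memory set.

```python
# Toy incremental ingestion: only files not seen before are processed.
def ingest_new_files(available: list[str], processed: set[str]) -> list[str]:
    new_files = [f for f in sorted(available) if f not in processed]
    processed.update(new_files)  # remember what we've handled
    return new_files

seen: set = set()
first_batch = ingest_new_files(["a.json", "b.json"], seen)
# Later, one new file lands in the source directory:
second_batch = ingest_new_files(["a.json", "b.json", "c.json"], seen)
```

This is why re-running the pipeline is safe: already-ingested files are skipped, and only the delta is loaded into the target table.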

Data Preparation for Machine Learning

2025-06-09
talk

In this course, you’ll learn the fundamentals of preparing data for machine learning using Databricks. We’ll cover topics like exploring, cleaning, and organizing data tailored for traditional machine learning applications. We’ll also cover data visualization, feature engineering, and optimal feature storage strategies. By building a strong foundation in data preparation, this course equips you with the essential skills to create high-quality datasets that can power accurate and reliable machine learning and AI models. Whether you're developing predictive models or enabling downstream AI applications, these capabilities are critical for delivering impactful, data-driven solutions. Pre-requisites: Familiarity with the Databricks workspace, notebooks, and Unity Catalog; intermediate-level knowledge of Python (scikit-learn, Matplotlib), Pandas, and PySpark; and familiarity with the concepts of exploratory data analysis, feature engineering, standardization, and imputation methods. Labs: Yes Certification Path: Databricks Certified Machine Learning Associate
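Two of the data-preparation staples the course names, imputation and standardization, look like this in miniature. This stdlib-only sketch mirrors what scikit-learn's imputer and scaler do; the sample values are made up.

```python
# Mean imputation then standardization (z-scoring), in plain Python.
import statistics

def impute_mean(values: list) -> list:
    # Replace missing values (None) with the mean of the observed values.
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def standardize(values: list) -> list:
    # Rescale to zero mean and unit (population) standard deviation.
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

filled = impute_mean([1.0, None, 3.0])
scaled = standardize(filled)
```

Order matters here: imputing before standardizing keeps the fill value on the same scale as the raw data, which is the usual pipeline ordering.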

Gen AI Solution Development

2025-06-09
talk

This course is designed to introduce participants to contextual GenAI (generative artificial intelligence) solutions using the retrieval-augmented generation (RAG) method. Firstly, participants will be introduced to the RAG architecture and the significance of contextual information using Mosaic AI Playground. Next, the course will demonstrate how to prepare data for GenAI solutions and connect this process with building an RAG architecture. Finally, participants will explore concepts related to context embedding, vectors, vector databases, and the utilization of the Mosaic AI Vector Search product. Pre-requisites: Familiarity with embeddings, prompt engineering best practices, and experience with the Databricks Data Intelligence Platform Labs: Yes Certification Path: Databricks Certified Generative AI Engineer Associate
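The retrieval step at the heart of RAG can be shown in miniature: embed a query, rank documents by cosine similarity, and hand the best match to the LLM as context. The two-dimensional "embeddings" below are hand-made toys; a real system uses an embedding model and a vector database such as Mosaic AI Vector Search.

```python
# Toy vector retrieval: pick the document whose embedding is most similar
# to the query embedding by cosine similarity.
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical document embeddings (2-D for readability).
docs = {
    "pricing_faq": [0.9, 0.1],
    "support_guide": [0.1, 0.9],
}
query_embedding = [0.8, 0.2]  # e.g. embedding of "how much does it cost?"

best_doc = max(docs, key=lambda name: cosine(query_embedding, docs[name]))
```

The retrieved document is then injected into the prompt, which is what gives a RAG application its grounding in enterprise data.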

Machine Learning at Scale

2025-06-09
talk

The course intends to equip professional-level machine learning practitioners with knowledge and hands-on experience in utilizing Apache Spark™ for machine learning purposes, including model fine-tuning. Additionally, the course covers using the Pandas library for scalable machine learning tasks. The initial section of the course focuses on comprehending the fundamentals of Apache Spark™ along with its machine learning capabilities. Subsequently, the second section delves into fine-tuning models using the hyperopt library. The final segment involves learning the implementation of the Pandas API within Apache Spark™, encompassing guidance on Pandas UDFs (User-Defined Functions) and the Functions API for model inference. Pre-requisites: Familiarity with Databricks workspace and notebooks; knowledge of machine learning model development and deployment with MLflow (e.g. basic understanding of DS/ML concepts, common model metrics and python libraries as well as a basic understanding of scaling workloads with Spark) Labs: Yes Certification Path: Databricks Certified Machine Learning Professional
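The batching idea behind Pandas UDFs can be sketched without Spark: instead of invoking a model once per row, the engine hands the function whole batches for vectorized inference. Plain lists stand in for pandas Series here, and the "model" is a hypothetical placeholder.

```python
# Toy Pandas-UDF-style batching: score values batch-by-batch rather than
# row-by-row, amortizing per-call overhead across each batch.

def model_predict_batch(batch: list) -> list:
    # Vectorized stand-in: scores an entire batch in one call.
    return [x * 2.0 for x in batch]

def apply_in_batches(values: list, batch_size: int) -> list:
    out = []
    for start in range(0, len(values), batch_size):
        out.extend(model_predict_batch(values[start:start + batch_size]))
    return out

predictions = apply_in_batches([1.0, 2.0, 3.0, 4.0, 5.0], batch_size=2)
```

In Spark, the same pattern lets an expensive model call run once per Arrow batch per partition instead of once per row, which is where the scalability gain comes from.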

AI Agents Hackathon

2025-06-09
talk

This in-person, full-day hackathon focuses on the development of innovative AI Agents using the Databricks Data Intelligence Platform. Collaborating in teams of up to four, participants will utilize Databricks' specialized agent authoring and evaluation tools to build, test, and refine intelligent agent systems. Diverse datasets from the Databricks Marketplace are available to enhance agent capabilities. The objective is to produce a compelling proof-of-concept agent showcasing creativity, intelligent data utilization, and effective tool-calling in a novel and useful manner. This event provides a platform for demonstrating technical quality with Databricks tools, creativity in agent design or application, and clarity of purpose. The hackathon promotes hands-on experience with cutting-edge agent development tools and concludes with short team demonstrations of proofs of concept created during the event. Three finalist teams will be selected, and the winners will be announced at the end of the Hackathon. Cash prizes will be awarded to the top teams, with $10,000 for first place, $5,000 for second place, and $2,500 for third place. Complete details regarding eligibility and the rules governing this hackathon are available in the official rules, available at http://bit.ly/44HRyxz. In the event of any discrepancies between the official rules and other hackathon materials, the official rules govern. Agenda: 7:30am Registration/Breakfast 8:15am Opening Ceremony 8:30am Hacking Begins 12:00pm–1:30pm Lunch 2:30pm Hacking Ends 2:30pm–3:45pm Expo/Judging 3:45pm Closing Ceremony/Winners Announced 4:00pm Hackathon Ends

Get started with Data Warehousing

talk

This course provides a comprehensive overview of Databricks’ modern approach to data warehousing, highlighting how a data lakehouse architecture combines the strengths of traditional data warehouses with the flexibility and scalability of the cloud. You’ll learn about the AI-driven features that enhance data transformation and analysis on the Databricks Data Intelligence Platform. Designed for data warehousing practitioners, this course provides you with the foundational information needed to begin building and managing high-performance, AI-powered data warehouses on Databricks. The course is designed for those starting out in data warehousing, those who would like to execute data warehousing workloads on Databricks, and practitioners who are familiar with traditional data warehousing techniques and concepts and are looking to expand their understanding of how those workloads are executed on Databricks.