
Event

Data + AI Summit 2025

2025-06-09 – 2025-06-13 Databricks Summit

Activities tracked

105

Filtering by: Data Lakehouse

Sessions & talks

Showing 76–100 of 105 · Newest first

Swimming at Our Own Lakehouse: How Databricks Uses Databricks

2025-06-10 Watch
talk
Alan Jackoway (Databricks) , Bruce Wong (Databricks)

This session is repeated. Peek behind the curtain to learn how Databricks processes hundreds of petabytes of data across every region and cloud where we operate, and how Databricks leverages data and AI to scale and optimize every aspect of the company, from facilities and legal to sales and marketing and, of course, product research and development. This session is a high-level tour inside Databricks to see how data and AI enable us to be a better company. We will walk through the architecture behind internal use cases like business analytics and SIEM, as well as customer-facing features like system tables and Assistant. We will cover how we produce and operate our data flows and how we maintain security and privacy while operating a large multi-cloud, multi-region environment.
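
For readers curious what querying those customer-facing system tables looks like, here is a minimal sketch against the documented system.billing.usage schema; it assumes a Databricks notebook where spark is available, and the seven-day window is purely illustrative.

    # A hedged sketch, not the speakers' actual queries: summarize a week of
    # DBU consumption from the documented system.billing.usage table.
    recent_usage = spark.sql("""
        SELECT workspace_id, sku_name, SUM(usage_quantity) AS dbus
        FROM system.billing.usage
        WHERE usage_date >= current_date() - INTERVAL 7 DAYS
        GROUP BY workspace_id, sku_name
        ORDER BY dbus DESC
    """)
    recent_usage.show()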

Trust You Can Measure: Data Quality Standards in The Lakehouse

2025-06-10 Watch
talk
Amit Pahwa (Databricks) , Sergiy Kanyshchev (Databricks)

Do you trust your data? If you’ve ever struggled to figure out which datasets are reliable, well-governed, or safe to use, you’re not alone. At Databricks, our own internal lakehouse faced the same challenge—hundreds of thousands of tables, but no easy way to tell which data met quality standards. In this talk, the Databricks Data Platform team shares how we tackled this problem by building the Data Governance Score—a way to systematically measure and surface trust signals across the entire lakehouse. You’ll learn how we leverage Unity Catalog, governed tags, and enforcement to drive better data decisions at scale. Whether you're a data engineer, platform owner, or business leader, you’ll leave with practical ideas on how to raise the bar for data quality and trust in your own data ecosystem.
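
As a rough illustration of tag-driven trust signals (the table, schema, and tag names here are hypothetical stand-ins, not the team's actual Data Governance Score schema), Unity Catalog lets you attach governed tags with SQL and read them back from information_schema:

    # Hedged sketch: attach a quality tag to a table (names are illustrative).
    spark.sql("""
        ALTER TABLE main.sales.orders
        SET TAGS ('quality_tier' = 'gold', 'owner' = 'data-platform')
    """)

    # Surface trust signals by reading the tags back out.
    spark.sql("""
        SELECT table_name, tag_name, tag_value
        FROM main.information_schema.table_tags
        WHERE tag_name = 'quality_tier'
    """).show()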

The JLL Training and Upskill Program for Our Warehouse Migration to Databricks

2025-06-10 Watch
lightning_talk

Databricks Odyssey is JLL's bespoke training program designed to upskill and prepare data professionals for a new world of the data lakehouse. Based on the concepts of learn, practice and certify, participants earn points, moving through five levels by completing activities with business applications of key Databricks features. Databricks Odyssey facilitates cloud data warehousing migration by providing best-practice frameworks, ensuring efficient use of pay-per-compute platforms. JLL/T Insights and Data fosters a data culture through learning programs that develop in-house talent and create career pathways.

Databricks Odyssey offers:
- JLL-specific hands-on learning
- A gamified 'level up' approach
- Practical, applicable skills

Benefits include:
- Improved platform efficiency
- Enhanced data accuracy and client insights
- Ongoing professional development
- Potential cost savings through better utilization

Accelerating Model Development and Fine-Tuning on Databricks with TwelveLabs

2025-06-10 Watch
talk
Wenwen Gao (NVIDIA) , Aiden Lee (Twelve Labs, Inc)

Scaling large language models (LLMs) and multimodal architectures requires efficient data management and computational power. NVIDIA NeMo Framework Megatron-LM on Databricks is an open source solution that integrates GPU acceleration and advanced parallelism with Databricks Delta Lakehouse, streamlining workflows for pre-training and fine-tuning models at scale. This session highlights context parallelism, a unique NeMo capability for parallelizing over sequence lengths, making it ideal for video datasets with large embeddings. Through the case study of TwelveLabs’ Pegasus-1 model, learn how NeMo empowers scalable multimodal AI development, from text to video processing, setting a new standard for LLM workflows.

Petabyte-Scale On-Chain Insights: Real-Time Intelligence for the Next-Gen Financial Backbone

2025-06-10 Watch
lightning_talk
Leo Liang (CipherOwl Inc)

We’ll explore how CipherOwl Inc. constructed a near real-time, multi-chain data lakehouse to power anti-money laundering (AML) monitoring at petabyte scale. We will walk through the end-to-end architecture, which integrates cutting-edge open-source technologies and AI-driven analytics to handle massive on-chain data volumes seamlessly; off-chain intelligence complements this to meet rigorous AML requirements. At the core of our solution is ChainStorage, an open-source project started by Coinbase that provides robust blockchain data ingestion and block-level serving. We enhanced it with Apache Spark™ coupled with Apache Arrow™ for high-throughput processing and efficient data serialization, backed by Delta Lake and Kafka. For the serving layer, we employ StarRocks to deliver lightning-fast SQL analytics over vast datasets. Finally, our system incorporates machine learning and AI agents for continuous data curation and near real-time insights, which are crucial for tackling on-chain AML challenges.

AI and Genie: Analyzing Healthcare Improvement Opportunities

2025-06-10 Watch
talk
Jay Sharma (Premier Inc) , Tim Riddle (Premier Inc)

This session is repeated. Improving healthcare impacts us all. We highlight how Premier Inc. took risk-adjusted patient data from more than 1,300 member hospitals across America and applied a natural language interface using AI/BI Genie, allowing our users to discover new insights. The stakes are high: each new insight surfaced represents potential care improvements and lives positively impacted. Using Genie and our AI-ready data in Unity Catalog, our team was able to stand up a Genie instance in three short days, bypassing the costs and time of custom modeling and application development. Additionally, Genie allowed our internal teams to generate complex SQL as much as 10 times faster than writing it by hand. As Genie and lakehouse apps continue to advance rapidly, we are excited to introduce Genie to as many as 20,000 users across hundreds of hospitals. This will support our members’ ongoing mission to enhance the care they provide to the communities they serve.

Apache Iceberg with Unity Catalog at HelloFresh

2025-06-10 Watch
talk
Max Schultze (HelloFresh) , Adam Komisarek (HelloFresh)

Table formats like Delta Lake and Iceberg have been game changers for pushing lakehouse architecture into modern enterprises. The acquisition of Tabular added Iceberg to the Databricks ecosystem, an open format that was already well supported by processing engines across the industry. At HelloFresh we are building a lakehouse architecture that integrates many touchpoints and technologies across the organization. As such, we chose Iceberg as the table format to bridge the gaps in our decentrally managed tech landscape. We are leveraging Unity Catalog as our Iceberg REST catalog of choice for storing metadata and managing tables. In this talk we will outline our architectural setup between Databricks, Spark, Flink and Snowflake, and explain the native Unity Iceberg REST catalog as well as catalog federation towards connected engines. We will highlight the impact on our business and discuss the advantages and lessons learned from our early-adopter experience.
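
For a sense of what the Unity Catalog Iceberg REST endpoint looks like from an external Spark engine, here is a hedged configuration sketch; the workspace host, token, and table names are placeholders, and the exact options may differ from HelloFresh's setup.

    from pyspark.sql import SparkSession

    # Hedged sketch: point an external Spark's Iceberg catalog at the Unity
    # Catalog Iceberg REST API. Requires the Iceberg Spark runtime jar on the
    # classpath; host, token, and table names below are hypothetical.
    spark = (
        SparkSession.builder.appName("uc-iceberg-rest")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.uc", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.uc.type", "rest")
        .config("spark.sql.catalog.uc.uri",
                "https://<workspace-host>/api/2.1/unity-catalog/iceberg")
        .config("spark.sql.catalog.uc.token", "<personal-access-token>")
        .config("spark.sql.catalog.uc.warehouse", "main")
        .getOrCreate()
    )

    spark.sql("SELECT * FROM uc.analytics.recipes LIMIT 10").show()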

Crafting Business Brilliance: Leveraging Databricks SQL for Next-Gen Applications

2025-06-10 Watch
talk
Mohammad Shalchi (Haleon) , Wasim Ahmad (Databricks)

At Haleon, we've leveraged Databricks APIs and serverless compute to develop customer-facing applications for our business. This innovative solution enables us to efficiently deliver SAP invoice and order management data through front-end applications developed and served via our API Gateway. The Databricks lakehouse architecture has been instrumental in eliminating the friction associated with directly accessing SAP data from operational systems, while enhancing our performance capabilities. Our system achieved response times of less than 3 seconds per API call, with ongoing efforts to optimise this performance. This architecture not only streamlines our data and application ecosystem but also paves the way for integrating GenAI capabilities with robust governance measures into our future infrastructure. The implementation of this solution has yielded significant benefits, including a 15% reduction in customer service costs and a 28% increase in productivity for our customer support team.
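
One plausible shape for serving lakehouse data behind an API gateway, as the session describes, is the Databricks SQL Statement Execution API. The host, token, warehouse ID, and table below are hypothetical placeholders, not Haleon's actual implementation.

    import requests

    # Hedged sketch: execute a parameterized query on a SQL warehouse via REST.
    resp = requests.post(
        "https://<workspace-host>/api/2.0/sql/statements",
        headers={"Authorization": "Bearer <token>"},
        json={
            "warehouse_id": "<warehouse-id>",
            "statement": "SELECT invoice_id, status FROM gold.sap_invoices "
                         "WHERE customer_id = :cid",
            "parameters": [{"name": "cid", "value": "C-1042"}],
            "wait_timeout": "10s",
        },
        timeout=30,
    )
    print(resp.json()["status"]["state"])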

Reimagining Data Governance and Access at Atlassian

2025-06-10 Watch
lightning_talk
Gerald Nakhle (Atlassian)

Atlassian is rebuilding its central lakehouse from the ground up to deliver a more secure, flexible and scalable data environment. In this session, we’ll share how we leverage Unity Catalog for fine-grained governance and supplement it with Immuta for dynamic policy management, enabling row and column level security at scale. By shifting away from broad, monolithic access controls toward a modern, agile solution, we’re empowering teams to securely collaborate on sensitive data without sacrificing performance or usability. Join us for an inside look at our end-to-end policy architecture, from how data owners declare metadata and author policies to the seamless application of access rules across the platform. We’ll also discuss lessons learned on streamlining data governance, ensuring compliance, and improving user adoption. Whether you’re a data architect, engineer or leader, walk away with actionable strategies to simplify and strengthen your own governance and access practices.
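
To make "row and column level security at scale" concrete, here is a minimal Unity Catalog sketch; the function, group, and table names are hypothetical, and Immuta's dynamic policy layer on top is not shown.

    # Hedged sketch: a UC row filter restricting rows by group membership.
    spark.sql("""
        CREATE OR REPLACE FUNCTION main.gov.region_filter(region STRING)
        RETURN IS_ACCOUNT_GROUP_MEMBER('security-admins') OR region = 'US'
    """)

    # Attach the filter so every query on the table is policy-checked per row.
    spark.sql("""
        ALTER TABLE main.hr.employees
        SET ROW FILTER main.gov.region_filter ON (region)
    """)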

Revolutionizing Cybersecurity: SCB's Journey to a Self-Managed SIEM

2025-06-10 Watch
talk
Lavy Stokhamer (Standard Chartered Bank)

Join us to explore how Standard Chartered Bank's (SCB) groundbreaking strategy is reshaping the future of the cybersecurity landscape by replacing traditional SIEM with a cutting-edge Databricks solution, achieving remarkable business outcomes:
- 80% reduction in time to detect incidents
- 92% faster threat investigation
- 35% cost reduction
- 60% better detection accuracy
- Significant enhancements in threat detection and response metrics
- Substantial increase in ML-driven use cases

This session unveils SCB's journey to a distributed, multi-cloud lakehouse architecture that unlocks unprecedented performance and commercial optimization. Explore why a unified data and AI platform is becoming the cornerstone of next-generation, self-managed SIEM solutions for forward-thinking organizations in this era of AI-powered banking transformation.

Sponsored by: Accenture & Avanade | Enterprise Data Journey for The Standard Insurance Leveraging Databricks on Azure and AI Innovation

2025-06-10 Watch
lightning_talk
Sumanta Paul (Accenture)

Modern insurers require agile, integrated data systems to harness AI. This framework for a global insurer uses Azure Databricks to unify legacy systems into a governed lakehouse medallion architecture (bronze/silver/gold layers), eliminating silos and enabling real-time analytics. The solution employs:
- Medallion architecture for incremental data quality improvement
- Unity Catalog for centralized governance, row/column security, and audit compliance
- Azure encryption/confidential computing for data mesh security
- Automated ingestion/semantic/DevOps pipelines for scalability

By combining Databricks' distributed infrastructure with Azure's security, the insurer achieves regulatory compliance while enabling AI-driven innovation (e.g., underwriting, claims). The framework establishes a future-proof foundation for mergers and acquisitions (M&A) and cross-functional data products, balancing governance with agility.

Sponsored by: Domo | Behind the Brand: How Sol de Janeiro Powers Amazon Ops with Databricks + DOMO

2025-06-10 Watch
talk
Caio Pimenta (Sol de Janeiro)

How does one of the world’s fastest-growing beauty brands stay ahead of Amazon’s complexity and scale retail with precision? At Sol de Janeiro, we built a real-time Amazon Operations Hub—powered by Databricks and DOMO—that drives decisions across inventory, profitability, and marketing ROI. See how the Databricks Lakehouse and DOMO dashboards work together to simplify workflows, surface actionable insights, and enable smarter decisions across the business—from frontline operators to the executive suite. In this session, you’ll get a behind-the-scenes look at how we unified trillions of rows from NetSuite, Amazon, Shopify, and carrier systems into a single source of truth. We’ll show how this hub streamlined cross-functional workflows, eliminated manual reporting, and laid the foundation for AI-powered forecasting and automation.

Data Modeling 101 for Data Lakehouse Demystified

2025-06-10 Watch
talk

This session is repeated. In today’s data-driven world, the Data Lakehouse has emerged as a powerful architectural paradigm that unifies the flexibility of data lakes with the reliability and structure of traditional data warehouses. However, organizations must adopt the right data modeling techniques to unlock its full potential and ensure scalability, maintainability and efficiency. This session is designed for beginners looking to demystify the complexities of data modeling for the lakehouse and make informed design decisions. We’ll break down Medallion Architecture, explore key data modeling techniques and walk through the maturity stages of a successful data platform — transitioning from raw, unstructured data to well-organized, query-efficient models.
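
As a minimal sketch of the medallion flow the session breaks down (paths, table names, and column names here are invented for illustration, and a notebook-provided spark session is assumed):

    # Bronze: land raw files as-is, schema-on-read.
    bronze = spark.read.json("/landing/events/")
    bronze.write.mode("append").saveAsTable("bronze.events")

    # Silver: deduplicate and enforce basic quality rules.
    silver = (
        spark.table("bronze.events")
        .dropDuplicates(["event_id"])
        .filter("event_ts IS NOT NULL")
    )
    silver.write.mode("overwrite").saveAsTable("silver.events")

    # Gold: aggregate into a query-efficient, consumer-facing model.
    gold = silver.groupBy("event_date").count()
    gold.write.mode("overwrite").saveAsTable("gold.daily_event_counts")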

Delta Lake and the Data Mesh

2025-06-10 Watch
talk
KyJah Keys (Nextdata)

Delta Lake has proven to be an excellent storage format. Coupled with the Databricks platform, the storage format has shined as a component of a distributed system on the lakehouse. The pairing of Delta and Spark provides an excellent platform, but users often struggle to perform comparable work outside of the Spark ecosystem. Tools such as delta-rs, Polars and DuckDB have brought access to users outside of Spark, but they are only building blocks of a larger system. In this 40-minute talk we will demonstrate how users can use data products on the Nextdata OS data mesh to interact with the Databricks platform to drive Delta Lake workflows. Additionally, we will show how users can build autonomous data products that interact with their Delta tables both inside and outside of the lakehouse platform. Attendees will learn how to integrate the Nextdata OS data mesh with the Databricks platform as both an external and integral component.
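
A small taste of those non-Spark building blocks the talk mentions; the table path is hypothetical, and each engine reads the same Delta table directly.

    import duckdb
    import polars as pl
    from deltalake import DeltaTable

    path = "/data/delta/orders"  # hypothetical Delta table location

    dt = DeltaTable(path)             # delta-rs: transaction log + metadata
    print(dt.version(), dt.schema())

    df = pl.read_delta(path)          # Polars: eager read into a DataFrame
    print(df.head())

    con = duckdb.connect()            # DuckDB: reads Delta via its extension
    con.execute("INSTALL delta")
    con.execute("LOAD delta")
    con.sql(f"SELECT count(*) FROM delta_scan('{path}')").show()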

Migrating Legacy SAS Code to Databricks Lakehouse: What We Learned Along the Way

2025-06-10 Watch
talk
Dmitriy Alergant (Tier One Analytics Inc.) , Matt Adams (PacificSource Health Plans)

At PacificSource Health Plans, a US health insurance company, we are on a successful multi-year journey to migrate our entire data and analytics ecosystem to a Databricks enterprise data warehouse (lakehouse). A particular obstacle on this journey was a reporting data mart that relied on copious amounts of legacy SAS code applying sophisticated business logic transformations for membership, claims, premiums and reserves. This core data mart was driving many of our critical reports and analytics. In this session we will share the unique and somewhat unexpected challenges and complexities we encountered in migrating this legacy SAS code: how our partner (T1A) leveraged automation technology (Alchemist) and some unique approaches to reverse engineer (analyze), instrument, translate, migrate, validate and reconcile these jobs, and what lessons we learned and carried forward from this migration effort.

Simplifying Training and GenAI Finetuning Using Serverless GPU Compute

2025-06-10 Watch
talk
Tejas Sundaresan (Databricks)

The last year has seen rapid progress in open-source GenAI models and frameworks. This talk covers best practices for custom training and OSS GenAI fine-tuning on Databricks, powered by the newly announced Serverless GPU Compute. We’ll cover how to use Serverless GPU Compute to power AI training and GenAI fine-tuning workloads, and framework support for libraries like LLM Foundry, Composer, Hugging Face, and more. Lastly, we’ll cover how to leverage MLflow and the Databricks Lakehouse to streamline the end-to-end development of these models. Key takeaways include:
- How Serverless GPU Compute saves customers valuable developer time and overhead when dealing with GPU infrastructure
- Best practices for training custom deep learning models (forecasting, recommendation, personalization) and fine-tuning OSS GenAI models on GPUs across the Databricks stack
- Leveraging distributed GPU training frameworks (e.g., PyTorch, Hugging Face) on Databricks
- Streamlining the path to production for these models

Join us to learn about the newly announced Serverless GPU Compute and the latest updates to GPU training and fine-tuning on Databricks!

Sponsored by: Cognizant | Toyota Utilizes a Unified Lakehouse Approach with Databricks

2025-06-10
talk
Rajesh Emani (Toyota Motors North America) , Satish Hegde (Cognizant)

Toyota, the world’s largest automaker, sought to accelerate time-to-data and empower business users with secure data collaboration for faster insights. Partnering with Cognizant, they established a Unified Data Lake, integrating SOX principles and Databricks Unity Catalog to ensure compliance and security. Additionally, they developed a Data Scanner solution to automatically detect non-sensitive data and accelerate data ingestion. Join this dynamic session to discover how they achieved it.

Unify Your Data and Governance With Lakehouse Federation

2025-06-10 Watch
talk
Zeashan Pappa (Databricks) , Fuat Can Efeoglu (Databricks)

In today's data landscape, organizations often grapple with fragmented data spread across various databases, data warehouses and catalogs. Lakehouse Federation addresses this challenge by enabling seamless discovery, querying, and governance of distributed data without the need for duplication or migration. This session will explore how Lakehouse Federation integrates external data sources like Hive Metastore, Snowflake, SQL Server and more into a unified interface, providing consistent access controls, lineage tracking and auditing across your entire data estate. Learn how to streamline analytics and AI workloads, enhance compliance and reduce operational complexity by leveraging a single, cohesive platform for all your data needs.
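
For flavor, the two Databricks SQL statements at the heart of this flow look roughly like the sketch below; the connection, secret scope, and database names are made up, and Snowflake is just one of the supported sources.

    # Hedged sketch: register an external source, then expose it as a catalog.
    spark.sql("""
        CREATE CONNECTION IF NOT EXISTS snowflake_conn TYPE snowflake
        OPTIONS (
          host '<account>.snowflakecomputing.com',
          port '443',
          sfWarehouse 'COMPUTE_WH',
          user secret('fed-scope', 'sf-user'),
          password secret('fed-scope', 'sf-password')
        )
    """)
    spark.sql("""
        CREATE FOREIGN CATALOG IF NOT EXISTS snowflake_sales
        USING CONNECTION snowflake_conn
        OPTIONS (database 'SALES')
    """)

    # Federated tables are then queried like any other Unity Catalog object.
    spark.sql("SELECT * FROM snowflake_sales.public.orders LIMIT 10").show()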

Advanced RAG Overview — Thawing Your Frozen RAG Pipeline

2025-06-10 Watch
talk
James Lin (Experian) , Jason Li (Experian)

The most common RAG systems rely on a frozen RAG system — one where there’s a single embedding model and single vector index. We’ve achieved a modicum of success with that, but when it comes to increasing accuracy for production systems there is only so much this approach solves. In this session we will explore how to move from the frozen systems to adaptive RAG systems which produce more tailored outputs with higher accuracy. Databricks services: Lakehouse, Unity Catalog, Mosaic, Sweeps, Vector Search, Agent Evaluation, Managed Evaluation, Inference Tables
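
To ground the idea, here is a toy, self-contained sketch of the adaptive step: route each query to one of several indexes rather than a single frozen one. The router rule, corpora, and bag-of-words "embedding" are stand-ins for a real classifier, vector index, and embedding model.

    import math
    from collections import Counter

    # Two toy corpora standing in for separate vector indexes (invented data).
    INDEXES = {
        "disputes": ["how to contest a charge", "dispute resolution timelines"],
        "general":  ["updating account details", "credit score factors explained"],
    }

    def embed(text):  # toy bag-of-words stand-in for an embedding model
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def route(query):  # in practice: a classifier or an LLM router
        return "disputes" if "charge" in query.lower() else "general"

    def retrieve(query, k=1):
        index = route(query)  # the adaptive step: pick an index per query
        q = embed(query)
        docs = INDEXES[index]
        return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

    print(retrieve("why was this charge disputed"))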

Empowering Healthcare Insights: A Unified Lakehouse Approach With Databricks

2025-06-10 Watch
talk
Bianca Stratulat (BJSS) , Mike Dobing (Databricks)

NHS England is revolutionizing healthcare research by enabling secure, seamless access to de-identified patient data through the Federated Data Platform (FDP). Despite vast data resources spread across regional and national systems, analysts struggle with fragmented, inconsistent datasets. Enter Databricks: powering a unified, virtual data lake with Unity Catalog at its core — integrating diverse NHS systems while ensuring compliance and security. By bridging AWS and Azure environments with a private exchange and leveraging the Iceberg connector to interface with Palantir, analysts gain scalable, reliable and governed access to vital healthcare data. This talk explores how this innovative architecture is driving actionable insights, accelerating research and ultimately improving patient outcomes.

How an Open, Scalable and Secure Data Platform is Powering Quick Commerce Swiggy's AI

2025-06-10 Watch
talk
Vasan Vembu Srini (Databricks) , Akash Agarwal (Swiggy)

Swiggy, India's leading quick commerce platform, serves ~13 million users across 653 cities, with 196,000 restaurant partners and 17,000 SKUs. To handle this scale, Swiggy developed a secure, scalable AI platform processing millions of predictions per second. The tech stack includes Apache Kafka for real-time streaming, Apache Spark on Databricks for analytics and ML, and Apache Flink for stream processing. The lakehouse architecture on Delta ensures data reliability, while Unity Catalog enables centralized access control and auditing. These technologies power critical AI applications like demand forecasting, route optimization, personalized recommendations, predictive delivery SLAs, and generative AI use cases. Key takeaway: This session explores building a data platform at scale, focusing on cost efficiency, simplicity, and speed, empowering Swiggy to seamlessly support millions of users and AI use cases.
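
A generic Structured Streaming skeleton for the Kafka-to-Delta leg of such a stack (broker, topic, and table names are illustrative, not Swiggy's):

    # Hedged sketch: stream Kafka events into a bronze Delta table.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")
        .option("subscribe", "order-events")
        .load()
    )

    (raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
        .writeStream
        .option("checkpointLocation", "/chk/order-events")
        .toTable("bronze.order_events"))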

Lakeflow Connect: Smarter, Simpler File Ingestion With the Next Generation of Auto Loader

2025-06-10 Watch
talk
Sandip Agarwala (Databricks) , Chavdar Botev (Databricks)

Auto Loader is the definitive tool for ingesting data from cloud storage into your lakehouse. In this session, we’ll unveil new features and best practices that simplify every aspect of cloud storage ingestion. We’ll demo out-of-the-box observability for pipeline health and data quality, walk through improvements for schema management, introduce a series of new data formats and unveil recent strides in Auto Loader performance. Along the way, we’ll provide examples and best practices for optimizing cost and performance. Finally, we’ll introduce a preview of what’s coming next — including a REST API for pushing files directly to Delta, a UI for creating cloud storage pipelines and more. Join us to help shape the future of file ingestion on Databricks.
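
For context, today's core Auto Loader pattern looks like this; paths and table names are invented, and the session's previewed features are not shown.

    # Hedged sketch: incremental JSON ingestion with Auto Loader.
    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/schemas/events")
        .load("s3://my-bucket/landing/events/")
        .writeStream
        .option("checkpointLocation", "/chk/events")
        .toTable("bronze.events"))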

Machine Learning Model Deployment

2025-06-10
talk

This course is designed to introduce three primary machine learning deployment strategies and illustrate the implementation of each strategy on Databricks. Following an exploration of the fundamentals of model deployment, the course delves into batch inference, offering hands-on demonstrations and labs for utilizing a model in batch inference scenarios, along with considerations for performance optimization. The second part of the course comprehensively covers pipeline deployment, while the final segment focuses on real-time deployment. Participants will engage in hands-on demonstrations and labs, deploying models with Model Serving and utilizing the serving endpoint for real-time inference. By mastering deployment strategies for a variety of use cases, learners will gain the practical skills needed to move machine learning models from experimentation to production. This course shows you how to operationalize AI solutions efficiently, whether it's automating decisions in real time or integrating intelligent insights into data pipelines.

Prerequisites: Familiarity with the Databricks workspace and notebooks; familiarity with Delta Lake and the lakehouse; intermediate-level knowledge of Python (e.g., common Python libraries for DS/ML like scikit-learn); awareness of model deployment strategies
Labs: Yes
Certification path: Databricks Certified Machine Learning Associate
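
As a taste of the batch inference strategy the course covers, here is a hedged MLflow sketch; the model URI, columns, and table names are hypothetical.

    import mlflow

    # Load a registered model as a Spark UDF for distributed batch scoring.
    predict = mlflow.pyfunc.spark_udf(
        spark, model_uri="models:/churn_model/Production"
    )

    scored = (
        spark.table("silver.customers")
        .withColumn("churn_score", predict("tenure", "monthly_charges"))
    )
    scored.write.mode("overwrite").saveAsTable("gold.customer_churn_scores")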

State Street Uses Databricks as a Cybersecurity Lakehouse for Threat Intelligence & Real-Time Alerts

2025-06-10 Watch
talk
Paul Signorelli (Databricks) , Ajish George (State Street)

Organizations face the challenge of managing vast amounts of data to combat emerging threats. The Databricks Data Intelligence Platform represents a paradigm shift in cybersecurity at State Street, providing a comprehensive solution for managing and analyzing diverse security data. Through its partnership with Databricks, State Street has created a capability to:
- Efficiently manage structured and unstructured data
- Scale up to analyze 50 petabytes of data in real time
- Ingest and parse critical security data streams
- Build advanced cybersecurity data products and use automation & orchestration to streamline cybersecurity operations

By leveraging these capabilities, State Street has positioned itself as a leader in the financial services industry when it comes to cybersecurity.

Transforming Credit Analytics With a Compliant Lakehouse at Rabobank

2025-06-10 Watch
talk
Taras Chaikovskyi (Databricks) , Floris Hendriks (Rabobank)

This presentation outlines Rabobank Credit Analytics' transition to a secure, audit-ready data architecture using Unity Catalog (UC), addressing critical regulatory challenges in credit analytics for IRB and IFRS 9 regulatory modeling. Key technical challenges included legacy infrastructure (Hive metastore, ADLS mounts using service principals and credential passthrough) that lacked granular access controls and data access auditing and offered limited visibility into lineage, creating governance and compliance gaps. We detail a framework for a phased migration to UC. Outcomes include data lineage mapping demonstrating compliance with regulatory requirements, granular role-based access control and unified audit trails. Next steps involve a lineage visualization toolkit (a custom app for impact analysis and reporting) and lineage expansion to incorporate upstream banking systems.