talk-data.com talk-data.com

Topic

Data Lakehouse

data_architecture data_warehouse data_lake

489

tagged

Activity Trend

118 peak/qtr
2020-Q1 2026-Q1

Activities

489 activities · Newest first

Unified Governance and Enterprise Sharing for Data + AI

The Databricks Lakehouse for Public Sector is the only enterprise data platform that allows you to leverage all your data, from any source, on any workload to always offer better citizen services/warfighter support/student success with the best outcomes, at the lowest cost, with the greatest investment protection.

What’s new with Collaboration: Delta Sharing, Clean Room, Marketplace and the Ecosystem

Databricks continues to redefine how organizations securely and openly collaborate on data. With new innovations like Clean Rooms for multi-party collaboration, Sharing for Lakehouse Federation, cross-platform view sharing and Databricks Apps in the Marketplace, teams can now share and access data more easily, cost-effectively and across platforms — whether or not they’re using Databricks. In this session, we’ll deliver live demos of key capabilities that power this transformation: Delta Sharing: The industry’s only open protocol for seamless cross-platform data sharing Databricks Marketplace: A central hub for discovering and monetizing data and AI assets Clean Rooms: A privacy-preserving solution for secure, multi-party data collaboration Join us to see how these tools enable trusted data sharing, accelerate insights and drive innovation across your ecosystem. Bring your questions and walk away with practical ways to put these capabilities into action today.

Summary In this episode of the Data Engineering Podcast Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.This is a pharmaceutical Ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again.Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th. This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.Your host is Tobias Macey and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloadsInterview IntroductionHow did you get involved in the area of data management?Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?What are the foundational architectural modifications that you had to make to enable those capabilities?For the vector storage and indexing, what modifications did you have to make to iceberg?What was your reasoning for not using a format like Lance?For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?When is Starburst/lakehouse the wrong choice for a given AI use case?What do you have planned for the future of AI on Starburst?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links StarburstPodcast EpisodeAWS AthenaMCP == Model Context ProtocolLLM Tool UseVector EmbeddingsRAG == Retrieval Augmented GenerationAI Engineering Podcast EpisodeStarburst Data ProductsLanceLanceDBParquetORCpgvectorStarburst IcehouseThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Enterprise Financial Crime Detection: A Lakehouse Framework for FATF, Basel III, and BSA Compliance

We will present a framework for FinCrime detection leveraging Databricks lakehouse architecture specifically how institutions can achieve both data flexibility & ACID transaction guarantees essential for FinCrime monitoring. The framework incorporates advanced ML models for anomaly detection, pattern recognition, and predictive analytics, while maintaining clear data lineage & audit trails required by regulatory bodies. We will also discuss some specific improvements in reduction of false positives, improvement in detection speed, and faster regulatory reporting, delve deep into how the architecture addresses specific FATF recommendations, Basel III risk management requirements, and BSA compliance obligations, particularly in transaction monitoring and SAR. The ability to handle structured and unstructured data while maintaining data quality and governance makes it particularly valuable for large financial institutions dealing with complex, multi-jurisdictional compliance requirements.

Sponsored by: Onehouse | Open By Default, Fast By Design: One Lakehouse That Scales From BI to AI

You already see the value of the lakehouse. But are you truly maximizing its potential across all workloads, from BI to AI? In this session, Onehouse unveils how our open lakehouse architecture unifies your entire stack, enabling true interoperability across formats, catalogs, and engines. From lightning-fast ingestion at scale to cost-efficient processing and multi-catalog sync, Onehouse helps you go beyond trade-offs. Discover how Apache XTable (Incubating) enables cross-table-format compatibility, how OpenEngines puts your data in front of the best engine for the job, and how OneSync keeps data consistent across Snowflake, Athena, Redshift, BigQuery, and more. Meanwhile, our purpose-built lakehouse runtime slashes ingest and ETL costs. Whether you’re delivering BI, scaling AI, or building the next big thing, you need a lakehouse that’s open and powerful. Onehouse opens everything—so your data can power anything.

Zillow has well-established, comprehensive systems for defining and enforcing data quality contracts and detecting anomalies.In this session, we will share how we evaluated Databricks’ native data quality features and why we chose Lakeflow Declarative Pipelines expectations for Lakeflow Declarative Pipelines, along with a combination of enforced constraints and self-defined queries for other job types. Our evaluation considered factors such as performance overhead, cost and scalability. We’ll highlight key improvements over our previous system and demonstrate how these choices have enabled Zillow to enforce scalable, production-grade data quality.Additionally, we are actively testing Databricks’ latest data quality innovations, including enhancements to lakehouse monitoring and the newly released DQX project from Databricks Labs.In summary, we will cover Zillow’s approach to data quality in the lakehouse, key lessons from our migration and actionable takeaways.

Sponsored by: Actian | Beyond the Lakehouse: Unlocking Enterprise-Wide AI-Ready Data with Unified Metadata Intelligence

As organizations scale AI initiatives on platforms like Databricks, one challenge remains: bridging the gap between the data in the lakehouse and the vast, distributed data that lives elsewhere. Turning massive volumes of technical metadata into trusted, business-ready insight requires more than cataloging what's inside the lakehouse—it demands true enterprise-wide intelligence. Actian CTO Emma McGrattan will explore how combining Databricks Unity Catalog with the Actian Data Platform extends visibility, governance, and trust beyond the lakehouse. Learn how leading enterprises are: Integrating metadata across all enterprise data assets for complete visibility Enriching Unity Catalog metadata with business context for broader usability Empowering non-technical users to discover, trust, and act on AI-ready data Building a foundation for scalable data productization with governance by design

Sponsored by: Slalom | Nasdaq's Journey from Fragmented Customer Data to AI-Ready Insights

Nasdaq’s rapid growth through acquisitions led to fragmented client data across multiple Salesforce instances, limiting cross-sell potential and sales insights. To solve this, Nasdaq partnered with Slalom to build a unified Client Data Hub on the Databricks Lakehouse Platform. This cloud-based solution merges CRM, product usage, and financial data into a consistent, 360° client view accessible across all Salesforce orgs with bi-directional integration. It enables personalized engagement, targeted campaigns, and stronger cross-sell opportunities across all business units. By delivering this 360 view directly in Salesforce, Nasdaq is improving sales visibility, client satisfaction, and revenue growth. The platform also enables advanced analytics like segmentation, churn prediction, and revenue optimization. With centralized data in Databricks, Nasdaq is now positioned to deploy next-gen Agentic AI and chatbots to drive efficiency and enhance sales and marketing experiences.

Unlocking the Power of Iceberg: Our Journey to a Unified Lakehouse on Databricks

This session showcases our journey of adopting Apache Iceberg™ to build a modern lakehouse architecture and leveraging Databricks advanced Iceberg support to take it to the next level. We’ll dive into the key design principles behind our lakehouse, the operational challenges we tackled and how Databricks enabled us to unlock enhanced performance, scalability and streamlined data workflows. Whether you’re exploring Apache Iceberg™ or building a lakehouse on Databricks, this session offers actionable insights, lessons learned and best practices for modern data engineering.

Databricks on Databricks: Powering Marketing Insights with Lakehouse

This presentation outlines the evolution of our marketing data strategy, focusing on how we’ve built a strong foundation using the Databricks Lakehouse. We will explore key advancements across data ingestion, strategy, and insights, highlighting the transition from legacy systems to a more scalable and intelligent infrastructure. Through real-world applications, we will showcase how unified Customer 360 insights drive personalization, predictive analytics enhance campaign effectiveness, and GenAI optimizes content creation and marketing execution. Looking ahead, we will demonstrate the next phase of our CDP, the shift toward an end-user-first analytics model powered by AIBI, Genie and Matik, and the growing importance of clean rooms for secure data collaboration. This is just the beginning, and we are poised to unlock even greater capabilities in the future.

Understanding customer engagement and retention isn’t optional—it’s mission-critical. Join us for a live demo to see how you can build a scalable, governed customer health scoring model by transforming raw signals into actionable insights. Discover how Coalesce’s low-code development platform works seamlessly with Databricks’ lakehouse architecture to unify and operationalize customer data at scale. With built-in governance, automation, and metadata intelligence, you’ll deliver trusted scores that support proactive decision-making across the business. Why Attend? Accelerate time-to-insight with automated, low-code transformations Build repeatable, enterprise-grade scoring models with full data lineage Ensure governance, transparency, and compliance at every step

Kernel, Catalog, Action! Reimagining our Delta-Spark Connector with DSv2

Delta Lake is redesigning its Spark connector through the combination of three key technologies: First, we're updating our Spark APIs to DSv2 to achieve deeper catalog integration and improved integration with the Spark optimizer. Second, we're fully integrating on top of Delta Kernel to take advantage of its simplified abstraction of Delta protocol complexities, accelerating feature adoption and improving maintainability. Third, we are transforming Delta to become a catalog-aware lakehouse format with Catalog Commits, enabling more efficient metadata management, governance and query performance. Join us to explore how we're advancing Delta Lake's architecture, pushing the boundaries of metadata management and creating a more intelligent, performant data lakehouse platform.

AI-Driven Drug Discovery: Accelerating Molecular Insights With NVIDIA and Databricks

This session is repeated. In the race to revolutionize healthcare and drug discovery, biopharma companies are turning to AI to streamline workflows and unlock new scientific insights. This session, we will explore how NVIDIA BioNeMo, combined with Databricks Delta Lakehouse, can be used for advancing drug discovery for critical applications like molecular structure modeling, protein folding and diagnostics. We’ll demonstrate how BioNeMo pre-trained models can run inference on data securely stored in Delta Lake, delivering actionable insights. By leveraging containerized solutions on Databricks’ ML Runtime with GPU acceleration, users can achieve significant performance gains compared to traditional CPU-based computation.

AI Powering Epsilon's Identity Strategy: Unified Marketing Platform on Databricks

Join us to hear about how Epsilon Data Management migrated Epsilon’s unique, AI-powered marketing identity solution from multi-petabyte on-prem Hadoop and data warehouse systems to a unified Databricks Lakehouse platform. This transition enabled Epsilon to further scale its Decision Sciences solution and enable new cloud-based AI research capabilities on time and within budget, without being bottlenecked by the resource constraints of on-prem systems. Learn how Delta Lake, Unity Catalog, MLflow and LLM endpoints powered massive data volume, reduced data duplication, improved lineage visibility, accelerated Data Science and AI, and enabled new data to be immediately available for consumption by the entire Epsilon platform in a privacy-safe way. Using the Databricks platform as the base for AI and Data Science at global internet scale, Epsilon deploys marketing solutions across multiple cloud providers and multiple regions for many customers.

From Days to Seconds — Reducing Query Times on Large Geospatial Datasets by 99%

The Global Water Security Center translates environmental science into actionable insights for the U.S. Department of Defense. Prior to incorporating Databricks, responding to these requests required querying approximately five hundred thousand raster files representing over five hundred billion points. By leveraging lakehouse architecture, Databricks Auto Loader, Spark Streaming, Databricks Spatial SQL, H3 geospatial indexing and Databricks Liquid Clustering, we were able to drastically reduce our “time to analysis” from multiple business days to a matter of seconds. Now, our data scientists execute queries on pre-computed tables in Databricks, resulting in a “time to analysis” that is 99% faster, giving our teams more time for deeper analysis of the data. Additionally, we’ve incorporated Databricks Workflows, Databricks Asset Bundles, Git and Git Actions to support CI/CD across workspaces. We completed this work in close partnership with Databricks.

Geo-Powering Insights: The Art of Spatial Data Integration and Visualization

In this presentation, we will explore how to leverage Databricks' SQL engine to efficiently ingest and transform geospatial data. We'll demonstrate the seamless process of connecting to external systems such as ArcGIS to retrieve datasets, showcasing the platform's versatility in handling diverse data sources. We'll then delve into the power of Databricks Apps, illustrating how you can create custom geospatial dashboards using various frameworks like Streamlit and Flask, or any framework of your choice. This flexibility allows you to tailor your visualizations to your specific needs and preferences. Furthermore, we'll highlight the Databricks Lakehouse's integration capabilities with popular dashboarding tools such as Tableau and Power BI. This integration enables you to combine the robust data processing power of Databricks with the advanced visualization features of these specialized tools.

Unity Catalog Upgrades Made Easy. Step-by-Step Guide for Databricks Labs UCX

The Databricks labs project UCX aims to optimize the Unity Catalog (UC) upgrade process, ensuring a seamless transition for businesses. This session will delve into various aspects of the UCX project including the installation and configuration of UCX, the use of the UCX Assessment Dashboard to reduce upgrade risks and prepare effectively for a UC upgrade, and the automation of key components such as group, table and code migration. Attendees will gain comprehensive insights into leveraging UCX and Lakehouse Federation for a streamlined and efficient upgrade process. This session is aimed at customers new to UCX as well as veterans.

Gaining Insight From Image Data in Databricks Using Multi-Modal Foundation Model API

Unlock the hidden potential in your image data without specialized computer vision expertise! This session explores how to leverage Databricks' multi-modal Foundation Model APIs to analyze, classify and extract insights from visual content. Learn how Databricks provides a unified API to understand images using powerful foundation models within your data workflows. Key takeaways: Implementing efficient workflows for image data processing within your Databricks lakehouse Understanding multi-modal foundation models for image understanding Integrating image analysis with other data types for business insights Using OpenAI-compatible APIs to query multi-modal models Building end-to-end pipelines from image ingestion to model deployment Whether analyzing product images, processing visual documents or building content moderation systems, you'll discover how to extract valuable insights from your image data within the Databricks ecosystem.

Powering Personalization at Scale with Data: How T-Mobile and Deep Sync Help Brands Connect with Consumers

Discover how T-Mobile and Deep Sync are redefining personalized marketing through the power of Databricks. Deep Sync, a leader in deterministic identity solutions, has brought its identity spine to Databricks Lakehouse, which covers over 97% of U.S. households with the most current and accurate attribute data available. T-Mobile is bringing to market for the first time a new data services business that introduces privacy-compliant, consent-based consumer data. Together, T-Mobile and Deep Sync are transforming how brands engage with consumers—enabling bespoke, hyper-personalized workflows, identity-driven insights, and closed-loop measurement through Databricks’ Multi-Party Cleanrooms. Join this session to learn how data and identity are converging to solve today’s modern marketing challenges so consumers can rediscover what it feels like to be seen, not targeted