talk-data.com talk-data.com

Topic

Data Quality

data_management data_cleansing data_validation

37

tagged

Activity Trend

82 peak/qtr
2020-Q1 2026-Q1

Activities

Showing filtered results

Filtering by: Data + AI Summit 2025 ×
Lakeflow Observability: From UI Monitoring to Deep Analytics

Monitoring data pipelines is key to reliability at scale. In this session, we’ll dive into the observability experience in Lakeflow, Databricks’ unified DE solution — from intuitive UI monitoring to advanced event analysis, cost observability and custom dashboards. We’ll walk through the revamped UX for Lakeflow observability, showing how to: Monitor runs and task states, dependencies and retry behavior in the UI Set up alerts for job and pipeline outcomes + failures Use pipeline and job system tables for historical insights Explore run events and event logs for root cause analysis Analyze metadata to understand and optimize pipeline spend How to build custom dashboards using system tables to track performance data quality, freshness, SLAs and failure trends, and drive automated alerting based on real-time signals This session will help you unlock full visibility into your data workflows.

Health Data, Delivered: How Lakeflow Declarative Pipelines Powers the HealthVerity Marketplace

Building scalable, reliable ETL pipelines is a challenge for organizations managing large, diverse data sources. Theseus, our custom ETL framework, streamlines data ingestion and transformation by fully leveraging Databricks-native capabilities, including Lakeflow Declarative Pipelines, auto loader and event-driven orchestration. By decoupling supplier logic and implementing structured bronze, silver, and gold layers, Theseus ensures high-performance, fault-tolerant data processing with minimal operational overhead. The result? Faster time-to-value, simplified governance and improved data quality — all within a declarative framework that reduces engineering effort. In this session, we’ll explore how Theseus automates complex data workflows, optimizes cost efficiency and enhances scalability, showcasing how Databricks-native tools drive real business outcomes.

Automating Engineering with AI - LLMs in Metadata Driven Frameworks

The demand for data engineering keeps growing, but data teams are bored by repetitive tasks, stumped by growing complexity and endlessly harassed by an unrelenting need for speed. What if AI could take the heavy lifting off your hands? What if we make the move away from code-generation and into config-generation — how much more could we achieve? In this session, we’ll explore how AI is revolutionizing data engineering, turning pain points into innovation. Whether you’re grappling with manual schema generation or struggling to ensure data quality, this session offers practical solutions to help you work smarter, not harder. You’ll walk away with a good idea of where AI is going to disrupt the data engineering workload, some good tips around how to accelerate your own workflows and an impending sense of doom around the future of the industry!

Sponsored by: Anomalo | Reconciling IoT, Policy, and Insurer Data to Deliver Better Customer Discounts

As insurers increasingly leverage IoT data to personalize policy pricing, reconciling disparate datasets across devices, policies, and insurers becomes mission-critical. In this session, learn how Nationwide transitioned from prototype workflows in Dataiku to a hardened data stack on Databricks, enabling scalable data governance and high-impact analytics. Discover how the team orchestrates data reconciliation across Postgres, Oracle, and Databricks to align customer driving behavior with insurer and policy data—ensuring more accurate, fair discounts for policyholders. With Anomalo’s automated monitoring layered on top, Nationwide ensures data quality at scale while empowering business units to define custom logic for proactive stewardship. We’ll also look ahead to how these foundations are preparing the enterprise for unstructured data and GenAI initiatives.

Sponsored by: Soda Data Inc. | Clean Energy, Clean Data: How Data Quality Powers Decarbonization

Drawing on BDO Canada’s deep expertise in the electricity sector, this session explores how clean energy innovation can be accelerated through a holistic approach to data quality. Discover BDO’s practical framework for implementing data quality and rebuilding trust in data through a structured, scalable approach. BDO will share a real-world example of monitoring data at scale—from high-level executive dashboards to the details of daily ETL and ELT pipelines. Learn how they leveraged Soda’s data observability platform to unlock near-instant insights, and how they moved beyond legacy validation pipelines with built-in checks across their production Lakehouse. Whether you're a business leader defining data strategy or a data engineer building robust data products, this talk connects the strategic value of clean data with actionable techniques to make it a reality.

How do you transform a data pipeline from sluggish 10-hour batch processing into a real-time powerhouse that delivers insights in just 10 minutes? This was the challenge we tackled at one of France's largest manufacturing companies, where data integration and analytics were mission-critical for supply chain optimization. Power BI dashboards needed to refresh every 15 minutes. Our team struggled with legacy Azure Data Factory batch pipelines. These outdated processes couldn’t keep up, delaying insights and generating up to three daily incident tickets. We identified Lakeflow Declarative Pipelines and Databricks SQL as the game-changing solution to modernize our workflow, implement quality checks, and reduce processing times.In this session, we’ll dive into the key factors behind our success: Pipeline modernization with Lakeflow Declarative Pipelines: improving scalability Data quality enforcement: clean, reliable datasets Seamless BI integration: Using Databricks SQL to power fast, efficient queries in Power BI

Lakeflow in Production: CI/CD, Testing and Monitoring at Scale

Building robust, production-grade data pipelines goes beyond writing transformation logic — it requires rigorous testing, version control, automated CI/CD workflows and a clear separation between development and production. In this talk, we’ll demonstrate how Lakeflow, paired with Databricks Asset Bundles (DABs), enables Git-based workflows, automated deployments and comprehensive testing for data engineering projects. We’ll share best practices for unit testing, CI/CD automation, data quality monitoring and environment-specific configurations. Additionally, we’ll explore observability techniques and performance tuning to ensure your pipelines are scalable, maintainable and production-ready.

Sponsored by: Oxylabs | Web Scraping and AI: A Quiet but Critical Partnership

Behind every powerful AI system lies a critical foundation: fresh, high-quality web data. This session explores the symbiotic relationship between web scraping and artificial intelligence that's transforming how technical teams build data-intensive applications. We'll showcase how this partnership enables crucial use cases: analyzing trends, forecasting behaviors, and enhancing AI models with real-time information. Technical challenges that once made web scraping prohibitively complex are now being solved through the very AI systems they help create. You'll learn how machine learning revolutionizes web data collection, making previously impossible scraping projects both feasible and maintainable, while dramatically reducing engineering overhead and improving data quality. Join us to explore this quiet but critical partnership that's powering the next generation of AI applications.

How the Texas Rangers Use a Unified Data Platform to Drive World Class Baseball Analytics

Don't miss this session where we demonstrate how the Texas Rangers baseball team is staying one step ahead of the competition by going back to the basics. After implementing a modern data strategy with Databricks and winnng the 2023 World Series the rest of the league quickly followed suit. Now more than ever, data and AI are a central pillar of every baseball team's strategy driving profound insights into player performance and game dynamics. With a 'fundamentals win games' back to the basics focus, join us as we explain our commmitment to world-class data quality, engineering, and MLOPS by taking full advantage of the Databricks Data Intelligence Platform. From system tables to federated querying, find out how the Rangers use every tool at their disposal to stay one step ahead in the hyper competitive world of baseball.

Sponsored by: KPMG | Enhancing Regulatory Compliance through Data Quality and Traceability

In highly regulated industries like financial services, maintaining data quality is an ongoing challenge. Reactive measures often fail to prevent regulatory penalties, causing inaccuracies in reporting and inefficiencies due to poor data visibility. Regulators closely examine the origins and accuracy of reporting calculations to ensure compliance. A robust system for data quality and lineage is crucial. Organizations are utilizing Databricks to proactively improve data quality through rules-based and AI/ML-driven methods. This fosters complete visibility across IT, data management, and business operations, facilitating rapid issue resolution and continuous data quality enhancement. The outcome is quicker, more accurate, transparent financial reporting. We will detail a framework for data observability and offer practical examples of implementing quality checks throughout the data lifecycle, specifically focusing on creating data pipelines for regulatory reporting,

Optimizing Analytics Infrastructure: Lessons from Migrating Snowflake to Databricks

This session explores the strategic migration from Snowflake to Databricks, focusing on the journey of transforming a data lake to leverage Databricks’ advanced capabilities. It outlines the assessment of key architectural differences, performance benchmarks, and cost implications driving the decision. Attendees will gain insights into planning and execution, including data ingestion pipelines, schema conversion and metadata migration. Challenges such as maintaining data quality, optimizing compute resources and minimizing downtime are discussed, alongside solutions implemented to ensure a seamless transition. The session highlights the benefits of unified analytics and enhanced scalability achieved through Databricks, delivering actionable takeaways for similar migrations.

Fueling Efficiency: How Pilot Uses Vector Stores, Data Quality, and GenAI to Deliver Business Value

In the complex world of logistics, efficiency and accuracy are paramount. At Pilot, the largest travel center network in North America, managing fuel delivery operations was a time-intensive and error-prone process. Tasks like processing delivery records and validating fuel transaction data posed significant challenges due to the diverse formats and handwritten elements involved. After several attempts to use robotic process automation failed, the team turned to Generative AI to automate and streamline this critical business process. In this session, discover how Pilot leverages GenAI, powered by advanced text and vision models, to revolutionize BOL processing. By implementing few-shot learning and vectorized examples, the data team at Pilot was able to increase document parsing accuracy from 70% to 95%, enabling real-time validation against truck driver inputs, which has resulted in millions of savings from accelerating credit reconciliation and improved financial operations.

Build AI-Powered Applications Natively on Databricks

Discover how to build and deploy AI-powered applications natively on the Databricks Data Intelligence Platform. This session introduces best practices and a standard reference architecture for developing production-ready apps using popular frameworks like Dash, Shiny, Gradio, Streamlit and Flask. Learn how to leverage agents for orchestration and explore primary use cases supported by Databricks Apps, including data visualization, AI applications, self-service analytics and data quality monitoring. With serverless deployment and built-in governance through Unity Catalog, Databricks Apps enables seamless integration with your data and AI models, allowing you to focus on delivering impactful solutions without the complexities of infrastructure management. Whether you're a data engineer or an app developer, this session will equip you with the knowledge to create secure, scalable and efficient applications within a Databricks environment.

Busting Data Modeling Myths: Truths and Best Practices for Data Modeling in the Lakehouse

Unlock the truth behind data modeling in Databricks. This session will tackle the top 10 myths surrounding relational and dimensional data modeling. Attendees will gain a clear understanding of what Databricks Lakehouse truly supports today, including how to leverage primary and foreign keys, identity columns for surrogate keys, column-level data quality constraints and much more. This session will talk through the lens of medallion architecture, explaining how to implement data models across bronze, silver, and gold tables. Whether you’re migrating from a legacy warehouse or building new analytics solutions, you’ll leave equipped to fully leverage Databricks’ capabilities, and design scalable, high-performance data models for enterprise analytics.

Sponsored by: Informatica | Modernize analytics and empower AI in Databricks with trusted data using Informatica

As enterprises continue their journey to the cloud, data warehouse and data management modernization is essential to optimize analytics and drive business outcomes. Minimizing modernization timelines is important for reducing risk and shortening time to value – and ensuring enterprise data is clean, curated and governed is imperative to enable analytics and AI initiatives. In this session, learn how Informatica's Intelligent Data Management Cloud (IDMC) empowers analytics and AI on Databricks by helping data teams: · Develop no-code/low-code data pipelines that ingest, transform and clean data at enterprise scale · Improve data quality and extend enterprise governance with Informatica Cloud Data Governance and Catalog (CDGC) and Unity Catalog · Accelerate pilot-to-production with Mosaic AI

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!

Accelerating Data Transformation: Best Practices for Governance, Agility and Innovation

In this session, we will share NCS’s approach to implementing a Databricks Lakehouse architecture, focusing on key lessons learned and best practices from our recent implementations. By integrating Databricks SQL Warehouse, the DBT Transform framework and our innovative test automation framework, we’ve optimized performance and scalability, while ensuring data quality. We’ll dive into how Unity Catalog enabled robust data governance, empowering business units with self-serve analytical workspaces to create insights while maintaining control. Through the use of solution accelerators, rapid environment deployment and pattern-driven ELT frameworks, we’ve fast-tracked time-to-value and fostered a culture of innovation. Attendees will gain valuable insights into accelerating data transformation, governance and scaling analytics with Databricks.

Achieving AI Success with a Solid Data Foundation

Join for an insightful presentation on creating a robust data architecture to drive business outcomes in the age of Generative AI. Santosh Kudva, GE Vernova Chief Data Officer and Kevin Tollison, EY AI Consulting Partner, will share their expertise on transforming data strategies to unleash the full potential of AI. Learn how GE Vernova, a dynamic enterprise born from the 2024 spin-off of GE, revamped its diverse landscape. They will provide a look into how they integrated the pre-spin-off Finance Data Platform into the GE Vernova Enterprise Data & Analytics ecosystem utilizing Databricks to enable high-performance AI-led analytics. Key insights include: Incorporating Generative AI into your overarching strategy Leveraging comprehensive analytics to enhance data quality Building a resilient data framework adaptable to continuous evolution Don't miss this opportunity to hear from industry leaders and gain valuable insights to elevate your data strategy and AI success.

Enterprise Financial Crime Detection: A Lakehouse Framework for FATF, Basel III, and BSA Compliance

We will present a framework for FinCrime detection leveraging Databricks lakehouse architecture specifically how institutions can achieve both data flexibility & ACID transaction guarantees essential for FinCrime monitoring. The framework incorporates advanced ML models for anomaly detection, pattern recognition, and predictive analytics, while maintaining clear data lineage & audit trails required by regulatory bodies. We will also discuss some specific improvements in reduction of false positives, improvement in detection speed, and faster regulatory reporting, delve deep into how the architecture addresses specific FATF recommendations, Basel III risk management requirements, and BSA compliance obligations, particularly in transaction monitoring and SAR. The ability to handle structured and unstructured data while maintaining data quality and governance makes it particularly valuable for large financial institutions dealing with complex, multi-jurisdictional compliance requirements.