talk-data.com

Topic: Data Quality

Tags: data_management, data_cleansing, data_validation (116 tagged activities)

Activity trend: peak of 82 activities per quarter, 2020-Q1 to 2026-Q1

Activities

116 activities · Newest first

AWS re:Invent 2025 - Customize AI models & accelerate time to production with Amazon SageMaker AI

Customizing models often requires lengthy iteration cycles. Now with Amazon SageMaker AI, you can accelerate the model customization process from months to days. With an easy-to-use interface, you can quickly get started and customize popular models with your own data, including Amazon Nova, Llama, Qwen, DeepSeek, and GPT-OSS, with the latest customization techniques such as reinforcement learning and direct preference optimization. In addition, with the AI agent-guided workflow (in preview), you can use natural language to generate synthetic data, analyze data quality, and handle model training and evaluation—all entirely serverless. Join us to learn how you can accelerate your model customization journey.


AWS re:Invent 2025 - A practitioner’s guide to data for agentic AI (DAT315)

In this session, gain the skills needed to deploy end-to-end agentic AI applications using your most valuable data. The session focuses on data management approaches such as Model Context Protocol (MCP) and Retrieval Augmented Generation (RAG), and introduces concepts that apply to other methods of customizing agentic AI applications. Discover best-practice architectures using AWS database services like Amazon Aurora and Amazon OpenSearch Service, along with the analytical, data processing, and streaming experiences found in SageMaker Unified Studio. Learn data lake, governance, and data quality concepts, and see how Amazon Bedrock AgentCore, Amazon Bedrock Knowledge Bases, and other features tie the solution components together.
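
As a toy illustration of the RAG pattern the session covers, the sketch below folds retrieved context into a prompt before the model is called. Everything here is a stand-in: the corpus, the naive token-overlap scorer, and the prompt template are hypothetical, in place of a real embedding model plus a vector store such as Aurora with pgvector or OpenSearch Service.

```python
# Minimal, self-contained sketch of the Retrieval Augmented Generation (RAG) pattern.
# The corpus, scoring function, and prompt template are illustrative stand-ins for a
# real embedding model plus a vector store; they are not AWS APIs.

CORPUS = {
    "refund-policy": "Refunds are issued within 14 days of purchase with a valid receipt.",
    "shipping": "Standard shipping takes 3-5 business days; expedited takes 1-2 days.",
    "warranty": "Hardware is covered by a 12-month limited warranty.",
}

def score(query: str, document: str) -> float:
    """Naive relevance score: token overlap between query and document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / (len(q_tokens) or 1)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k most relevant documents for the query."""
    ranked = sorted(CORPUS.items(), key=lambda item: score(query, item[1]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query: str) -> str:
    """Ground the agent's prompt in retrieved context before calling an LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    print(build_prompt("How long do refunds take?"))
```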


A Women-Led Case Study in Applied Data Analytics with Mariah Marr & Michelle Sullivan

While data analytics is often viewed as a highly technical field, one of its most challenging aspects lies in identifying the right questions to ask. Beyond the expected skills of summarizing data, building visualizations, and generating insights, analysts must also bridge the gap between complex data and non-technical stakeholders.

This presentation features a case study led by two women from the Research and Data Analytics team at the Minnesota Department of Labor and Industry. It illustrates the end-to-end process of transforming raw data to create a fully developed dashboard that delivers actionable insights for the department’s Apprenticeship unit.

We will share key challenges encountered along the way, from handling issues of data quality and accessibility to adapting the tool for the differing needs and expectations of new stakeholders. Attendees will leave with actionable strategies for transforming messy datasets into clear, impactful dashboards that drive smarter decision making.

Operationalizing Responsible AI and Data Science in Healthcare with Nasibeh Zanirani Farahani

As healthcare organizations accelerate their adoption of AI and data-driven systems, the challenge lies not only in innovation but in responsibly scaling these technologies within clinical and operational workflows. This session examines the technical and governance frameworks required to translate AI research into reliable and compliant real-world applications. We will explore best practices in model lifecycle management, data quality assurance, bias detection, regulatory alignment, and human-in-the-loop validation, grounded in lessons from implementing AI solutions across complex healthcare environments. Emphasizing cross-functional collaboration among clinicians, data scientists, and business leaders, the session highlights how to balance technical rigor with clinical relevance and ethical accountability. Attendees will gain actionable insights into building trustworthy AI pipelines, integrating MLOps principles in regulated settings, and delivering measurable improvements in patient care, efficiency, and organizational learning.

Wrangling Internet-scale Image Datasets

Building and curating datasets at internet scale is both powerful and messy. At Irreverent Labs, we recently released Re-LAION-Caption19M, a 19-million-image dataset with improved captions, alongside a companion arXiv paper. Behind the scenes, the project involved wrangling terabytes of raw data and designing pipelines that could produce a research-quality dataset while remaining resilient, efficient, and reproducible. In this talk, we’ll share some of the practical lessons we learned while engineering data at this scale. Topics include: strategies for ensuring data quality through a mix of automated metrics and human inspection; why building file manifests pays off when dealing with millions of files; effective use of Parquet, WebDataset (WDS), and JSONL for metadata and intermediate results; pipeline patterns that favor parallel processing and fault tolerance; and how logging and dashboards can turn long-running jobs from opaque to observable. Whether you’re working with images, text, or any other massive dataset, these patterns and pitfalls may help you design pipelines that are more robust, maintainable, and researcher-friendly.
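
As a rough illustration of the file-manifest idea mentioned above, a manifest can be built once and written to Parquet so later stages never have to re-list or re-hash millions of files. The directory layout, columns, and SHA-256 choice below are assumptions, not details of the Re-LAION-Caption19M pipeline.

```python
# Hypothetical sketch: build a Parquet manifest of an image directory so downstream
# pipeline stages can skip re-listing and re-hashing files. Directory name, columns,
# and the SHA-256 choice are illustrative assumptions.
import hashlib
from pathlib import Path

import polars as pl

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large images never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root: str, out_path: str = "manifest.parquet") -> pl.DataFrame:
    records = [
        {
            "path": str(p.relative_to(root)),
            "size_bytes": p.stat().st_size,
            "sha256": sha256_of(p),
        }
        for p in sorted(Path(root).rglob("*.jpg"))
    ]
    manifest = pl.DataFrame(records)
    manifest.write_parquet(out_path)  # columnar, compressed, easy to re-load and join later
    return manifest

if __name__ == "__main__":
    print(build_manifest("images/").head())
```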

Freedom through structure: How WHOOP scales analyst autonomy with dbt

AI and dbt unlock the potential for any data analyst to work like a full-stack dbt developer. But without the right guardrails, that freedom can quickly turn into chaos and technical debt. At WHOOP, we embraced analyst autonomy and scaled it responsibly. In this session, you’ll learn how we empowered analysts to build in dbt while protecting data quality, staying aligned with the broader team, and avoiding technical debt. If you’re looking to give analysts more ownership without giving up control, this session will show you how to get there.

Towards a more perfect pipeline: CI/CD in the dbt Platform
talk
by Aaiden Witten (United Services Automobile Association), Michael Sturm (United Services Automobile Association), Timothy Shiveley (United Services Automobile Association)

In this session, we’ll show how we integrated dbt jobs into our CI/CD pipelines to validate data and run tests on every merge request. Attendees will walk away with a blueprint for implementing CI/CD for dbt, lessons learned from our journey, and best practices to keep data quality high without slowing down development.
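
The abstract doesn’t spell out USAA’s exact setup, but a common shape for this kind of merge-request job is dbt’s “Slim CI” pattern: build and test only the models changed relative to production, using the production manifest as the comparison state. A minimal sketch follows; the artifact path and target name are assumptions.

```python
# Hypothetical Slim CI step for a merge request: build and test only models that
# changed relative to production. Assumes a production manifest.json has been
# downloaded to ./prod-artifacts and that a "ci" target exists in profiles.yml.
import subprocess
import sys

def run_dbt_ci() -> int:
    cmd = [
        "dbt", "build",
        "--select", "state:modified+",   # changed models plus everything downstream
        "--state", "./prod-artifacts",   # production manifest to diff against
        "--target", "ci",
        "--fail-fast",                   # stop on the first build or test failure
    ]
    return subprocess.run(cmd, check=False).returncode

if __name__ == "__main__":
    sys.exit(run_dbt_ci())
```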

Decisions you can count on: AI + dbt at DocuSign

AI is reshaping every stage of the analytics process. And at Docusign, that transformation is already underway. The data team is using AI to boost data quality, save engineers time, and deliver insights business users can actually trust. This session takes you end to end, from automated unit tests to governed metrics, showing how Docusign connects AI-driven development with self-serve analytics powered by the dbt Semantic Layer. The result: faster delivery, fewer surprises, and smarter decisions across the org.

Opening keynote: Rewrite
talk
by Ken Ostner (dbt Labs), Jonny Reichwald (EQT), Jerrie Kumalah Kenney (dbt Labs), Faith McKenna (dbt Labs), Tristan Handy (dbt Labs), Øyvind Barsnes Eraker (Norges Bank Investment Management), Elias DeFaria (dbt Labs), George Fraser (Fivetran)

Let’s kick off Coalesce 2025 with a bang! At our opening keynote, you’ll hear how dbt is redefining data work in the age of AI. dbt Labs CEO and Founder Tristan Handy will share his perspective on how AI is rapidly pushing our industry forward and how the dbt Fusion engine is redefining the dbt standard. Tristan will also be joined by Fivetran CEO George Fraser to discuss their shared vision of an open data infrastructure. dbt Labs product and technical leaders will also share the latest innovations in dbt, including how Fusion is making the development experience more efficient and intelligent than ever before, and our progress on embedded cost optimization and AI experiences designed to help users expedite workflows while keeping costs and data quality in check. Throughout, dbt customers and partners will connect dbt’s vision and product roadmap with real-world customer outcomes. Get a front-row seat to the action. For our Coalesce Online attendees, join us on Slack in #coalesce-2025 to stay connected during the keynote!

How to do real TDD in data science? A journey from pandas to polars with pelage!

In the world of data, inconsistencies and inaccuracies often present a major challenge to extracting valuable insights, yet robust tools and practices to address these issues remain limited. In particular, test-driven development (TDD), a standard practice in classic software development, remains difficult in data science, partly because of poorly adapted tools and frameworks.

To address this, we released Pelage, an open-source Python package that facilitates data exploration and testing and relies on Polars’ intuitive syntax and speed. Pelage helps data scientists and analysts streamline data transformations, enhance data quality, and improve code clarity.

We will demonstrate, in a test-first approach, how you can use this library in a meaningful data science workflow to gain greater confidence for your data transformations.

See website: https://alixtc.github.io/pelage/
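
To avoid misstating Pelage’s own API, here is a hedged sketch of the test-first workflow in plain Polars and pytest: write the expectation for the transformation before the transformation itself. Pelage’s chainable checks slot into the same place as the hand-rolled assertions below; the dataset and rules are illustrative.

```python
# Test-first sketch of a data transformation using plain Polars + pytest.
# The dataset and rules are illustrative; Pelage's chainable checks (via .pipe)
# would replace the hand-rolled assertions without changing the workflow.
import polars as pl

def clean_orders(df: pl.DataFrame) -> pl.DataFrame:
    """Drop rows with missing customer ids and keep one row per order id."""
    return (
        df.filter(pl.col("customer_id").is_not_null())
          .unique(subset="order_id", keep="first")
    )

# --- the test is written first, encoding the data contract ---
def test_clean_orders_removes_nulls_and_duplicates():
    raw = pl.DataFrame(
        {
            "order_id": [1, 1, 2, 3],
            "customer_id": ["a", "a", None, "b"],
        }
    )
    cleaned = clean_orders(raw)

    assert cleaned["customer_id"].null_count() == 0          # no missing customers
    assert cleaned["order_id"].n_unique() == cleaned.height  # order_id is unique
```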

Lakeflow Observability: From UI Monitoring to Deep Analytics

Monitoring data pipelines is key to reliability at scale. In this session, we’ll dive into the observability experience in Lakeflow, Databricks’ unified data engineering solution — from intuitive UI monitoring to advanced event analysis, cost observability, and custom dashboards. We’ll walk through the revamped UX for Lakeflow observability, showing how to:

- Monitor runs, task states, dependencies, and retry behavior in the UI
- Set up alerts for job and pipeline outcomes and failures
- Use pipeline and job system tables for historical insights
- Explore run events and event logs for root cause analysis
- Analyze metadata to understand and optimize pipeline spend
- Build custom dashboards using system tables to track performance, data quality, freshness, SLAs, and failure trends, and drive automated alerting based on real-time signals

This session will help you unlock full visibility into your data workflows.
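
For instance, a quick look at recent run failures might start from a Lakeflow job-run system table. The table and column names below are assumptions about the system.lakeflow schema and may differ in your workspace.

```python
# Hypothetical sketch: summarize recent job-run failures from Lakeflow system tables.
# Table and column names are assumptions about the system.lakeflow schema; check what
# your workspace actually exposes before relying on them.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, this returns the notebook session

failures_last_week = spark.sql(
    """
    SELECT job_id,
           COUNT(*)             AS failed_runs,
           MAX(period_end_time) AS last_failure
    FROM system.lakeflow.job_run_timeline
    WHERE result_state = 'FAILED'
      AND period_end_time >= current_timestamp() - INTERVAL 7 DAYS
    GROUP BY job_id
    ORDER BY failed_runs DESC
    """
)

failures_last_week.show(truncate=False)  # the same query can back a dashboard tile or alert
```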

Health Data, Delivered: How Lakeflow Declarative Pipelines Powers the HealthVerity Marketplace

Building scalable, reliable ETL pipelines is a challenge for organizations managing large, diverse data sources. Theseus, our custom ETL framework, streamlines data ingestion and transformation by fully leveraging Databricks-native capabilities, including Lakeflow Declarative Pipelines, Auto Loader, and event-driven orchestration. By decoupling supplier logic and implementing structured bronze, silver, and gold layers, Theseus ensures high-performance, fault-tolerant data processing with minimal operational overhead. The result? Faster time-to-value, simplified governance, and improved data quality — all within a declarative framework that reduces engineering effort. In this session, we’ll explore how Theseus automates complex data workflows, optimizes cost efficiency, and enhances scalability, showcasing how Databricks-native tools drive real business outcomes.
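
The abstract doesn’t include code, but the bronze/silver pattern it describes typically looks like the declarative sketch below. The source path, table names, and quality rules are hypothetical, and the dlt module only resolves when the file runs inside a Lakeflow/Delta Live Tables pipeline.

```python
# Hypothetical bronze/silver sketch using Lakeflow Declarative Pipelines (Delta Live
# Tables) with Auto Loader. Paths, table names, and expectations are illustrative;
# `dlt` and `spark` are provided by the pipeline runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw supplier files ingested incrementally with Auto Loader")
def bronze_claims():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/healthdata/raw/claims/")   # hypothetical landing path
    )

@dlt.table(comment="Validated claims with basic quality rules enforced")
@dlt.expect_or_drop("valid_member_id", "member_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "claim_amount > 0")
def silver_claims():
    return (
        dlt.read_stream("bronze_claims")
        .withColumn("ingested_at", F.current_timestamp())
    )
```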

Automating Engineering with AI - LLMs in Metadata Driven Frameworks

The demand for data engineering keeps growing, but data teams are bored by repetitive tasks, stumped by growing complexity and endlessly harassed by an unrelenting need for speed. What if AI could take the heavy lifting off your hands? What if we make the move away from code-generation and into config-generation — how much more could we achieve? In this session, we’ll explore how AI is revolutionizing data engineering, turning pain points into innovation. Whether you’re grappling with manual schema generation or struggling to ensure data quality, this session offers practical solutions to help you work smarter, not harder. You’ll walk away with a good idea of where AI is going to disrupt the data engineering workload, some good tips around how to accelerate your own workflows and an impending sense of doom around the future of the industry!

Sponsored by: Anomalo | Reconciling IoT, Policy, and Insurer Data to Deliver Better Customer Discounts

As insurers increasingly leverage IoT data to personalize policy pricing, reconciling disparate datasets across devices, policies, and insurers becomes mission-critical. In this session, learn how Nationwide transitioned from prototype workflows in Dataiku to a hardened data stack on Databricks, enabling scalable data governance and high-impact analytics. Discover how the team orchestrates data reconciliation across Postgres, Oracle, and Databricks to align customer driving behavior with insurer and policy data—ensuring more accurate, fair discounts for policyholders. With Anomalo’s automated monitoring layered on top, Nationwide ensures data quality at scale while empowering business units to define custom logic for proactive stewardship. We’ll also look ahead to how these foundations are preparing the enterprise for unstructured data and GenAI initiatives.

Sponsored by: Soda Data Inc. | Clean Energy, Clean Data: How Data Quality Powers Decarbonization

Drawing on BDO Canada’s deep expertise in the electricity sector, this session explores how clean energy innovation can be accelerated through a holistic approach to data quality. Discover BDO’s practical framework for implementing data quality and rebuilding trust in data through a structured, scalable approach. BDO will share a real-world example of monitoring data at scale—from high-level executive dashboards to the details of daily ETL and ELT pipelines. Learn how they leveraged Soda’s data observability platform to unlock near-instant insights, and how they moved beyond legacy validation pipelines with built-in checks across their production Lakehouse. Whether you're a business leader defining data strategy or a data engineer building robust data products, this talk connects the strategic value of clean data with actionable techniques to make it a reality.

Lakeflow in Production: CI/CD, Testing and Monitoring at Scale

Building robust, production-grade data pipelines goes beyond writing transformation logic — it requires rigorous testing, version control, automated CI/CD workflows and a clear separation between development and production. In this talk, we’ll demonstrate how Lakeflow, paired with Databricks Asset Bundles (DABs), enables Git-based workflows, automated deployments and comprehensive testing for data engineering projects. We’ll share best practices for unit testing, CI/CD automation, data quality monitoring and environment-specific configurations. Additionally, we’ll explore observability techniques and performance tuning to ensure your pipelines are scalable, maintainable and production-ready.
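
As one hedged example of the unit-testing layer mentioned above (the transformation and fixture are hypothetical, not the speakers’ code), keeping transformations as plain functions makes them easy to exercise with pytest on a local SparkSession before any Lakeflow deployment.

```python
# Hypothetical unit test: keep transformations as plain functions so they can run on a
# local SparkSession in CI, independent of any Lakeflow job or Databricks workspace.
import pytest
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def add_revenue(df: DataFrame) -> DataFrame:
    """Business logic under test: revenue = quantity * unit_price."""
    return df.withColumn("revenue", F.col("quantity") * F.col("unit_price"))

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_add_revenue(spark):
    df = spark.createDataFrame([(2, 5.0), (3, 1.5)], ["quantity", "unit_price"])

    result = {r["quantity"]: r["revenue"] for r in add_revenue(df).collect()}

    assert result == {2: 10.0, 3: 4.5}
```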

Sponsored by: Oxylabs | Web Scraping and AI: A Quiet but Critical Partnership

Behind every powerful AI system lies a critical foundation: fresh, high-quality web data. This session explores the symbiotic relationship between web scraping and artificial intelligence that's transforming how technical teams build data-intensive applications. We'll showcase how this partnership enables crucial use cases: analyzing trends, forecasting behaviors, and enhancing AI models with real-time information. Technical challenges that once made web scraping prohibitively complex are now being solved through the very AI systems they help create. You'll learn how machine learning revolutionizes web data collection, making previously impossible scraping projects both feasible and maintainable, while dramatically reducing engineering overhead and improving data quality. Join us to explore this quiet but critical partnership that's powering the next generation of AI applications.

How the Texas Rangers Use a Unified Data Platform to Drive World Class Baseball Analytics

Don't miss this session where we demonstrate how the Texas Rangers baseball team is staying one step ahead of the competition by going back to the basics. After the Rangers implemented a modern data strategy with Databricks and won the 2023 World Series, the rest of the league quickly followed suit. Now more than ever, data and AI are a central pillar of every baseball team's strategy, driving profound insights into player performance and game dynamics. With a 'fundamentals win games,' back-to-basics focus, join us as we explain our commitment to world-class data quality, engineering, and MLOps by taking full advantage of the Databricks Data Intelligence Platform. From system tables to federated querying, find out how the Rangers use every tool at their disposal to stay one step ahead in the hyper-competitive world of baseball.

Sponsored by: KPMG | Enhancing Regulatory Compliance through Data Quality and Traceability

In highly regulated industries like financial services, maintaining data quality is an ongoing challenge. Reactive measures often fail to prevent regulatory penalties, causing inaccuracies in reporting and inefficiencies due to poor data visibility. Regulators closely examine the origins and accuracy of reporting calculations to ensure compliance, so a robust system for data quality and lineage is crucial. Organizations are using Databricks to proactively improve data quality through rules-based and AI/ML-driven methods. This fosters complete visibility across IT, data management, and business operations, facilitating rapid issue resolution and continuous data quality enhancement. The outcome is quicker, more accurate, and more transparent financial reporting. We will detail a framework for data observability and offer practical examples of implementing quality checks throughout the data lifecycle, focusing specifically on creating data pipelines for regulatory reporting.
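
As a hedged illustration of the rules-based side of such a framework (rule names, sample data, and the audit schema are assumptions, not KPMG’s implementation), each check can be evaluated and its outcome recorded so every reported figure stays traceable to the rules it passed.

```python
# Hypothetical rules-based data quality check with an audit trail, sketched in PySpark.
# Rule names, the sample data, and the audit schema are illustrative assumptions.
from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

trades = spark.createDataFrame(
    [("T1", "USD", 100.0), ("T2", None, 250.0), ("T3", "EUR", -5.0)],
    ["trade_id", "currency", "notional"],
)

# Each rule is a SQL predicate that every row must satisfy.
rules = {
    "currency_present": "currency IS NOT NULL",
    "notional_positive": "notional > 0",
}

total = trades.count()
audit_rows = []
for rule_name, predicate in rules.items():
    failed = trades.filter(f"NOT ({predicate})").count()
    audit_rows.append(
        (rule_name, predicate, total, failed, datetime.now(timezone.utc).isoformat())
    )

audit = spark.createDataFrame(
    audit_rows, ["rule", "predicate", "rows_checked", "rows_failed", "checked_at"]
)
audit.show(truncate=False)  # in production, append to an audit table for lineage and traceability
```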