In this episode, we explore how public media can build scalable, transparent, and mission-driven data infrastructure - with Emilie Nenquin, Head of Data & Intelligence at VRT, and Stijn Dolphen, Team Lead & Analytics Engineer at Dataroots. Emilie shares how she architected VRT’s data transformation from the ground up: evolving from basic analytics to a full-stack data organization with 45+ specialists across engineering, analytics, AI, and user management. We dive into the strategic shift from Adobe Analytics to Snowplow, and what it means to own your data pipeline in a public service context. Stijn joins to unpack the technical decisions behind VRT’s current architecture, including real-time event tracking, metadata modeling, and integrating 70+ digital platforms into a unified ecosystem.
💡 Topics include:
Designing data infrastructure for transparency and scale
Building a modular, privacy-conscious analytics stack
Metadata governance across fragmented content systems
Recommendation systems for discovery, not just engagement
The circular relationship between data quality and AI performance
Applying machine learning in service of cultural and civic missions
Whether you're leading a data team, rethinking your stack, or exploring ethical AI in media, this episode offers practical insights into how data strategy can align with public value.
Topic: Data Quality (537 tagged items)
Top Events
In a world where every decision is data-driven, quality is the difference between insight and noise. At Picnic, Europe’s fastest-growing online supermarket, data quality is not just a technical challenge—it’s the invisible engine behind an AI-driven supply chain that delivers millions of groceries on time, with zero food waste. In this keynote we will share how a relentless focus on data quality fuels innovation, drives operational excellence, and reshapes the customer experience. Discover how Picnic blends automation, culture, and cutting-edge tooling to turn data into a strategic asset—delivering impact at scale. Get inspired to rethink data quality as a catalyst for transformation—not just hygiene.
The line between human work and AI capabilities is blurring in today's business environment. AI agents are now handling autonomous tasks across customer support, data management, and sales prospecting with increasing sophistication. But how do you effectively integrate these agents into your existing workflows? What's the right approach to training and evaluating AI team members? With data quality being the foundation of successful AI implementation, how can you ensure your systems have the unified context they need while maintaining proper governance and privacy controls? Karen Ng is the Head of Product at HubSpot, where she leads product strategy, design, and partnerships with the mission of helping millions of organizations grow better. Since joining in 2022, she has driven innovation across Smart CRM, Operations Hub, Breeze Intelligence, and the developer ecosystem, with a focus on unifying structured and unstructured data to make AI truly useful for businesses. Known for leading with clarity and “AI speed,” she pushes HubSpot to stay ahead of disruption and empower customers to thrive. Previously, Karen held senior product leadership roles at Common Room, Google, and Microsoft. At Common Room, she built the product and data science teams from the ground up, while at Google she directed Android’s product frameworks like Jetpack and Jetpack Compose. During more than a decade at Microsoft, she helped shape the company’s .NET strategy and launched the Roslyn compiler platform. Recognized as a Product 50 Winner and recipient of the PM Award for Technical Strategist, she also advises and invests in high-growth technology companies. In the episode, Richie and Karen explore the evolving role of AI agents in sales, marketing, and support, the distinction between chatbots, co-pilots, and autonomous agents, the importance of data quality and context, the concept of hybrid teams, the future of AI-driven business processes, and much more. Links Mentioned in the Show:
Hubspot Breeze Agents
Connect with Karen
Webinar: Pricing & Monetizing Your AI Products with Sam Lee, VP of Pricing Strategy & Product Operations at HubSpot
Related Episode: Enterprise AI Agents with Jun Qian, VP of Generative AI Services at Oracle
Rewatch RADAR AI
New to DataCamp? Learn on the go using the DataCamp mobile app
Empower your business with world-class data and AI skills with DataCamp for business
There's no better time to become a data engineer. And acing the AWS Certified Data Engineer Associate (DEA-C01) exam will help you tackle the demands of modern data engineering and secure your place in the technology-driven future. Authors Sakti Mishra, Dylan Qu, and Anusha Challa equip you with the knowledge and sought-after skills necessary to effectively manage data and excel in your career. Whether you're a data engineer, data analyst, or machine learning engineer, you'll discover in-depth guidance, practical exercises, sample questions, and expert advice you need to leverage AWS services effectively and achieve certification. By reading, you'll learn how to:
Ingest, transform, and orchestrate data pipelines effectively
Select the ideal data store, design efficient data models, and manage data lifecycles
Analyze data rigorously and maintain high data quality standards
Implement robust authentication, authorization, and data governance protocols
Prepare thoroughly for the DEA-C01 exam with targeted strategies and practices
The relationship between AI and data professionals is evolving rapidly, creating both opportunities and challenges. As companies embrace AI-first strategies and experiment with AI agents, the skills needed to thrive in data roles are fundamentally changing. Is coding knowledge still essential when AI can generate code for you? How important is domain expertise when automated tools can handle technical tasks? With data engineering and analytics engineering gaining prominence, the focus is shifting toward ensuring data quality and building reliable pipelines. But where does the human fit in this increasingly automated landscape, and how can you position yourself to thrive amid these transformations? Megan Bowers is Senior Content Manager, Digital Customer Success at Alteryx, where she develops resources for the Maveryx Community. She writes technical blogs and hosts the Alter Everything podcast, spotlighting best practices from data professionals across the industry. Before joining Alteryx, Megan worked as a data analyst at Stanley Black & Decker, where she led ETL and dashboarding projects and trained teams on Alteryx and Power BI. Her transition into data began after earning a degree in Industrial Engineering and completing a data science bootcamp. Today, she focuses on creating accessible, high-impact content that helps data practitioners grow. Her favorite topics include switching career paths after college, building a professional brand on LinkedIn, writing technical blogs people actually want to read, and best practices in Alteryx, data visualization, and data storytelling. Presented by Alteryx, Alter Everything serves as a podcast dedicated to the culture of data science and analytics, showcasing insights from industry specialists. Covering a range of subjects from the use of machine learning to various analytics career trajectories, and all that lies between, Alter Everything stands as a celebration of the critical role of data literacy in a data-driven world. In the episode, Richie and Megan explore the impact of AI on job functions, the rise of AI agents in business, and the importance of domain knowledge and process analytics in data roles. They also discuss strategies for staying updated in the fast-paced world of AI and data science, and much more. Links Mentioned in the Show:
Alter Everything
Connect with Megan
Skill Track: Alteryx Fundamentals
Related Episode: Scaling Enterprise Analytics with Libby Duane Adams, Chief Advocacy Officer and Co-Founder of Alteryx
Rewatch RADAR AI
New to DataCamp? Learn on the go using the DataCamp mobile app
Empower your business with world-class data and AI skills with DataCamp for business
Petr Janda discusses insights and lessons from building a data quality agent.
What if the reason your data strategy is failing has nothing to do with technology—and everything to do with storytelling? In this episode of Data Unchained, host Molly Presley sits down with Scott Taylor, “The Data Whisperer,” to unpack why data leaders keep missing the mark when trying to engage the business. With decades of experience helping global enterprises understand the value of foundational data, Scott makes a powerful case for why “data quality” doesn’t sell, why AI without clean inputs is doomed, and why storytelling—not tooling—is the missing link between data teams and the C-suite. If you’ve ever struggled to get executive buy-in or make your data projects stick, this conversation will change the way you frame your value. Scott Taylor: https://www.linkedin.com/in/scottmztaylor/ Cyberpunk by jiglr | https://soundcloud.com/jiglrmusic Music promoted by https://www.free-stock-music.com Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/deed.en_US Hosted on Acast. See acast.com/privacy for more information.
Many organisations fall into the trap of tackling data quality reactively, producing single-use reports and expecting users to rerun them like a game of whack-a-mole. In this session, we’ll challenge that mindset and explore smarter, scalable alternatives.
Through practical examples and accessible tools, I’ll demonstrate how to embed a data quality culture that flags issues early, routes them to the right people, and tracks ongoing trends.
No fancy, expensive tools - just practical steps and hard-learned lessons on building a data quality culture and system that makes an impact.
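To make the pattern concrete, here is a minimal sketch of a check that records every result for trend tracking and routes failures to an owner. This is not the speaker's tooling; the table, column, threshold, and owner mapping are illustrative assumptions.

```python
# A minimal sketch: run a completeness check, append the result for trend tracking,
# and route failures to a named owner. Table, column, and owner names are illustrative.
import sqlite3
from datetime import datetime, timezone

OWNERS = {"orders": "sales-data-team@example.com"}  # hypothetical routing table

def notify(owner: str, table: str, column: str, ratio: float) -> None:
    # Stand-in for Slack/email routing to the right people.
    print(f"[DQ] {table}.{column} completeness {ratio:.1%} below threshold -> {owner}")

def run_completeness_check(conn, table: str, column: str, threshold: float = 0.95) -> bool:
    # Table/column come from check configuration, not user input, so f-string SQL is fine here.
    total, filled = conn.execute(f"SELECT COUNT(*), COUNT({column}) FROM {table}").fetchone()
    ratio = filled / total if total else 1.0
    passed = ratio >= threshold
    # Append every result, not just failures, so trends can be charted over time.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dq_results (checked_at TEXT, tbl TEXT, metric TEXT, value REAL, passed INTEGER)"
    )
    conn.execute(
        "INSERT INTO dq_results VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), table, f"completeness:{column}", ratio, int(passed)),
    )
    if not passed:
        notify(OWNERS.get(table, "data-quality@example.com"), table, column, ratio)
    return passed

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, None), (3, 11)])
    run_completeness_check(conn, "orders", "customer_id")
```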
Business intelligence has been transforming organizations for decades, yet many companies still struggle with widespread adoption. With less than 40% of employees in most organizations having access to BI tools, there's a significant 'information underclass' making decisions without data-driven insights. How can businesses bridge this gap and achieve true information democracy? While new technologies like generative AI and semantic layers offer promising solutions, the fundamentals of data quality and governance remain critical. What balance should organizations strike between investing in innovative tools and strengthening their data infrastructure? How can you ensure your business becomes a 'data athlete' capable of making hyper-decisive moves in an uncertain economic landscape? Howard Dresner is founder and Chief Research Officer at Dresner Advisory Services and a leading voice in Business Intelligence (BI), credited with coining the term “Business Intelligence” in 1989. He spent 13 years at Gartner as lead BI analyst, shaping its research agenda and earning recognition as Analyst of the Year, Distinguished Analyst, and Gartner Fellow. He also led Gartner’s BI conferences in Europe and North America. Before founding Dresner Advisory in 2007, Howard was Chief Strategy Officer at Hyperion Solutions, where he drove strategy and thought leadership, helping position Hyperion as a leader in performance management prior to its acquisition by Oracle. Howard has written two books, The Performance Management Revolution – Business Results through Insight and Action, and Profiles in Performance – Business Intelligence Journeys and the Roadmap for Change, both published by John Wiley & Sons. In the episode, Richie and Howard explore the surprisingly low penetration of business intelligence in organizations, the importance of data governance and infrastructure, the evolving role of AI in BI, the strategic initiatives driving BI usage, and much more. Links Mentioned in the Show:
Dresner Advisory Services
Howard’s Book - Profiles in Performance: Business Intelligence Journeys and the Roadmap for Change
Connect with Howard
Skill Track: Power BI Fundamentals
Related Episode: The Next Generation of Business Intelligence with Colin Zima, CEO at Omni
Rewatch RADAR AI
New to DataCamp? Learn on the go using the DataCamp mobile app
Empower your business with world-class data and AI skills with DataCamp for business
Training Large Language Models (LLMs) requires processing massive-scale datasets efficiently. Traditional CPU-based data pipelines struggle to keep up with the exponential growth of data, leading to bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable Python-based framework designed to curate high-quality datasets for LLMs efficiently. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII redaction—improving data quality and training efficiency.
We will showcase real-world examples demonstrating how multi-node, multi-GPU processing scales dataset preparation to 100+ TB of data, achieving up to 7% improvement in LLM downstream tasks. Attendees will gain insights into configurable pipelines that enhance training workflows, with a focus on reproducibility, scalability, and open-source integration within Python's scientific computing ecosystem.
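For a flavour of what GPU-accelerated curation steps look like, here is a minimal cuDF sketch of exact deduplication plus a simple length filter. This is not the NeMo Curator API, just an illustration of the RAPIDS-style pattern the talk builds on; it assumes a CUDA GPU with cuDF installed, and the threshold is made up.

```python
# Illustrative only: a cuDF sketch of two common curation steps (exact deduplication
# and a heuristic length filter). Requires a CUDA GPU with RAPIDS cuDF installed.
import cudf

def curate(df: cudf.DataFrame) -> cudf.DataFrame:
    # 1. Exact deduplication on the raw text column.
    df = df.drop_duplicates(subset=["text"])
    # 2. Heuristic quality filter: drop very short documents (threshold is illustrative).
    df = df[df["text"].str.len() >= 200]
    return df.reset_index(drop=True)

if __name__ == "__main__":
    docs = cudf.DataFrame({"text": ["short", "a" * 500, "a" * 500, "b" * 300]})
    # Keeps one copy of the 500-character document and the 300-character document.
    print(curate(docs))
```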
Summary In this episode of the Data Engineering Podcast Effie Baram, a leader in foundational data engineering at Two Sigma, talks about the complexities and innovations in data engineering within the finance sector. She discusses the critical role of data at Two Sigma, balancing data quality with delivery speed, and the socio-technical challenges of building a foundational data platform that supports research and operational needs while maintaining regulatory compliance and data quality. Effie also shares insights into treating data as code, leveraging modern data warehouses, and the evolving role of data engineers in a rapidly changing technological landscape.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
Your host is Tobias Macey and today I'm interviewing Effie Baram about data engineering in the finance sector.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the role of data in the context of Two Sigma?
What are some of the key characteristics of the types of data sources that you work with?
Your role is leading "foundational data engineering" at Two Sigma. Can you unpack that title and how it shapes the ways that you think about what you build?
How does the concept of "foundational data" influence the ways that the business thinks about the organizational patterns around data?
Given the regulatory environment around finance, how does that impact the ways that you think about the "what" and "how" of the data that you deliver to data consumers?
Being the foundational team for data use at Two Sigma, how have you approached the design and architecture of your technical systems?
How do you think about the boundaries between your responsibilities and the rest of the organization?
What are the design patterns that you have found most helpful in empowering data consumers to build on top of your work?
What are some of the elements of sociotechnical friction that have been most challenging to address?
What are the most interesting, innovative, or unexpected ways that you have seen the ideas around "foundational data" applied in your organization?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with financial data?
When is a foundational data team the wrong approach?
What do you have planned for the future of your platform design?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
2Sigma
Reliability Engineering
SLA == Service-Level Agreement
Airflow
Parquet File Format
BigQuery
Snowflake
dbt
Gemini Assist
MCP == Model Context Protocol
dtrace
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Approaches to maintaining data quality amidst AI-driven analytics.
What if your Airflow tasks could understand natural language AND adapt to schema changes automatically, while maintaining the deterministic, observable workflows we rely on? This talk introduces practical patterns for AI-native orchestration that preserve Airflow’s strengths while adding intelligence where it matters most. Through a real-world example, we’ll demonstrate AI-powered tasks that detect schema drift across multi-cloud systems and perform context-aware data quality checks that go beyond simple validation—understanding business rules, detecting anomalies, and generating validation queries from prompts like “check data quality across regions.” All within static DAG structures you can test and debug normally. We’ll show how AI becomes a first-class citizen by combining Airflow’s features, assets for schema context, Human-in-the-Loop for approvals, and AssetWatchers for automated triggers, with engines such as Apache DataFusion for high-performance query execution and support for cross-cloud data processing with unified access to multiple storage formats. These patterns apply directly to schema validation and similar cases where natural language can simplify complex operations. This isn’t about bolting AI onto Airflow. It’s about evolving how we build workflows, from brittle rules to intelligent adaptation, while keeping everything testable, auditable, and production-ready.
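A rough sketch of the "AI task inside a static DAG" idea follows, assuming the Airflow 2.x TaskFlow API (2.4+ for the schedule argument). The LLM call is stubbed with a hypothetical generate_validation_sql helper and the schema is made up; the point is that the DAG shape stays deterministic while one task's logic is prompt-driven.

```python
# A minimal sketch: one task derives a validation query from a natural-language prompt
# (LLM call stubbed out), the next task executes it deterministically. Assumes Airflow 2.x.
from datetime import datetime

from airflow.decorators import dag, task

def generate_validation_sql(prompt: str, schema: dict) -> str:
    # Hypothetical helper; in the pattern described it would call an LLM with the prompt
    # plus schema context and return a validation query.
    cols = ", ".join(schema)
    return f"SELECT {cols} FROM sales WHERE region IS NULL  -- derived from: {prompt!r}"

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False, tags=["dq-sketch"])
def ai_assisted_quality_check():
    @task
    def build_query() -> str:
        schema = {"order_id": "INT", "region": "STRING", "amount": "FLOAT"}  # illustrative
        return generate_validation_sql("check data quality across regions", schema)

    @task
    def run_query(sql: str) -> None:
        # Deterministic, observable execution step; a real pipeline would run the SQL here.
        print(f"Executing validation query:\n{sql}")

    run_query(build_query())

ai_assisted_quality_check()
```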
I will talk about how Apache Airflow is used in the healthcare sector with the integration of LLMs to enhance efficiency. Healthcare generates vast volumes of unstructured data daily, from clinical notes and patient intake forms to chatbot conversations and telehealth reports. Medical teams struggle to keep up, leading to delays in triage and missed critical symptoms. This session explores how Apache Airflow can be the backbone of an automated healthcare triage system powered by Large Language Models (LLMs). I’ll demonstrate how I designed and implemented an Airflow DAG orchestration pipeline that automates the ingestion, processing, and analysis of patient data from diverse sources in real time. Airflow schedules and coordinates data extraction, preprocessing, LLM-based symptom extraction, and urgency classification, and finally routes actionable insights to healthcare professionals. The session will focus on the following:
Managing complex workflows in healthcare data pipelines
Safely integrating LLM inference calls into Airflow tasks
Designing human-in-the-loop checkpoints for ethical AI usage
Monitoring workflow health and data quality with Airflow.
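As a sketch of how such a pipeline might be laid out (not the speaker's implementation), the DAG below chains ingestion, a stubbed LLM symptom-extraction step, urgency classification, and routing. Task names, the sample note, and the urgency rule are illustrative assumptions.

```python
# A sketch of the stage layout described above, with the LLM inference step stubbed out.
from datetime import datetime
from typing import Dict, List

from airflow.decorators import dag, task

@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False, tags=["triage-sketch"])
def healthcare_triage():
    @task
    def ingest() -> List[str]:
        # Stand-in for pulling clinical notes, intake forms, and chat transcripts.
        return ["Patient reports chest pain and shortness of breath since this morning."]

    @task
    def extract_symptoms(notes: List[str]) -> List[Dict]:
        # Stand-in for an LLM inference call returning structured symptoms per note.
        return [{"note": n, "symptoms": ["chest pain", "shortness of breath"]} for n in notes]

    @task
    def classify_urgency(records: List[Dict]) -> List[Dict]:
        urgent_symptoms = {"chest pain", "shortness of breath"}  # illustrative rule
        for record in records:
            record["urgency"] = "high" if urgent_symptoms & set(record["symptoms"]) else "routine"
        return records

    @task
    def route(records: List[Dict]) -> None:
        # In a real pipeline this would notify the appropriate care team.
        for record in records:
            print(f"Routing {record['urgency']}-urgency case with symptoms {record['symptoms']}")

    route(classify_urgency(extract_symptoms(ingest())))

healthcare_triage()
```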
As modern data ecosystems grow in complexity, ensuring transparency, discoverability, and governance in data workflows becomes critical. Apache Airflow, a powerful workflow orchestration tool, enables data engineers to build scalable pipelines, but without proper visibility into data lineage, ownership, and quality, teams risk operating in a black box. In this talk, we will explore how integrating Airflow with a data catalog can bring clarity and transparency to data workflows. We’ll discuss how metadata-driven orchestration enhances data governance, enables lineage tracking, and improves collaboration across teams. Through real-world use cases, we will demonstrate how Airflow can automate metadata collection, update data catalogs dynamically, and ensure data quality at every stage of the pipeline. Attendees will walk away with practical strategies for implementing a transparent data workflow that fosters trust, efficiency, and compliance in their data infrastructure.
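One way this can look in practice, as a hedged sketch rather than a specific catalog integration: each task declares the Airflow Dataset it produces (Airflow 2.4+), and a small helper posts run metadata to a catalog endpoint. The catalog URL, payload shape, and dataset URI are hypothetical.

```python
# A sketch of metadata-driven orchestration: the task declares its output dataset and
# pushes basic run metadata to a catalog. Endpoint and payload are hypothetical.
from datetime import datetime, timezone

import requests
from airflow.datasets import Dataset
from airflow.decorators import dag, task

CATALOG_URL = "https://catalog.example.com/api/runs"  # hypothetical catalog endpoint
ORDERS_CLEAN = Dataset("s3://lake/clean/orders")  # illustrative dataset URI

def publish_metadata(dataset: Dataset, row_count: int) -> None:
    # Minimal lineage/quality payload; a real integration would use the catalog's own SDK.
    requests.post(
        CATALOG_URL,
        json={
            "dataset": dataset.uri,
            "row_count": row_count,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        },
        timeout=10,
    )

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["catalog-sketch"])
def orders_with_catalog_updates():
    @task(outlets=[ORDERS_CLEAN])
    def clean_orders() -> int:
        row_count = 42_000  # placeholder for the real transformation's output
        publish_metadata(ORDERS_CLEAN, row_count)
        return row_count

    clean_orders()

orders_with_catalog_updates()
```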
Tekmetric is the largest cloud-based auto shop management system in the United States. We process vast amounts of data from various integrations with internal and external systems. Data quality and governance are crucial for both our internal operations and the success of our customers. We leverage multi-step data processing pipelines using AWS services and Airflow. While we utilize traditional data pipeline workflows to manage and move data, we go beyond standard orchestration. After data is processed, we apply tailored quality checks for schema validation, record completeness, freshness, duplication, and more. In this talk, we’ll explore how Airflow allows us to enhance data observability. We’ll discuss how Airflow’s flexibility enables seamless integration and monitoring across different teams and datasets, ensuring reliable and accurate data at every stage. This session will highlight how Tekmetric uses data quality governance and observability practices to drive business success through trusted data.
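As an illustration of the kind of post-processing checks described (not Tekmetric's code), here is a small sketch of freshness and duplication checks. Column names and thresholds are assumptions; in practice each check would run as its own Airflow task so failures surface per dimension.

```python
# Minimal freshness and duplication checks of the sort run after data processing.
from datetime import datetime, timedelta, timezone
from typing import List

import pandas as pd

def check_freshness(df: pd.DataFrame, ts_col: str, max_lag: timedelta) -> bool:
    # Freshness: the newest record must be no older than max_lag.
    latest = pd.to_datetime(df[ts_col]).max()
    return datetime.now(timezone.utc) - latest.to_pydatetime() <= max_lag

def check_duplicates(df: pd.DataFrame, key_cols: List[str]) -> bool:
    # Duplication: business keys must be unique after processing.
    return not df.duplicated(subset=key_cols).any()

if __name__ == "__main__":
    repairs = pd.DataFrame({
        "repair_order_id": [1, 2, 2],  # duplicate key on purpose
        "updated_at": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-02"], utc=True),
    })
    print("fresh within 24h:", check_freshness(repairs, "updated_at", timedelta(hours=24)))
    print("keys unique:", check_duplicates(repairs, ["repair_order_id"]))
```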
We have a similar pattern of DAGs running for different data quality dimensions like accuracy, timeliness, and completeness. Building these by hand over and over would mean duplicating code and introducing human error through copy-paste, or asking people to write the same code again. To solve this, we do a few things:
Run DAGs via DagFactory, dynamically generating them from a short YAML spec describing the steps we want in our DQ checks.
Hide this behind a UI hooked into a GitHub PR-open step: the user provides a few inputs or selects from dropdowns, and a YAML DAG spec is generated for them.
This highlights DagFactory's potential to hide Airflow Python code from users, making DAG authoring accessible to data analysts and business intelligence teams as well as software engineers, while reducing human error. YAML is the perfect format for generating code and opening a PR, and DagFactory is the perfect fit for that. All of this runs in GCP Cloud Composer.
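For orientation, here is a sketch of the loading side of this pattern, based on dag-factory's documented usage; treat the module calls, the path, and the YAML keys in the comment as assumptions to verify against the version you run.

```python
# Expected YAML shape (illustrative), committed by the UI via a GitHub PR:
#
#   dq_accuracy_orders:
#     default_args:
#       owner: data-quality
#       start_date: 2024-01-01
#     schedule_interval: "@daily"
#     tasks:
#       run_accuracy_check:
#         operator: airflow.operators.bash.BashOperator
#         bash_command: "python -m dq.checks --dimension accuracy --table orders"
import dagfactory

# One loader module in the DAGs folder turns every committed YAML spec into a DAG.
CONFIG_FILE = "/home/airflow/gcs/dags/dq_configs/dq_accuracy_orders.yml"  # illustrative Composer path

dag_factory = dagfactory.DagFactory(CONFIG_FILE)
dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())
```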
This session showcases Okta’s innovative approach to data pipeline orchestration with dbt and Airflow, including how we’ve implemented dynamically generated Airflow DAGs based on dbt’s dependency graph. This allows us to enforce strict data quality standards by automatically executing downstream model tests before upstream model deployments, effectively preventing error cascades. The entire CI/CD pipeline, from dbt model changes to production DAG deployment, is fully automated. The result? Accelerated development cycles, reduced operational overhead, and bulletproof data reliability.
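This is not Okta's implementation, but a common minimal pattern for the same idea: parse dbt's manifest.json and generate one Airflow task per model, wiring dependencies from the graph. The manifest path and the use of dbt build (which runs tests alongside models) are assumptions.

```python
# Generate Airflow tasks from dbt's dependency graph (manifest.json).
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

MANIFEST_PATH = "/opt/airflow/dbt/target/manifest.json"  # produced by `dbt compile`

with open(MANIFEST_PATH) as f:
    manifest = json.load(f)

# Keep only dbt models; sources, seeds, and tests are skipped in this sketch.
models = {
    node_id: node
    for node_id, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
}

with DAG("dbt_dependency_graph", start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False) as dag:
    tasks = {
        node_id: BashOperator(
            task_id=node["name"],
            # `dbt build` runs the model and its tests together, so a failing test
            # blocks downstream models from running.
            bash_command=f"dbt build --select {node['name']}",
        )
        for node_id, node in models.items()
    }
    # Wire Airflow dependencies from dbt's dependency graph.
    for node_id, node in models.items():
        for upstream_id in node["depends_on"]["nodes"]:
            if upstream_id in tasks:
                tasks[upstream_id] >> tasks[node_id]
```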
This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
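A small sketch of the idea (not the library described in the talk): a Pydantic v2 model stands in for a Glue table's schema, so one definition can both generate mock rows for SQL unit tests and validate production rows. Field names and values are illustrative.

```python
# One model definition used for both mock data generation and row validation.
from datetime import date
from decimal import Decimal
from typing import Dict, List

from pydantic import BaseModel, ValidationError

class OrdersRow(BaseModel):
    # Stand-in for a Glue table schema definition.
    order_id: int
    customer_id: int
    order_date: date
    amount: Decimal

def mock_rows(n: int = 3) -> List[Dict]:
    # Deterministic mock rows to inject into the base table before running the SQL under test.
    return [
        OrdersRow(
            order_id=i,
            customer_id=100 + i,
            order_date=date(2024, 1, i + 1),
            amount=Decimal("19.99"),
        ).model_dump()
        for i in range(n)
    ]

def row_is_valid(raw: Dict) -> bool:
    # The same model doubles as a production data quality check.
    try:
        OrdersRow(**raw)
        return True
    except ValidationError:
        return False

if __name__ == "__main__":
    print(mock_rows(2))
    print(row_is_valid({"order_id": "not-an-int", "customer_id": 1,
                        "order_date": "2024-01-01", "amount": "10.00"}))  # False
```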
Ever seen a DAG go rogue and deploy itself? Or try to time travel back to 1999? Join us for a light-hearted yet painfully relatable look at how not to scale your Airflow deployment to avoid chaos and debugging nightmares. We’ll cover the classics: hardcoded secrets, unbounded retries (hello, immortal task!), and the infamous spaghetti DAG where 200 tasks are lovingly connected by hand and no one dares open the Airflow UI anymore. If you’ve ever used datetime.now() in your DAG definition and watched your backfills implode, this talk is for you. From the BashOperator that became sentient to the XCom that tried to pass a whole Pandas DataFrame and the key to your mother’s house, we’ll walk through real-world bloopers with practical takeaways. You’ll learn why overusing PythonOperator is a recipe for mess, how not to use sensors unless you enjoy resource starvation, and why scheduling in local timezones is basically asking for a daylight savings time horror story. Other highlights include: Over-provisioning resources in KubernetesPodOperator: many teams allocate excessive memory/CPU “just in case”, leading to cluster contention and resource waste. Dynamic task mapping gone wild: 10,000 mapped tasks later… the scheduler is still crying. SLAs used as data quality guarantees: creating alerts so noisy, nobody listens. Design-free DAGs: no docs, no comments, no idea why a task has a 3-day timeout. Finally, we’ll round it out with some dos and don’ts: using environment variables, avoiding memory-hungry monolith DAGs, skipping global imports, and not allocating 10x more memory “just in case.” Whether you’re new to Airflow or battle-hardened from a thousand failed backfills, come learn how to scale your pipelines without losing your mind (or your cluster).
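To make one of the bloopers concrete: a dynamic start_date is re-evaluated on every DAG parse, so intervals and backfills misbehave. The fix is a fixed start_date and explicit catchup, as in this minimal example (recent Airflow 2.x assumed; DAG and task names are illustrative).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Anti-pattern: a moving target that changes every time the scheduler parses the file.
# bad_dag = DAG("backfill_roulette", start_date=datetime.now(), schedule="@daily")

# Fix: a fixed start_date (ideally in UTC) and explicit catchup behaviour.
with DAG("well_behaved", start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False) as dag:
    EmptyOperator(task_id="noop")
```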