talk-data.com

Topic: Data Quality

Tags: data_management · data_cleansing · data_validation · 537 tagged

Activity Trend: 82 peak/qtr, 2020-Q1 to 2026-Q1

Activities: 537 activities · Newest first

Summary: In this episode of the Data Engineering Podcast, Chakravarthy Kotaru talks about scaling data operations through standardized platform offerings. From his roots as an Oracle developer to leading the data platform at a major online travel company, Chakravarthy shares insights on managing diverse database technologies and providing databases as a service to streamline operations. He explains how his team has transitioned from DevOps to a platform engineering approach, centralizing expertise and automating repetitive tasks with AWS Service Catalog. Join them as they discuss the challenges of migrating legacy systems, integrating AI and ML for automation, and the importance of organizational buy-in in driving data platform success.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.

Your host is Tobias Macey and today I'm interviewing Chakri Kotaru about scaling successful data operations through standardized platform offerings.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the different ways that you have seen teams you work with fail due to lack of structure and opinionated design?
- Why NoSQL?
- Pairing different styles of NoSQL for different problems
- Useful patterns for each NoSQL style (document, column family, graph, etc.)
- Challenges in platform automation and scaling edge cases
- What challenges do you anticipate as a result of the new pressures from AI applications?
- What are the most interesting, innovative, or unexpected ways that you have seen platform engineering practices applied to data systems?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform engineering?
- When is NoSQL the wrong choice?
- What do you have planned for the future of platform principles for enabling data teams/data applications?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- Riak
- DynamoDB
- SQL Server
- Cassandra
- ScyllaDB
- CAP Theorem
- Terraform
- AWS Service Catalog
- Blog Post

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
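The episode doesn't publish code, but a minimal sketch of the databases-as-a-service idea it describes, requesting a database product through AWS Service Catalog, might look like this from the consuming team's side. The product ID, artifact ID, and parameter names below are hypothetical stand-ins for whatever a real platform team publishes in its portfolio.

```python
import boto3

# Hypothetical identifiers: replace with the product and provisioning
# artifact IDs published in your own Service Catalog portfolio.
PRODUCT_ID = "prod-abc123"
ARTIFACT_ID = "pa-def456"

client = boto3.client("servicecatalog", region_name="us-east-1")

# Provision a "PostgreSQL database" product for a requesting team.
response = client.provision_product(
    ProductId=PRODUCT_ID,
    ProvisioningArtifactId=ARTIFACT_ID,
    ProvisionedProductName="payments-team-postgres",
    ProvisioningParameters=[
        {"Key": "InstanceClass", "Value": "db.r6g.large"},
        {"Key": "Environment", "Value": "staging"},
    ],
)
record_id = response["RecordDetail"]["RecordId"]

# Poll the provisioning record to see whether the request succeeded.
status = client.describe_record(Id=record_id)["RecordDetail"]["Status"]
print(record_id, status)
```

The point of the pattern is that the requesting team never touches Terraform or RDS directly: the platform team encodes its expertise once in the catalog product, and consumers get a self-service, pre-approved path.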

Summary: In this episode of the Data Engineering Podcast, Tulika Bhatt, a senior software engineer at Netflix, talks about her experiences with large-scale data processing and the future of data engineering technologies. Tulika shares her journey into the data engineering field, discussing her work at BlackRock and Verizon before joining Netflix, and explains the challenges and innovations involved in managing Netflix's impression data for personalization and user experience. She highlights the importance of balancing off-the-shelf solutions with custom-built systems using technologies like Spark, Flink, and Iceberg, and delves into the complexities of ensuring data quality and observability in high-speed environments, including robust alerting strategies and semantic data auditing.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Tulika Bhatt about her experiences working on large-scale data processing and her insights on the future trajectory of the supporting technologies.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the ways that operating at large scale changes the ways that you need to think about the design of data systems?
- When dealing with small-scale data systems it can be feasible to have manual processes. What are the elements of large-scale data systems that demand automation?
- How can those large-scale automation principles be down-scaled to the systems that the rest of the world is operating?
- A perennial problem in data engineering is that of data quality. The past 4 years have seen significant growth in the number of tools and practices available for automating the validation and verification of data. In your experience working with high-volume data flows, what are the elements of data validation that are still unsolved?
- Generative AI has taken the world by storm over the past couple of years. How has that changed the ways that you approach your daily work?
- What do you see as the future realities of working with data across various axes of large scale, real-time, etc.?
- What are the most interesting, innovative, or unexpected ways that you have seen solutions to large-scale data management designed?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data management across axes of scale?
- What are the ways that you are thinking about the future trajectory of your work?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- BlackRock
- Spark
- Flink
- Kafka
- Cassandra
- RocksDB
- Netflix Maestro workflow orchestrator
- PagerDuty
- Iceberg

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
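The show notes don't include code, but a minimal PySpark sketch of the kind of volume and semantic checks the episode describes for impression data might look like the following. The table name, thresholds, and print-based alerting are hypothetical stand-ins; a production pipeline would page through something like PagerDuty instead.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("impression-audit").getOrCreate()

# Hypothetical table of playback impressions, partitioned by hour.
impressions = spark.read.table("prod.analytics.impressions")

hour = "2025-06-09T14"
batch = impressions.where(F.col("event_hour") == hour)

# Volume check: alert if the hourly row count drops far below normal.
row_count = batch.count()
EXPECTED_MIN_ROWS = 50_000_000  # hypothetical baseline for this pipeline
if row_count < EXPECTED_MIN_ROWS:
    print(f"ALERT: only {row_count} rows for {hour}")  # stand-in for paging

# Semantic audit: impressions should never reference a null or empty title id.
bad_rows = batch.where(F.col("title_id").isNull() | (F.col("title_id") == ""))
null_rate = bad_rows.count() / max(row_count, 1)
if null_rate > 0.001:  # hypothetical tolerance
    print(f"ALERT: title_id null/empty rate {null_rate:.4%} for {hour}")
```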

Lloyds Banking Group is on a bold mission to become the UK’s biggest FinTech. As the bank transforms, trusted data is key to driving innovation, compliance, and better customer outcomes. In this session, Kiran Bal, Head of Data Quality, shares how her team is embedding quality, ownership, and governance across the enterprise. Learn how they’re aligning with the CFO and business leadership through actionable metrics, stakeholder engagement, and a federated approach to trusted data at scale.

Traditional approaches and thinking around data quality are out of date and not sufficient in the era of AI. Data, analytics, and AI leaders will need to reconsider their approach to data quality, going beyond the traditional six data quality dimensions. This session will help data leaders learn to think about data quality in a holistic way that supports making data AI-ready.

Metadata, data quality, and data observability tools provide significant capabilities to ensure good data for your BI and AI initiatives. Metadata tools help discover and inventory your data assets. Data quality tools help business users manage their data at the source by setting rules and policies. Data observability tools give organizations integrated visibility into the health of their data, data pipelines, and data landscape. Together, these tools help organizations lay a good foundation in data management for BI and AI initiatives.
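As a rough illustration of what "setting rules and policies" can mean in practice, here is a minimal sketch of declarative quality rules evaluated against a toy pandas table. The dataset, rule names, and thresholds are hypothetical, not taken from any particular tool.

```python
import pandas as pd

# Toy source table with deliberate quality problems.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "amount": [25.0, -5.0, 40.0, None],
        "country": ["US", "DE", "XX", "FR"],
    }
)

# Each rule: a name plus a predicate that marks the *valid* rows.
rules = {
    "amount_is_positive": lambda df: df["amount"] > 0,
    "amount_not_null": lambda df: df["amount"].notna(),
    "country_is_known": lambda df: df["country"].isin(["US", "DE", "FR"]),
}

# Evaluate every rule and report how many rows violate it.
for name, predicate in rules.items():
    valid = predicate(orders)
    failures = int((~valid.fillna(False)).sum())
    print(f"{name}: {failures} failing row(s)")
```

Real data quality tools layer scheduling, lineage, and alerting on top, but the core idea is the same: rules are data, not code buried in pipelines, so business users can own them.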

Accelerating AI use cases demands strong data governance, and many organizations struggle to manage complex, growing data volumes effectively. This session explores essential strategies for building a solid data governance foundation. Learn how organizations are overcoming common data governance obstacles, like data silos and inconsistent rules, to achieve measurable gains in data quality and efficiency. Through real-world examples, discover how unified data platforms can simplify data discovery, classification, and policy enforcement, leading to faster, data-driven decisions and reduced risk.

As we enter 2025, the evolution of agentic architectures—AI agents capable of autonomous decision-making—will hinge on one critical factor: data quality. High-quality, reliable data is the foundation for AI readiness. This session explores the interplay between data quality, data observability, and agentic AI, highlighting DQLabs' approach to a more autonomous platform. Discover how AI-driven automation enhances data accuracy and reliability, reduces cost and manual effort, and prepares organizations for the agentic era with scalable, self-optimizing systems built for AI success.
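As a rough sketch of the anomaly detection idea behind metrics observability (not DQLabs' actual algorithm), the snippet below flags a daily KPI value that drifts several deviations from its recent history using a rolling z-score. The series and thresholds are hypothetical.

```python
import pandas as pd

# Hypothetical daily KPI; day 7 collapses unexpectedly.
kpi = pd.Series(
    [102, 99, 101, 103, 100, 98, 54, 101],
    index=pd.date_range("2025-01-01", periods=8, freq="D"),
)

window = 5
# Baseline statistics from the *preceding* window (shift excludes today).
rolling_mean = kpi.rolling(window, min_periods=window).mean().shift(1)
rolling_std = kpi.rolling(window, min_periods=window).std().shift(1)

# z-score of each day relative to its own recent history.
z = (kpi - rolling_mean) / rolling_std
anomalies = kpi[z.abs() > 3]
print(anomalies)  # flags 2025-01-07, the day the KPI dropped to 54
```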

Dive into the symbiotic relationship between data and Artificial Intelligence in this comprehensive session. Explore how robust data foundations are critical for developing effective AI systems and how AI, in turn, refines and enhances data quality. Gain actionable insights into transforming raw data into intelligent solutions and leveraging AI to drive business innovation.

Three out of four companies are betting big on AI – but most are digging on shifting ground. In this $100 billion gold rush, none of these investments will pay off without data quality and strong governance – and that remains a challenge for many organizations. Not every enterprise has a solid data governance practice and maturity models vary widely. As a result, investments in innovation initiatives are at risk of failure. What are the most important data management issues to prioritize? See how your organization measures up and get ahead of the curve with Actian.

The roles within AI engineering are as diverse as the challenges they tackle. From integrating models into larger systems to ensuring data quality, the day-to-day work of AI professionals is anything but routine. How do you navigate the complexities of deploying AI applications? What are the key steps from prototype to production? For those looking to refine their processes, understanding the full lifecycle of AI development is essential. Let's delve into the intricacies of AI engineering and the strategies that lead to successful implementation.

Maxime Labonne is a Senior Staff Machine Learning Scientist at Liquid AI, serving as the head of post-training. He holds a Ph.D. in Machine Learning from the Polytechnic Institute of Paris and is recognized as a Google Developer Expert in AI/ML. An active blogger, he has made significant contributions to the open-source community, including the LLM Course on GitHub, tools such as LLM AutoEval, and several state-of-the-art models like NeuralBeagle and Phixtral. He is the author of the best-selling book “Hands-On Graph Neural Networks Using Python,” published by Packt.

Paul-Emil Iusztin designs and implements modular, scalable, and production-ready ML systems for startups worldwide. He has extensive experience putting AI and generative AI into production. Previously, Paul was a Senior Machine Learning Engineer at Metaphysic.ai and a Machine Learning Lead at Core.ai. He is a co-author of The LLM Engineer's Handbook, a best seller in the GenAI space.

In the episode, Richie, Maxime, and Paul explore misconceptions in AI application development, the intricacies of fine-tuning versus few-shot prompting, the limitations of current frameworks, the roles of AI engineers, the importance of planning and evaluation, the challenges of deployment, the future of AI integration, and much more.

Links Mentioned in the Show:
- Maxime’s LLM Course on HuggingFace
- Maxime and Paul’s Code Alongs on DataCamp
- Decoding ML on Substack
- Connect with Maxime and Paul
- Skill Track: AI Fundamentals
- Related Episode: Building Multi-Modal AI Applications with Russ d'Sa, CEO & Co-founder of LiveKit
- Rewatch sessions from RADAR: Skills Edition
- New to DataCamp? Learn on the go using the DataCamp mobile app
- Empower your business with world-class data and AI skills with DataCamp for business

Sarah McKenna joins me to chat about all things web scraping. We discuss its applications, the evolution of alternative data, and AI's impact on the industry. We also discuss privacy concerns, the challenges of bot blocking, and the importance of data quality. Sarah shares ideas on how to get started with web scraping and the ethical considerations surrounding copyright and data collection.
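For a sense of what "getting started with web scraping" looks like mechanically, here is a minimal sketch using requests and BeautifulSoup. The URL and CSS selectors are hypothetical, and, in line with the ethics discussion in the episode, a real scraper should also honor robots.txt and the site's terms of service.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"  # hypothetical listing page

# Identify yourself and fail fast on errors rather than hammering the site.
response = requests.get(
    URL, headers={"User-Agent": "research-bot/0.1"}, timeout=10
)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Turn repeated page elements into rows of structured "alternative data".
for card in soup.select("div.product"):  # hypothetical selector
    name = card.select_one("h2")
    price = card.select_one("span.price")
    if name and price:
        print(name.get_text(strip=True), price.get_text(strip=True))
```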

Data Engineering Design Patterns

Data projects are an intrinsic part of an organization's technical ecosystem, but data engineers in many companies continue to work on problems that others have already solved. This hands-on guide shows you how to provide valuable data by focusing on various aspects of data engineering, including data ingestion, data quality, idempotency, and more.

Author Bartosz Konieczny guides you through the process of building reliable end-to-end data engineering projects, from data ingestion to data observability, focusing on data engineering design patterns that solve common business problems in a secure and storage-optimized manner. Each pattern includes a user-facing description of the problem, solutions, and consequences that place the pattern into the context of real-life scenarios. Throughout this journey, you'll use open source data tools and public cloud services to apply each pattern.

You'll learn:
- Challenges data engineers face and their impact on data systems
- How these challenges relate to data system components
- Useful applications of data engineering patterns
- How to identify and fix issues with your current data components
- Technology-agnostic solutions to new and existing data projects, with open source implementation examples

Bartosz Konieczny is a freelance data engineer who's been coding since 2010. He's held various senior hands-on positions that allowed him to work on many data engineering problems in batch and stream processing.
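As one concrete example of the idempotency theme the book covers (a sketch under generic assumptions, not the book's own code), here is a delete-then-insert load that is safe to re-run for the same partition. The table and data are hypothetical; sqlite3 from the standard library stands in for a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (dt TEXT, store TEXT, total REAL)")

def load_partition(conn, dt, rows):
    """Delete-then-insert inside one transaction: safe to re-run."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM daily_sales WHERE dt = ?", (dt,))
        conn.executemany(
            "INSERT INTO daily_sales (dt, store, total) VALUES (?, ?, ?)",
            [(dt, store, total) for store, total in rows],
        )

# Running the same load twice leaves exactly one copy of the partition,
# so a retried or replayed job never duplicates data.
load_partition(conn, "2025-06-09", [("nyc", 120.0), ("sfo", 80.0)])
load_partition(conn, "2025-06-09", [("nyc", 120.0), ("sfo", 80.0)])
count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(count)  # 2, not 4
```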

Join Aritzia, Anomalo, and Google Cloud to learn how Aritzia automates data quality across 500+ sources in BigQuery. Discover how integrating Anomalo with Google Cloud helps proactively detect anomalies, maintain data integrity, and build trust in analytics. Explore how automation reduces time spent troubleshooting and increases time spent creating business value through reliable, AI-enhanced analytics.

Discover how to integrate AI and Gen AI capabilities to resolve data quality issues, streamline the deployment processes of a data platform, and empower data teams to accelerate the development of customized data products. By automating data product and pipeline creation, infrastructure deployment, data quality, and PII controls, you can reduce engineering effort by 30-40% and develop products three times faster. Learn how this approach has helped clients create data products faster and more cost-efficiently across various industries.


Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

In this episode, host Murilo is joined by returning guest Paolo, Data Management Team Lead at dataroots, for a deep dive into the often-overlooked but rapidly evolving domain of unstructured data quality. Tune in for a field guide to navigating documents, images, and embeddings without losing your sanity.

What we unpack:
- Data management basics: Metadata, ownership, and why Excel isn’t everything.
- Structured vs unstructured data: How the wild west of PDFs, images, and audio is redefining quality.
- Data quality challenges for LLMs: From apples and pears to rogue chatbots with “legally binding” hallucinations.
- Practical checks for document hygiene: Versioning, ownership, embedding similarity, and tagging strategies.
- Retrieval-Augmented Generation (RAG): When ChatGPT meets your HR policies and things get weird.
- Monitoring and governance: Building systems that flag rot before your chatbot gives out 2017 vacation rules.
- Tooling and gaps: Where open source is doing well—and where we’re still duct-taping workflows.
- Real-world inspirations: A look at how QuantumBlack (McKinsey) is tackling similar issues with their AI for DQ framework.
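As a rough sketch of the embedding-similarity hygiene check mentioned above, the snippet below flags document pairs whose vectors are suspiciously close, a hint that one may be a stale version of the other. The vectors are toy stand-ins; in practice they would come from an embedding model.

```python
import numpy as np

# Toy document embeddings (real ones come from an embedding model).
docs = {
    "leave_policy_2017.pdf": np.array([0.90, 0.10, 0.40]),
    "leave_policy_2024.pdf": np.array([0.89, 0.12, 0.41]),  # near-duplicate
    "expense_guide.pdf": np.array([0.05, 0.95, 0.20]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

names = list(docs)
THRESHOLD = 0.99  # hypothetical: "probably two versions of one document"

# Compare every pair and surface likely duplicates for human review.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        sim = cosine(docs[names[i]], docs[names[j]])
        if sim > THRESHOLD:
            print(f"possible stale/duplicate pair: {names[i]} ~ {names[j]} ({sim:.3f})")
```

A governance process would then decide which version is authoritative and tag or retire the other, which is exactly the kind of "rot flagging" the episode describes for RAG systems.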

This scalable, AI-powered data quality solution requires minimal coding and maintenance. It learns about your data products to improve data quality across multiple dimensions. The framework uses BigQuery, BQML, Dataform, and Looker to deliver a comprehensive and automated Data Quality solution with a unified user experience for both data platform owners and business users.
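The session doesn't share its implementation, but a minimal sketch of one building block such a framework automates is a null-rate check issued to BigQuery. The project, table, column, and threshold below are hypothetical; a full framework would generate checks like this from metadata about each data product.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials/project

# Hypothetical check: share of orders with no customer attached.
sql = """
SELECT
  COUNTIF(customer_id IS NULL) / COUNT(*) AS null_rate
FROM `my-project.sales.orders`
"""

null_rate = next(iter(client.query(sql).result())).null_rate

MAX_NULL_RATE = 0.01  # hypothetical policy threshold
if null_rate > MAX_NULL_RATE:
    print(f"data quality check failed: customer_id null rate = {null_rate:.2%}")
```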

Learn how the legendary retail brand accelerated AI adoption by building a GCP data platform and conducting an enterprise-wide data transformation program. We'll demonstrate how BigQuery and other GCP services liberated the data from legacy environments and became the foundation for an AI Factory initiative. We will highlight the challenges and solutions for data quality control, enterprise-wide stakeholder alignment, and business user engagement on the road to data value realization.
