talk-data.com

Topic

Data Contracts

data_governance data_quality data_engineering

26 tagged

Activity Trend

Peak of 14 activities per quarter, 2020-Q1 to 2026-Q1

Activities

26 activities · Newest first

In this episode, I sit down with Mark Freeman and Chad Sanderson (Gable.ai) to discuss the release of their new O’Reilly book, Data Contracts: Developing Production-Grade Pipelines at Scale. They dive deep into the chaotic journey of writing a 350-page book while simultaneously building a venture-backed startup. The conversation takes a sharp turn into the evolution of Data Contracts. While the concept started with data engineers, Mark and Chad explain why they pivoted their focus to software engineers. They argue that software engineers are facing a "Data Lake Moment," prioritizing speed over craftsmanship and producing massive technical debt and integration failures as a result.

Gable: https://www.gable.ai/

Summary In this episode of the Data Engineering Podcast we welcome back Nick Schrock, CTO and founder of Dagster Labs, to discuss the evolving landscape of data engineering in the age of AI. As AI begins to impact data platforms and the role of data engineers, Nick shares his insights on how it will ultimately enhance productivity and expand software engineering's scope. He delves into the current state of AI adoption, the importance of maintaining core data engineering principles, and the need for human oversight when leveraging AI tools effectively. Nick also introduces Dagster's new components feature, designed to modularize and standardize data transformation processes, making it easier for teams to collaborate and integrate AI into their workflows. Join in to explore the future of data engineering, the potential for AI to abstract away complexity, and the importance of open standards in preventing walled gardens in the tech industry.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.

This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

This is a pharmaceutical Ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.

Your host is Tobias Macey and today I'm interviewing Nick Schrock about lowering the barrier to entry for data platform consumers.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving your summary of the impact that the tidal wave of AI has had on data platforms and data teams?
For anyone who hasn't heard of Dagster, can you give a quick summary of the project?
What are the notable changes in the Dagster project in the past year?
What are the ecosystem pressures that have shaped the ways that you think about the features and trajectory of Dagster as a project/product/community?
In your recent release you introduced "components", which is a substantial change in how you enable teams to collaborate on data problems. What was the motivating factor in that work and how does it change the ways that organizations engage with their data?
Tension between being flexible and extensible vs. opinionated and constrained
Increased dependency on orchestration with LLM use cases
Reducing the barrier to contribution for data platforms/pipelines
Bringing application engineers into the mix
Challenges of meeting users/teams where they are (languages, platform investments, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen teams applying the Components pattern?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the latest iterations of Dagster?
When is Dagster the wrong choice?
What do you have planned for the future of Dagster?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Dagster+ Episode, Dagster Components Slide Deck, The Rise Of Medium Code, Lakehouse Architecture, Iceberg, Dagster Components, Pydantic Models, Kubernetes, Dagster Pipes, Ruby on Rails, dbt, Sling, Fivetran, Temporal, MCP == Model Context Protocol

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary In this episode of the Data Engineering Podcast Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?
Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
What are the foundational architectural modifications that you had to make to enable those capabilities?
For the vector storage and indexing, what modifications did you have to make to Iceberg?
What was your reasoning for not using a format like Lance?
For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
When is Starburst/lakehouse the wrong choice for a given AI use case?
What do you have planned for the future of AI on Starburst?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Starburst, Podcast Episode, AWS Athena, MCP == Model Context Protocol, LLM Tool Use, Vector Embeddings, RAG == Retrieval Augmented Generation, AI Engineering Podcast Episode, Starburst Data Products, Lance, LanceDB, Parquet, ORC, pgvector, Starburst Icehouse

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary In this episode of the Data Engineering Podcast Mai-Lan Tomsen Bukovec, Vice President of Technology at AWS, talks about the evolution of Amazon S3 and its profound impact on data architecture. From her work on compute systems to leading the development and operations of S3, Mai-Lan shares insights on how S3 has become a foundational element in modern data systems, enabling scalable and cost-effective data lakes since its launch alongside Hadoop in 2006. She discusses the architectural patterns enabled by S3, the importance of metadata in data management, and how S3's evolution has been driven by customer needs, leading to innovations like strong consistency and S3 Tables.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Mai-Lan Tomsen Bukovec about the evolution of S3 and how it has transformed data architecture.

Interview
Introduction
How did you get involved in the area of data management?
Most everyone listening knows what S3 is, but can you start by giving a quick summary of what roles it plays in the data ecosystem?
What are the major generational epochs in S3, with a particular focus on analytical/ML data systems?
The first major driver of analytical usage for S3 was the Hadoop ecosystem. What are the other elements of the data ecosystem that helped shape the product direction of S3?
Data storage and retrieval have been core primitives in computing since its inception. What are the characteristics of S3 and all of its copycats that led to such a difference in architectural patterns vs. other shared data technologies? (e.g. NFS, Gluster, Ceph, Samba, etc.)
How does the unified pool of storage that is exemplified by S3 help to blur the boundaries between application data, analytical data, and ML/AI data?
What are some of the default patterns for storage and retrieval across those three buckets that can lead to anti-patterns which add friction when trying to unify those use cases?
The age of AI is leading to a massive potential for unlocking unstructured data, for which S3 has been a massive dumping ground over the years. How is that changing the ways that your customers think about the value of the assets that they have been hoarding for so long? What new architectural patterns is that generating?
What are the most interesting, innovative, or unexpected ways that you have seen S3 used for analytical/ML/AI applications?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on S3?
When is S3 the wrong choice?
What do you have planned for the future of S3?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
AWS S3, Kinesis, Kafka, SQS, EMR, Drupal, WordPress, Netflix Blog on S3 as a Source of Truth, Hadoop, MapReduce, NASA JPL, FINRA == Financial Industry Regulatory Authority, S3 Object Versioning, S3 Cross Region, S3 Tables, Iceberg, Parquet, AWS KMS, Iceberg REST, DuckDB, NFS == Network File System, Samba, GlusterFS, Ceph, MinIO, S3 Metadata, Photoshop Generative Fill, Adobe Firefly, TurboTax AI Assistant, AWS Access Analyzer, Data Products, S3 Access Point, AWS Nova Models, LexisNexis Protege, S3 Intelligent Tiering, S3 Principal Engineering Tenets

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Patrick Thompson, co-founder of Clarify and former co-founder of Iteratively (acquired by Amplitude), joined Yuliia and Dumky to discuss the evolution from data quality to decision quality. Patrick shares his experience building data contract solutions at Atlassian and later developing analytics tracking tools. Patrick challenges the assumption that AI will eliminate the need for structured data. He argues that while LLMs excel at understanding unstructured data, businesses still need deterministic systems for automation and decision-making. Patrick shares insights on why enforcing data quality at the source remains critical, even in an AI-first world, and explains his shift from analytics to CRM while maintaining focus on customer data unification and business impact over technical perfectionism. Tune in!

Summary In this episode of the Data Engineering Podcast Chakravarthy Kotaru talks about scaling data operations through standardized platform offerings. From his roots as an Oracle developer to leading the data platform at a major online travel company, Chakravarthy shares insights on managing diverse database technologies and providing databases as a service to streamline operations. He explains how his team has transitioned from DevOps to a platform engineering approach, centralizing expertise and automating repetitive tasks with AWS Service Catalog. Join them as they discuss the challenges of migrating legacy systems, integrating AI and ML for automation, and the importance of organizational buy-in in driving data platform success.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Chakri Kotaru about scaling successful data operations through standardized platform offerings.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the different ways that you have seen teams you work with fail due to lack of structure and opinionated design?
Why NoSQL?
Pairing different styles of NoSQL for different problems
Useful patterns for each NoSQL style (document, column family, graph, etc.)
Challenges in platform automation and scaling edge cases
What challenges do you anticipate as a result of the new pressures from AI applications?
What are the most interesting, innovative, or unexpected ways that you have seen platform engineering practices applied to data systems?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform engineering?
When is NoSQL the wrong choice?
What do you have planned for the future of platform principles for enabling data teams/data applications?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Riak, DynamoDB, SQL Server, Cassandra, ScyllaDB, CAP Theorem, Terraform, AWS Service Catalog, Blog Post

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In today's episode of the Data Engineering Central Podcast we talk about a few hot topics: AWS S3 Tables, Databricks raising money, whether Data Contracts are dead, and the Lake House Storage Format battle! It's a good one, buckle up!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Building a Data Mesh

The Data Product Management In Action podcast, brought to you by Soda and executive producer Scott Hirleman, is a platform for data product management practitioners to share insights and experiences. In Episode 23 of Data Product Management in Action, our host Frannie Helforoush is joined by Soheil Mirchi, a technical product manager. Soheil discusses his company's shift from a centralized data lake to a decentralized data mesh architecture. He outlines the three types of data products (source-aligned, aggregated, and customer-facing) and highlights the importance of data contracts and testing. Learn about strategies for measuring success through metrics and customer feedback, along with lessons on starting small and fostering data democratization. Tune in for essential insights on effective data management!

About our host Frannie Helforoush: Frannie's journey began as a software engineer and evolved into a strategic product manager. Now, as a data product manager, she leverages her expertise in both fields to create impactful solutions. Frannie thrives on making data accessible and actionable, driving product innovation, and ensuring product thinking is integral to data management. Connect with Frannie on LinkedIn.

About our guest Soheil Mirchi: Soheil is a Technical Product Manager at Temedica, a health insights company focused on transforming complex healthcare and pharmaceutical data into actionable insights. Leading a team of data engineers, scientists, and analysts, Soheil drives the development of cutting-edge data products while guiding the company's transition to a data mesh architecture. He is passionate about empowering teams with the autonomy to manage their own data products and believes in a collaborative approach to driving innovation in the health tech space. Connect with Soheil on LinkedIn.

All views and opinions expressed are those of the individuals and do not necessarily reflect their employers or anyone else. Join the conversation on LinkedIn. Apply to be a guest or nominate someone that you know. Do you love what you're listening to? Please rate and review the podcast, and share it with fellow practitioners you know. Your support helps us reach more listeners and continue providing valuable insights!

The Data Product Management In Action podcast, brought to you by Soda and executive producer Scott Hirleman, is a platform for data product management practitioners to share insights and experiences. We've released a special edition series of minisodes of our podcast. Recorded live at Data Connect 2024, our host Michael Toland engages in short, sweet, informative, and delightful conversations with five prevalent practitioners who are forging their way forward in data and technology.

About our host Michael Toland: Michael is a Product Management Coach and Consultant with Pathfinder Product, a Test Double Operation. Since 2016, Michael has worked on large-scale system modernizations and migration initiatives at Verizon. Outside his professional career, Michael serves as the Treasurer for the New Leaders Council, mentors with Venture for America, sings with the Columbus Symphony, and writes satire for his blog Dignified Product. He is excited to discuss data product management with the podcast audience. Connect with Michael on LinkedIn.

About our guest Jean-Georges Perrin: Jean-Georges “jgp” Perrin is the Chief Innovation Officer at AbeaData, where he focuses on developing cutting-edge data tooling. He chairs the Open Data Contract Standard (ODCS) at the Linux Foundation's Bitol project, co-founded the AIDA User Group, and has authored several influential books, including Implementing Data Mesh (O'Reilly) and Spark in Action, 2nd Edition (Manning). With over 25 years in IT, Jean-Georges is recognized as a Lifetime IBM Champion, a PayPal Champion, and a Data Mesh MVP. His expertise spans data engineering, governance, and the industrialization of data science. Outside of tech, he enjoys exploring Upstate New York and New England with his family. Connect with J-GP on LinkedIn.

All views and opinions expressed are those of the individuals and do not necessarily reflect their employers or anyone else. Join the conversation on LinkedIn. Apply to be a guest or nominate a practitioner. Do you love what you're listening to? Please rate and review the podcast, and share it with fellow practitioners you know. Your support helps us reach more listeners and continue providing valuable insights!

The Data Product Management In Action podcast, brought to you by Soda and executive producer Scott Hirleman, is a platform for data product management practitioners to share insights and experiences. We've released a special edition series of minisodes of our podcast. Recorded live at Data Connect 2024, our host Michael Toland engages in short, sweet, informative, and delightful conversations with five prevalent practitioners who are forging their way forward in data and technology. Recorded on Day 2 of Data Connect 2024, Michael sits down with Kim Theis, CEO of AbeaData, for her first appearance at both Data Connect 2024 and in Columbus, Ohio. Kim shares insights from her recent talk at the conference, where she focused on maximizing ROI from data-driven projects, a topic she has extensively researched. Since transitioning from her role at PayPal, Kim has been on a mission to help individuals and organizations better understand the journey toward adopting data products and the tangible benefits they can offer.

About our host Michael Toland: Michael is a Product Management Coach and Consultant with Pathfinder Product, a Test Double Operation. Since 2016, Michael has worked on large-scale system modernizations and migration initiatives at Verizon. Outside his professional career, Michael serves as the Treasurer for the New Leaders Council, mentors with Venture for America, sings with the Columbus Symphony, and writes satire for his blog Dignified Product. He is excited to discuss data product management with the podcast audience. Connect with Michael on LinkedIn.

About our guest Kim Theis: Kim is the CEO and co-founder of AbeaData, a company transforming how data is leveraged as a product and driving the success of AI initiatives across industries. With over 20 years of experience, Kim has a proven track record of delivering data innovation and excellence in various leadership roles. Before founding AbeaData, she served as the Head of Intelligence Automation at PayPal, where she reorganized the team to support a decentralized data framework, launched a data mesh in under a year, and led the team that pioneered the world's first open-source Data Contract. Her extensive experience also includes executive data consulting and technology leadership roles within Fortune 500 and FTSE 100 companies. Connect with Kim on LinkedIn.

All views and opinions expressed are those of the individuals and do not necessarily reflect their employers or anyone else. Join the conversation on LinkedIn. Apply to be a guest or nominate a practitioner. Do you love what you're listening to? Please rate and review the podcast, and share it with fellow practitioners you know. Your support helps us reach more listeners and continue providing valuable insights!

Summary Data contracts are both an enforcement mechanism for data quality and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!

Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe the scope and purpose of data contracts in the context of this conversation?
In what way(s) do they differ from data quality/data observability?
Data contracts are also known as the API for data, can you elaborate on this?
What are the types of guarantees and requirements that you can enforce with these data contracts?
What are some examples of constraints or guarantees that cannot be represented in these contracts?
Are data contracts related to the shift-left?
The obvious application of data contracts is in the context of pipeline execution flows, to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
How did you approach the design of the syntax and implementation for Soda's data contracts?
Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap in e.g. dbt, Great Expectations?
Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
When are data contracts the wrong choice?
What do you have planned for the future of data contracts?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Soda, Podcast Episode, JBoss, Data Contract, Airflow, Unit Testing, Integration Testing, OpenAPI, GraphQL, Circuit Breaker Pattern, SodaCL, Soda Data Contracts, Data Mesh, Great Expectations, dbt Unit Tests, Open Data Contracts, ODCS == Open Data Contract Standard, ODPS == Open Data Product Specification

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
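The "API for data" framing in this episode lends itself to a concrete illustration. Below is a minimal sketch in plain Python of how a contract can encode schema guarantees and act as a circuit breaker in a pipeline. It is illustrative only: the dataset and column names are invented, and it does not use Soda's actual contract syntax or tooling discussed in the episode.

```python
# A minimal, illustrative data contract in plain Python.
# Sketch only: dataset and column names are invented, and this is
# not Soda's contract syntax or engine.
from dataclasses import dataclass, field


@dataclass
class ColumnSpec:
    name: str
    dtype: type
    nullable: bool = False


@dataclass
class DataContract:
    """Schema guarantees that a producer promises to downstream consumers."""
    dataset: str
    columns: list[ColumnSpec] = field(default_factory=list)

    def validate(self, rows: list[dict]) -> list[str]:
        """Return a list of human-readable violations for a batch of rows."""
        errors: list[str] = []
        for i, row in enumerate(rows):
            for col in self.columns:
                value = row.get(col.name)
                if value is None:
                    if not col.nullable:
                        errors.append(f"row {i}: '{col.name}' is null but declared not nullable")
                elif not isinstance(value, col.dtype):
                    errors.append(
                        f"row {i}: '{col.name}' expected {col.dtype.__name__}, "
                        f"got {type(value).__name__}"
                    )
        return errors


contract = DataContract(
    dataset="orders",
    columns=[
        ColumnSpec("order_id", str),
        ColumnSpec("amount_usd", float),
        ColumnSpec("coupon_code", str, nullable=True),
    ],
)

batch = [
    {"order_id": "o-1", "amount_usd": 19.99, "coupon_code": None},
    {"order_id": "o-2", "amount_usd": "oops", "coupon_code": "SAVE10"},
]

violations = contract.validate(batch)
if violations:
    # Circuit-breaker behavior: halt the pipeline rather than let bad
    # data propagate to downstream consumers.
    raise RuntimeError("contract violations: " + "; ".join(violations))
```

Run against a batch containing a type violation, the check raises before the bad rows reach downstream consumers, which is the circuit-breaker behavior the episode describes.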

Jean-Georges Perrin is a serial startup founder, currently co-founder of AbeaData [https://abeadata.com/], and co-author of "Implementing Data Mesh." He championed PayPal's data contract project, which is now part of Bitol and the Linux Foundation. In this episode, JGP speaks about building and maintaining open-source data contract solutions using open standards. He shares a lot about why and how he came to it, and the challenges of maintaining it in a way that avoids appropriation of the solution. JGP discusses how they balance the interests of different groups in developing a community around open data contract standards. More importantly, he shares how data contracts can positively change the life of every data engineer.

Check out JGP's LinkedIn. Check out Bitol, Open Standards for Data Contracts, and become a contributor.

Andrew Jones, principal engineer at GoCardless, is the author of the book "Driving Data Quality with Data Contracts." During this session, we talked a lot about what a data platform is, who data platform engineers are, what it takes to make a data platform reliable, and, most importantly, how Andrew and his team managed to build a reliable platform at GoCardless. Sure enough, we touched a little on data contracts, their implementation, and the possibility of vendors doing the same as Andrew's team did.Andrew's LinkedIn - https://www.linkedin.com/in/andrewrhysjones/

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

In episode #41, titled “Regulations and Revelations: Rust Safety, ChatGPT Secrets, and Data Contracts”, we're thrilled to have Paolo Léonard joining us as we unpack a host of intriguing developments across the tech landscape:

In Rust We Trust? White House Office urges memory safety: a dive into the push for memory safety and what it means for programming languages like Python.
ChatGPT's Accidental Leak? Did OpenAI just accidentally leak the next big ChatGPT upgrade? Speculations on the upcoming enhancements and their impact on knowledge accessibility.
EU AI Act Adoption: EU Parliament officially adopts the AI Act. Exploring the landmark AI legislation and its broad effects, with a critical look at potential human rights concerns.
Meet Devin, the AI Engineer: exploring the capabilities and potential of the first AI software engineer.
Rye's New Stewardship: Astral takes stewardship of Rye, the next big thing in Python packaging, and the role of community in driving innovation, with discussions unfolding on GitHub.
Data Contract CLI: a look at data contracts and their importance in managing and understanding data across platforms.
AI and Academic Papers: the influence of AI on academic research, highlighted by this paper and this paper, and how it's reshaping the landscape of knowledge sharing.

Welcome to another engaging episode of Datatopics Unplugged, the podcast where tech and relaxation intersect. Today, we're excited to host two special guests, Paolo and Tim, who bring their unique perspectives to our cozy corner.

Guests of today:
Paolo: An enthusiast of fantasy and sci-fi reading, Paolo is on a personal mission to reduce his coffee consumption. He has a unique way of measuring his height, at 0.89 Sams tall. With over two and a half years of experience as a data engineer at dataroots, Paolo contributes a rich professional perspective. His hobbies extend to playing field hockey and a preference for the warmer summer season.
Tim: Occasionally known as Dr. Dunkenstein, Tim brings a mix of humor and insight. He measures his height at 0.87 Sams tall. As the Head of Bizdev, he prefers to steer clear of grand titles, revealing his views on hierarchical structures and monarchies.

Topics:
Biz Corner:
Kyutai: We delve into France's answer to OpenAI with Paolo Léonard, exploring the implications and future of Kyutai: https://techcrunch.com/2023/11/17/kyutai-is-an-french-ai-research-lab-with-a-330-million-budget-that-will-make-everything-open-source/
GPT-NL: A discussion led by Bart Smeets on the Netherlands' own open language model and its potential impact: https://www.computerweekly.com/news/366558412/Netherlands-starts-building-its-own-AI-language-model
Tech Corner:
Data Quality Insights: A blog post by Paolo on data quality vs. data validation. We'll explore when and why data quality is essential, and evaluate tools like dbt, soda, deequ, and great_expectations: https://dataroots.io/blog/state-of-data-quality-october-2023
Soda Data Contracts: An overview of the newly released OSS Data Contract Engine by Soda: https://docs.soda.io/soda/data-contracts.html
Food for Thought Corner:
Hare, a 100-Year Programming Language: Bart starts a discussion on the ambition of Hare to remain relevant for a century: https://harelang.org/blog/2023-11-08-100-year-language/

Join us for this mix of expert insights and light-hearted moments. Whether you're deeply embedded in the tech world or just dipping your toes in, this episode promises to be both informative and entertaining!

And, yes, there is a voucher: go to dataroots.io, navigate to the shop (top right), and use voucher code murilos_bargain_blast for a 25 EUR discount!

Uncover the transformative power of decentralising data with host Jason Foster and Nachiket Mehta, Head of Data and Analytics Engineering at Wayfair. They delve into Wayfair's journey from a centralised data approach to a decentralised model, the challenges they encountered, the value created, and how it has led to faster and more informed decision-making. Explore how Wayfair is utilising domain models, data contracts, and a platform-based approach to empower teams and enhance overall performance, and learn how decentralisation can revolutionise your organisation's data practices and drive business growth.

Podcast episode by Joe Reis (DeepLearning.AI)

Boring is back. As technology makes the lives of data engineers easier with respect to solving classical data problems, data engineers can now move to tackle "boring" problems like data contracts, semantics, and higher-level and value-add tasks. This also sets us up to tackle the next generation of data problems, namely integrating ML and AI into every business workflow. Boring is good.

If you find yourself in a situation where you don't have a college degree, I want you to know it is possible to land a data job.

In this episode, Teshawn Black shares his experience landing contract data roles at Google & LinkedIn despite having no certifications.

🌟 Join the data project club!

Use code “25OFF” to get 25% off (first 50 members).

📊 Come to my next free “How to Land Your First Data Job” training

🏫 Check out my 10-week data analytics bootcamp

Teshawn’s Links:

https://www.tiktok.com/@the6figureanalyst

https://app.mediakits.com/the6figureanalyst

Timestamps:

(6:29) - Providing value is KEY!
(10:49) - We don’t have tomorrow, so live today
(13:43) - Understand what company looks for
(17:05) - Learn to communicate effectively
(19:14) - Working contract jobs is a blessing
(20:17) - Always market yourself
(22:33) - Ask the budget first
(25:07) - Is job security really SECURE?
(27:35) - We are all expendable
(29:32) - Focus on data cleaning & data manipulation
(30:42) - Don’t skip Excel!

Connect with Avery:

📺 Subscribe on YouTube

🎙Listen to My Podcast

👔 Connect with me on LinkedIn

📸 Instagram

🎵 TikTok

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get:
✅ A discount on your enrollment
🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa

Welcome to today's podcast on data contracts for data leaders and executives. Data contracts are a critical component of data management and are essential for any organization that collects, processes, or analyzes data. This podcast will explore data contracts, their importance, and how data leaders and executives can implement them in their organizations. To begin with, let's define what we mean by data contracts. A data contract is a formal agreement between the data provider and the data consumer that specifies the terms and conditions under which the data will be shared, used, and protected. The data contract outlines the obligations and responsibilities of both parties. It clearly explains how the data will be managed, stored, and analyzed.
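To ground that definition, here is a loose sketch of the elements such an agreement might capture, written as a plain Python structure. Every field name below is invented for illustration; real standards such as the Open Data Contract Standard (ODCS) define their own normative schema.

```python
# Hypothetical sketch of the terms a data contract might record.
# All field names are invented for illustration; real standards such as
# the Open Data Contract Standard (ODCS) define their own schema.
orders_contract = {
    "dataset": "orders_daily",
    "provider": "checkout-service team",
    "consumer": "analytics team",
    "schema": {
        "order_id": {"type": "string", "nullable": False},
        "amount_usd": {"type": "decimal(10,2)", "nullable": False},
    },
    "sharing_terms": {
        "refresh_schedule": "daily by 06:00 UTC",
        "retention": "400 days",
        "allowed_use": ["reporting", "forecasting"],  # usage restrictions
    },
    "protection": {
        "pii_columns": [],  # none in this illustrative dataset
        "encryption_at_rest": True,
    },
    "obligations": {
        "provider": "announce breaking schema changes 30 days in advance",
        "consumer": "report quality issues through the agreed escalation channel",
    },
}
```

Whatever the concrete format (YAML documents are common in practice), the point is that the terms of sharing, use, and protection are written down explicitly, so both provider and consumer can be held to them and tooling can check them automatically.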