talk-data.com

Topic

Data Quality

data_management · data_cleansing · data_validation

537 tagged activities

Activity Trend

Peak of 82 activities per quarter, 2020-Q1 to 2026-Q1

Activities

537 activities · Newest first

Building Production-Ready Recommender Systems with Feature Stores

Recommender systems are highly prevalent in modern applications and services but are notoriously difficult to build and maintain. Organizations face challenges such as complex data dependencies, data leakage, and frequently changing data/models. These challenges are compounded when building, deploying, and maintaining ML pipelines spans data scientists and engineers. Feature stores help address many of the operational challenges associated with recommender systems.

In this talk, we explore:

  • Challenges of building recommender systems
  • Strategies for reducing latency, while balancing requirements for freshness
  • Challenges in mitigating data quality issues
  • Technical and organizational challenges feature stores solve
  • How to integrate Feast, an open-source feature store, into an existing recommender system to support production systems
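
As a taste of what that integration looks like, here is a minimal, hedged sketch of serving features from Feast's online store at recommendation time; the repository path, feature view, and feature names are illustrative assumptions, not material from the talk.

```python
# Minimal, hedged sketch of serving features at recommendation time with Feast.
# Assumes a feature repository with a "user_features" feature view keyed by
# "user_id" has already been defined and materialized; all names are illustrative.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing feature_store.yaml

# Fetch fresh feature values from the online store for the users being ranked.
online_features = store.get_online_features(
    features=[
        "user_features:purchase_count_7d",
        "user_features:last_category_viewed",
    ],
    entity_rows=[{"user_id": 1001}, {"user_id": 1002}],
).to_dict()

# The resulting dict (feature name -> list of values) feeds the ranking model,
# so training and serving share the same feature definitions.
print(online_features)
```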


Cleanlab: AI to Find and Fix Errors in ML Datasets

Real-world datasets have a large fraction of errors, which negatively impacts model quality and benchmarking. This talk presents Cleanlab, an open-source tool that addresses these issues using the latest research in data-centric AI. Cleanlab has been used to improve datasets at a number of Fortune 500 companies.

Ontological issues, invalid data points, and label errors are pervasive in datasets. Even gold-standard ML datasets contain, on average, 3.3% label errors (labelerrors.com). These errors degrade model quality, lead to incorrect conclusions about model performance, and result in suboptimal models being deployed.

We present the cleanlab open-source package (github.com/cleanlab/cleanlab) for finding and fixing data errors. We will walk through using Cleanlab to fix errors in a real-world dataset, with an end-to-end demo of how Cleanlab improves data and model performance.
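
For a flavor of that workflow, the sketch below uses cleanlab's find_label_issues on out-of-sample predicted probabilities; the synthetic data and simple model are placeholders, not the dataset from the demo.

```python
# Hedged sketch of cleanlab's label-issue detection on out-of-sample predicted
# probabilities; the synthetic data and simple model are placeholders, not the
# dataset from the demo.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

X = np.random.rand(1000, 5)                  # placeholder features
labels = np.random.randint(0, 3, size=1000)  # placeholder (possibly noisy) labels

# Out-of-sample predicted probabilities, e.g. via cross-validation.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)

# Indices of examples whose given label most likely disagrees with the data,
# worst first.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_indices)} likely label errors; top 10: {issue_indices[:10]}")
```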

Finally, we will show Cleanlab Studio, which provides a web interface for human-in-the-loop data quality control.


Enabling BI in a Lakehouse Environment: How Spark and Delta Can Help With Automating DWH Development

Traditional data warehouses typically struggle to handle large volumes of data and traffic, particularly unstructured data. In contrast, data lakes overcome these issues and have become the central hub for storing data. We outline how to enable Kimball-style BI data modelling in a Lakehouse environment.

We present how we built a Spark-based framework to modernize DWH development with a data lake as the central storage layer, ensuring high data quality and scalability. The framework has been implemented at over 15 enterprise data warehouses across Europe.

We show how classic data warehouse concepts such as surrogate, foreign, and business keys and SCD types 1 and 2 can be implemented with Spark and Delta Lake. Additionally, we share our experiences on how such a unified data modelling framework can bridge BI with modern-day use cases such as machine learning and real-time analytics. The session outlines the original challenges, the steps taken, and the technical hurdles we faced.
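
For illustration, here is a minimal, hedged sketch of one such concept, an SCD Type 2 upsert implemented with a Delta Lake MERGE plus an append. It is not the framework from the talk, and the table and column names (dwh.customer_dim, customer_id, address, is_current, start_date, end_date) are assumptions.

```python
# Hedged sketch of an SCD Type 2 upsert with Delta Lake, not the framework from
# the talk. Table and column names (dwh.customer_dim, customer_id, address,
# is_current, start_date, end_date) are illustrative assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
updates = spark.table("staging.customer_updates")    # incoming business-key rows
dim = DeltaTable.forName(spark, "dwh.customer_dim")  # existing dimension table

# Step 1: close out current rows whose tracked attribute changed.
(dim.alias("d")
    .merge(updates.alias("u"), "d.customer_id = u.customer_id AND d.is_current = true")
    .whenMatchedUpdate(
        condition="d.address <> u.address",           # change detection
        set={"is_current": "false", "end_date": "current_date()"},
    )
    .execute())

# Step 2: append new versions. After step 1, changed customers no longer have a
# current row, so they (and brand-new customers) show up as unmatched here.
current = spark.table("dwh.customer_dim").where("is_current = true")
changed_or_new = (
    updates.alias("u")
    .join(current.alias("d"), F.expr("u.customer_id = d.customer_id"), "left")
    .where("d.customer_id IS NULL")
    .selectExpr("u.*")
)
(changed_or_new
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").saveAsTable("dwh.customer_dim"))
```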


Discover Data Lakehouse With End-to-End Lineage

Data lineage is key for managing change, ensuring data quality, and implementing data governance in an organization. There are several use cases for data lineage:

  • Data Governance: For compliance and regulatory purposes, our customers are required to prove that the data and reports they submit came from a trusted and verified source. This typically means identifying the tables and datasets used in a report or dashboard and tracing the source of those tables and fields. Another governance scenario is understanding the spread of sensitive data within the lakehouse.
  • Data Discovery: Data analysts looking to self-serve and build their own analytics and models typically spend time exploring and understanding the data in their lakehouse. Lineage is a key piece of information that enhances the understanding and trustworthiness of the data the analyst plans to use.
  • Problem Identification: Data teams are often called on to resolve errors in analysts' dashboards and reports (“Why is the total number of widgets different in this report than in the one I built?”). This usually leads to an expensive forensic exercise by the data engineering team to understand the sources of the data and the transformations applied to it before it hits the report.
  • Change Management: It is not uncommon for data sources to change; a new source may stop delivering data, or a field in the source system may change its semantics. In this scenario the data engineering team wants to understand the downstream impact of the change: how many datasets and users will be affected. This helps them determine the impact of the change, manage user expectations, and address issues ahead of time.

In this talk, we describe in detail how we capture table and column lineage for Spark, Delta, and Unity Catalog for our customers, and how users can leverage data lineage to serve the use cases above.
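
As a small illustration of the downstream-impact use case, the sketch below queries lineage from a notebook, assuming Unity Catalog lineage system tables are enabled; the system.access.table_lineage schema and the table name in the filter are assumptions to verify against your workspace.

```python
# Hedged sketch: exploring downstream impact from a notebook, assuming Unity
# Catalog lineage system tables are enabled. The system.access.table_lineage
# schema and the table name in the filter are assumptions to verify against
# your workspace.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

downstream = spark.sql("""
    SELECT DISTINCT target_table_full_name, entity_type
    FROM system.access.table_lineage
    WHERE source_table_full_name = 'main.sales.orders_bronze'
      AND event_time >= date_sub(current_date(), 30)
""")
downstream.show(truncate=False)
```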


ÀLaSpark: Gousto's Recipe for Building Scalable PySpark Pipelines

Find out how Gousto develops its data pipelines at scale in a repeatable manner. At Gousto, we've developed Goustospark, a wrapper around PySpark that allows us to quickly and easily build data pipelines that are deployed into our Databricks environment.

This wrapper abstracts the repetitive components of every data pipeline, such as Spark configuration and metastore interactions. A developer can simply specify the blueprint of a pipeline before turning their attention to more pressing issues, such as data quality and data governance, whilst enjoying a high level of performance and reliability.

In this session we will take a deep dive into the design patterns we followed and some unique approaches we've taken to structuring pipelines, and show a live demo of implementing a new Spark streaming pipeline in Databricks from scratch. We will even share some example Python code and snippets to help you build your own.
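
The snippet below is a hypothetical sketch of the blueprint idea, not Gousto's actual ÀLaSpark code: a small base class owns the SparkSession and metastore I/O so a concrete pipeline only declares its sources, sink, and transform. All class and table names are invented for illustration.

```python
# Hypothetical sketch of the "blueprint" idea, not Gousto's actual ÀLaSpark
# code: a base class owns the SparkSession and metastore I/O so a concrete
# pipeline only declares its sources, sink, and transform. All names invented.
from abc import ABC, abstractmethod
from pyspark.sql import DataFrame, SparkSession


class PipelineBlueprint(ABC):
    sources: list = []   # input tables in the metastore
    sink: str = ""       # output table

    def __init__(self):
        self.spark = SparkSession.builder.getOrCreate()

    @abstractmethod
    def transform(self, inputs: dict) -> DataFrame:
        """Pure business logic: map input DataFrames to one output DataFrame."""

    def run(self) -> None:
        inputs = {name: self.spark.table(name) for name in self.sources}
        self.transform(inputs).write.mode("overwrite").saveAsTable(self.sink)


class OrdersDailySummary(PipelineBlueprint):
    sources = ["raw.orders"]
    sink = "analytics.orders_daily"

    def transform(self, inputs):
        return inputs["raw.orders"].groupBy("order_date").count()


if __name__ == "__main__":
    OrdersDailySummary().run()
```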


Best Practices of Maintaining High-Quality Data

Data sits at the heart of machine learning algorithms, and your model is only as good as your organization's data governance policies. The talk will cover multiple data governance frameworks and then dive deep into one of the key areas of data governance: data quality. The session will cover the significance of data quality, the definition of "good" data, and the key benefits and impact of maintaining high-quality data and processes. Rather than staying theoretical, the talk focuses on practical techniques and guidelines for maintaining data quality.
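
In that practical spirit, here is a minimal, hedged sketch of the kind of checks such guidelines often start with, expressed in PySpark; the table name, columns, and 1% tolerance are assumptions, not recommendations from the talk.

```python
# Illustrative sketch of basic data quality checks in PySpark (null rate,
# duplicate keys, invalid values); the table name, columns, and 1% tolerance
# are assumptions, not recommendations from the talk.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("analytics.customer_orders")

total = df.count()
checks = {
    "null_customer_id_pct": df.filter(F.col("customer_id").isNull()).count() / total,
    "duplicate_order_id_pct": 1 - df.select("order_id").distinct().count() / total,
    "negative_amount_pct": df.filter(F.col("order_amount") < 0).count() / total,
}

failed = {name: pct for name, pct in checks.items() if pct > 0.01}  # 1% tolerance
if failed:
    raise ValueError(f"Data quality checks failed: {failed}")
print("All data quality checks passed")
```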


Computational Data Governance at Scale

This talk is about the implementation of a Data Mesh at Fozzy Group. In our experience, the biggest bottleneck in the transition to Data Mesh is unclear data ownership. This and other issues can be solved with (federated) computational data governance. We will go through the process of building a global data lineage with 200k tables, 40k table replications, and 70k SQL stored procedures. We will also cover our lessons from building a data product culture with explicit and automated tracking of ownership and data quality. Fozzy Group is a holding company that comprises about 40 different businesses with 60k employees in various domains: retail, banking, insurance, logistics, agriculture, HoReCa, e-commerce, etc.


Git for Data Lakes—How lakeFS Scales Data Versioning to Billions of Objects

Modern data lake architectures rely on object storage as the single source of truth. We use it to store an increasing amount of data that is increasingly complex and interconnected. While scalable, these object stores provide few safety guarantees: they lack the semantics for atomicity, rollbacks, and reproducibility that data quality and resiliency require.

lakeFS, an open-source data version control system designed for data lakes, solves these problems by introducing concepts borrowed from Git: branching, committing, merging, and rolling back changes to data.

In this talk you'll learn about the challenges with using object storage for data lakes and how lakeFS enables you to solve them.

By the end of the session you'll understand how lakeFS scales its Git-like data model to petabytes of data across billions of objects, without affecting throughput or performance. We will also demo branching, writing data using Spark, and merging it on a billion-object repository.
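
For a flavor of that workflow, the sketch below writes to an isolated branch through lakeFS's S3 gateway, where object paths take the form s3a://&lt;repository&gt;/&lt;branch&gt;/&lt;path&gt;; the repository, branch, and endpoint values are illustrative, and branch creation and merging are assumed to happen via the lakeFS UI, API, or lakectl.

```python
# Hedged sketch of branch-isolated writes with Spark through lakeFS's S3
# gateway, where object paths take the form s3a://<repository>/<branch>/<path>.
# Repository, branch, and endpoint values are illustrative; credentials config
# and branch creation/merging (UI, API, or lakectl) are omitted for brevity.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")  # lakeFS gateway
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the production data from the main branch...
events = spark.read.parquet("s3a://analytics-repo/main/events/2024-01-01/")

# ...and write the cleaned version to an isolated branch instead of main.
cleaned = events.dropDuplicates(["event_id"]).where("event_ts IS NOT NULL")
cleaned.write.mode("overwrite").parquet("s3a://analytics-repo/ingest-fix/events/2024-01-01/")

# Once validation passes on the branch, merging it into main makes the change
# atomic; if checks fail, the branch is simply discarded.
```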


Data-Centric AI Development: From Big Data to Good Data (Andrew Ng)

Data-centric AI is a growing movement that shifts the engineering focus in AI systems from the model to the data. However, data-centric AI faces many open challenges, including measuring data quality, iterating on and engineering data as part of the ML project workflow, data management tools, crowdsourcing, data augmentation and data synthesis, and responsible AI. This talk names the key pillars of data-centric AI, identifies the trends in the data-centric AI movement, and sets a vision for taking ideas applied intuitively by a handful of experts and synthesizing them into tools that make their application systematic for all.


Summary There are extensive and valuable data sets that are available outside the bounds of your organization. Whether that data is public, paid, or scraped, it requires investment and upkeep to acquire and integrate it with your systems. Crux was built to reduce the total cost of acquisition and ownership for integrating external data, offering a fully managed service for delivering those data assets in the manner that best suits your infrastructure. In this episode Crux CTO Mark Etherington discusses the different costs involved in managing external data, how to think about the total return on investment for your data, and how the Crux platform is architected to reduce the toil involved in managing third-party data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey.

Summary Building a data platform is a journey, not a destination. Beyond the work of assembling a set of technologies and building integrations across them, there is also the work of growing and organizing a team that can support and benefit from that platform. In this episode Inbar Yogev and Lior Winner share the journey that they and their teams at Riskified have been on for their data platform. They also discuss how they have established a guild system for training and supporting data professionals in the organization.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Inbar Yogev and Lior Winner about the data platform that the team at Riskified is building to power their fraud management service.

Interview

Introduction How did

Summary The ecosystem for data tools has been going through rapid and constant evolution over the past several years. These technological shifts have brought about corresponding changes in data and platform architectures for managing data and analytical workflows. In this episode Colleen Tartow shares her insights into the motivating factors and benefits of the most prominent patterns that are in the popular narrative; data mesh and the modern data stack. She also discusses her views on the role of the data lakehouse as a building block for these architectures and the ongoing influence that it will have as the technology matures.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Colleen Tartow about her views on the forces shaping the modern data ecosystem.

Testing is an important part of the DataOps life cycle, giving teams confidence in the integrity of their data as it moves downstream to production systems. But what happens when testing doesn’t catch all of your bad data and “unknown unknown” data quality issues fall through the cracks? Fortunately, data engineers can apply a thing or two from DevOps best practices to tackle data quality at scale with circuit breakers, a novel approach to stopping bad data from actually entering your pipelines in the first place. In this talk, Prateek Chawla, Founding Team Member and Technical Lead at Monte Carlo, will discuss what circuit breakers are, how to integrate them with your Airflow DAGs, and what this looks like in practice. Time permitting, Prateek will also walk through how to build and automate Airflow circuit breakers across multiple cascading pipelines with Python and other common tools.
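
A minimal sketch of the circuit breaker idea, not Monte Carlo's implementation: an Airflow ShortCircuitOperator runs a data quality check and skips downstream tasks when it fails. The DAG name and the row-count check are placeholders.

```python
# Hedged sketch of the circuit breaker idea, not Monte Carlo's implementation:
# a ShortCircuitOperator runs a data quality check and skips downstream tasks
# when it fails. The DAG name and the row-count check are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import ShortCircuitOperator


def data_passes_quality_checks() -> bool:
    # Replace with real checks (freshness, volume, null rates) against your warehouse.
    row_count = 1_250_000  # e.g. fetched with a warehouse query
    return row_count > 1_000_000


with DAG(
    dag_id="orders_pipeline_with_circuit_breaker",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    circuit_breaker = ShortCircuitOperator(
        task_id="circuit_breaker",
        python_callable=data_passes_quality_checks,
    )
    publish_to_prod = EmptyOperator(task_id="publish_to_prod")

    # If the check returns False, publish_to_prod is skipped rather than
    # letting bad data propagate downstream.
    circuit_breaker >> publish_to_prod
```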

Have data quality issues? What about reliability problems? You may be hearing a lot of these terms, along with many others, that describe issues you face with your data. What's the difference, which are you suffering from, and how do you tackle both? Knowing that your Airflow DAGs are green is not enough. It's time to focus on data reliability and quality measurements to build trust in your data platform. Join this session to learn how Databand's proactive data observability platform makes it easy to achieve trusted data in your pipelines. In this session we'll cover:

  • The core differences between data reliability vs data quality
  • What role the data platform team, analysts, and scientists play in guaranteeing data quality
  • Hands-on demo setting up Airflow reliability tracking

Broken data is costly, time-consuming, and nowadays, an all-too-common reality for even the most advanced data teams. In this talk, I’ll introduce this problem, called “data downtime” — periods of time when data is partial, erroneous, missing or otherwise inaccurate — and discuss how to eliminate it in your data ecosystem with end-to-end data observability. Drawing corollaries to application observability in software engineering, data observability is a critical component of the modern DataOps workflow and the key to ensuring data trust at scale. I’ll share why data observability matters when it comes to building a better data quality strategy and highlight tactics you can use to address it today.

We, the Data Engineering Team at WB Games, implemented internal Redshift Loader DAGs on Airflow that allow us to ingest data into Redshift in near real time at scale, accounting for variable load on the database and quickly catching up data loads after DB outages or high-usage scenarios. Highlights:

  • Handles any type of Redshift outage or system delay dynamically between multiple sources (S3) and sinks (Redshift).
  • Auto-tunes data copies for faster backfill in case of delay without overwhelming the commit queue.
  • Supports schema evolution on game data dynamically.
  • Maintains data quality to ensure we do not create data gaps or duplicates.
  • Provides embedded custom metrics for deeper insights and anomaly detection.
  • Airflow config-based declarative DAG implementation.
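
The sketch below is a hypothetical, simplified version of such a config-driven loader using the Amazon provider's S3ToRedshiftOperator; it omits the dynamic throttling, backfill auto-tuning, and schema-evolution logic described above, and all table, bucket, and connection names are assumptions.

```python
# Hypothetical sketch of a config-driven ("declarative") S3-to-Redshift loader
# DAG; the real implementation described above adds dynamic throttling,
# backfill auto-tuning, and schema evolution. All names are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

# In practice this would come from a YAML/JSON config file.
LOAD_CONFIG = [
    {"table": "game_sessions", "s3_key": "events/game_sessions/"},
    {"table": "purchases", "s3_key": "events/purchases/"},
]

with DAG(
    dag_id="redshift_loader",
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as dag:
    for cfg in LOAD_CONFIG:
        S3ToRedshiftOperator(
            task_id=f"load_{cfg['table']}",
            schema="analytics",
            table=cfg["table"],
            s3_bucket="game-telemetry",
            s3_key=cfg["s3_key"],
            redshift_conn_id="redshift_default",
            aws_conn_id="aws_default",
            copy_options=["FORMAT AS PARQUET"],
        )
```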

“Why is my data missing?” “Why didn’t my Airflow job run?” “What happened to this report?” If you’ve been on the receiving end of any of these questions, you’re not alone. As data pipelines become increasingly complex and companies ingest more and more data, data engineers are on the hook for troubleshooting where, why, and how data quality issues occur, and most importantly, fixing them so systems can get up and running again. In this talk, Francisco Alberini, Monte Carlo’s first product hire, discusses the three primary factors that contribute to data quality issues and how data teams can leverage Airflow, dbt, and other solutions in their arsenal to conduct root cause analysis on their data pipelines.

Recently there has been much discussion around data monitoring, particularly in regard to reducing the time to mitigate data quality problems once they've been detected. The problem with reactive or periodic monitoring as the de facto standard for maintaining data quality is that it's expensive. By the time a data problem has been identified, its effects may have been amplified across a myriad of downstream consumers, leaving you (a data engineer) with a big mess to clean up. In this talk, we will present an approach for proactively addressing data quality problems using orchestration based on a central metadata graph. Specifically, we will walk through use cases highlighting how the open-source metadata platform DataHub can enable proactive pipeline circuit-breaking by serving as the source of truth for both the technical and semantic health status of a pipeline's data dependencies. We'll share practical recipes for how three powerful open-source projects can be combined to build reliable data pipelines: Great Expectations for generating technical health signals in the form of assertion results on datasets, DataHub for providing a semantic identity for a dataset, including ownership, compliance, and lineage, and Airflow for orchestration.
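
As a small taste of the Great Expectations piece of that recipe, the sketch below runs two assertions with the older pandas-dataset API (newer GX releases restructure this) and trips when any fail; in the talk's approach the assertion results would be pushed to DataHub, which the Airflow circuit breaker then consults. The file path and columns are illustrative.

```python
# Hedged sketch of the Great Expectations piece, using the older pandas-dataset
# API (newer GX releases restructure this). The file path and columns are
# illustrative; in the talk's recipe the assertion results would be reported to
# DataHub, which the Airflow circuit breaker then consults before proceeding.
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("orders_2024_01_01.parquet")  # placeholder input
ge_df = ge.from_pandas(df)

results = [
    ge_df.expect_column_values_to_not_be_null("order_id"),
    ge_df.expect_column_values_to_be_between("order_amount", min_value=0, max_value=100_000),
]

# Any failed assertion should trip the circuit breaker for downstream tasks.
if not all(r.success for r in results):
    raise ValueError("Data quality assertions failed; halting the pipeline")
```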

It’s tempting to dismiss observability as another overused buzzword. But this emerging discipline offers substantive methods for enterprises to monitor and optimize business metrics, IT operations, data pipelines, machine learning models, and data quality. Published at: https://www.eckerson.com/articles/the-five-shades-of-observability-business-operations-pipelines-models-and-data-quality

Summary The proliferation of sensors and GPS devices has dramatically increased the number of applications for spatial data, and the need for scalable geospatial analytics. In order to reduce the friction involved in aggregating disparate data sets that share geographic similarities, the Unfolded team built a platform that supports working across raster, vector, and tabular data in a single system. In this episode Isaac Brodsky explains how the Unfolded platform is architected, their experience joining the team at Foursquare, and how you can start using it for analyzing your spatial data today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Isaac Brodsky about Foursquare's Unfolded platform for working with geospatial data.