talk-data.com

Topic

Data Quality

data_management · data_cleansing · data_validation

537 tagged activities

Activity Trend

Peak of 82 activities per quarter, 2020-Q1 to 2026-Q1

Activities

537 activities · Newest first

Building Production-Ready Recommender Systems with Feature Stores

Recommender systems are highly prevalent in modern applications and services but are notoriously difficult to build and maintain. Organizations face challenges such as complex data dependencies, data leakage, and frequently changing data/models. These challenges are compounded when building, deploying, and maintaining ML pipelines spans data scientists and engineers. Feature stores help address many of the operational challenges associated with recommender systems.

In this talk, we explore:

  • Challenges of building recommender systems
  • Strategies for reducing latency, while balancing requirements for freshness
  • Challenges in mitigating data quality issues
  • Technical and organizational challenges feature stores solve
  • How to integrate Feast, an open-source feature store, into an existing recommender system to support production systems
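
As a taste of what that integration looks like, here is a minimal, hedged sketch of serving features from Feast's online store at recommendation time; the repository path, feature view, and feature names are illustrative assumptions, not material from the talk.

```python
# Minimal, hedged sketch of serving features at recommendation time with Feast.
# Assumes a feature repository with a "user_features" feature view keyed by
# "user_id" has already been defined and materialized; all names are illustrative.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing feature_store.yaml

# Fetch fresh feature values from the online store for the users being ranked.
online_features = store.get_online_features(
    features=[
        "user_features:purchase_count_7d",
        "user_features:last_category_viewed",
    ],
    entity_rows=[{"user_id": 1001}, {"user_id": 1002}],
).to_dict()

# The resulting dict (feature name -> list of values) feeds the ranking model,
# so training and serving share the same feature definitions.
print(online_features)
```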


Cleanlab: AI to Find and Fix Errors in ML Datasets

Real-world datasets have a large fraction of errors, which negatively impacts model quality and benchmarking. This talk presents Cleanlab, an open-source tool that addresses these issues using the latest research in data-centric AI. Cleanlab has been used to improve datasets at a number of Fortune 500 companies.

Ontological issues, invalid data points, and label errors are pervasive in datasets. Even gold-standard ML datasets contain, on average, 3.3% label errors (labelerrors.com). These errors degrade model quality, lead to incorrect conclusions about model performance, and result in suboptimal models being deployed.

We present the cleanlab open-source package (github.com/cleanlab/cleanlab) for finding and fixing data errors. We will walk through using Cleanlab to fix errors in a real-world dataset, with an end-to-end demo of how Cleanlab improves data and model performance.
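
For a flavor of that workflow, the sketch below uses cleanlab's find_label_issues on out-of-sample predicted probabilities; the synthetic data and simple model are placeholders, not the dataset from the demo.

```python
# Hedged sketch of cleanlab's label-issue detection on out-of-sample predicted
# probabilities; the synthetic data and simple model are placeholders, not the
# dataset from the demo.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

X = np.random.rand(1000, 5)                  # placeholder features
labels = np.random.randint(0, 3, size=1000)  # placeholder (possibly noisy) labels

# Out-of-sample predicted probabilities, e.g. via cross-validation.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)

# Indices of examples whose given label most likely disagrees with the data,
# worst first.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_indices)} likely label errors; top 10: {issue_indices[:10]}")
```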

Finally, we will show Cleanlab Studio, which provides a web interface for human-in-the-loop data quality control.


Enabling BI in a Lakehouse Environment: How Spark and Delta Can Help With Automating DWH Development

Traditional data warehouses typically struggle to handle large volumes of data and traffic, particularly unstructured data. In contrast, data lakes overcome these issues and have become the central hub for storing data. We outline how to enable Kimball-style BI data modelling in a Lakehouse environment.

We present how we built a Spark-based framework to modernize DWH development with a data lake as the central storage layer, ensuring high data quality and scalability. The framework has been implemented at over 15 enterprise data warehouses across Europe.

We show how classic data warehouse concepts such as surrogate, foreign, and business keys and SCD types 1 and 2 can be implemented with Spark and Delta Lake. Additionally, we share our experiences on how such a unified data modelling framework can bridge BI with modern-day use cases such as machine learning and real-time analytics. The session outlines the original challenges, the steps taken, and the technical hurdles we faced.
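
For illustration, here is a minimal, hedged sketch of one such concept, an SCD Type 2 upsert implemented with a Delta Lake MERGE plus an append. It is not the framework from the talk, and the table and column names (dwh.customer_dim, customer_id, address, is_current, start_date, end_date) are assumptions.

```python
# Hedged sketch of an SCD Type 2 upsert with Delta Lake, not the framework from
# the talk. Table and column names (dwh.customer_dim, customer_id, address,
# is_current, start_date, end_date) are illustrative assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
updates = spark.table("staging.customer_updates")    # incoming business-key rows
dim = DeltaTable.forName(spark, "dwh.customer_dim")  # existing dimension table

# Step 1: close out current rows whose tracked attribute changed.
(dim.alias("d")
    .merge(updates.alias("u"), "d.customer_id = u.customer_id AND d.is_current = true")
    .whenMatchedUpdate(
        condition="d.address <> u.address",           # change detection
        set={"is_current": "false", "end_date": "current_date()"},
    )
    .execute())

# Step 2: append new versions. After step 1, changed customers no longer have a
# current row, so they (and brand-new customers) show up as unmatched here.
current = spark.table("dwh.customer_dim").where("is_current = true")
changed_or_new = (
    updates.alias("u")
    .join(current.alias("d"), F.expr("u.customer_id = d.customer_id"), "left")
    .where("d.customer_id IS NULL")
    .selectExpr("u.*")
)
(changed_or_new
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").saveAsTable("dwh.customer_dim"))
```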


Discover Data Lakehouse With End-to-End Lineage

Data lineage is key for managing change, ensuring data quality, and implementing data governance in an organization. There are several use cases for data lineage:

  • Data Governance: For compliance and regulatory purposes, our customers are required to prove that the data and reports they submit came from a trusted and verified source. This typically means identifying the tables and datasets used in a report or dashboard and tracing the source of those tables and fields. Another governance scenario is understanding the spread of sensitive data within the lakehouse.
  • Data Discovery: Data analysts looking to self-serve and build their own analytics and models typically spend time exploring and understanding the data in their lakehouse. Lineage is a key piece of information that enhances the understanding and trustworthiness of the data the analyst plans to use.
  • Problem Identification: Data teams are often called on to resolve errors in analysts' dashboards and reports (“Why is the total number of widgets different in this report than in the one I built?”). This usually leads to an expensive forensic exercise by the data engineering team to understand the sources of the data and the transformations applied to it before it hits the report.
  • Change Management: It is not uncommon for data sources to change; a new source may stop delivering data, or a field in the source system may change its semantics. In this scenario the data engineering team wants to understand the downstream impact of the change: how many datasets and users will be affected. This helps them determine the impact of the change, manage user expectations, and address issues ahead of time.

In this talk, we describe in detail how we capture table and column lineage for Spark, Delta, and Unity Catalog for our customers, and how users can leverage data lineage to serve the use cases above.
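
As a small illustration of the downstream-impact use case, the sketch below queries lineage from a notebook, assuming Unity Catalog lineage system tables are enabled; the system.access.table_lineage schema and the table name in the filter are assumptions to verify against your workspace.

```python
# Hedged sketch: exploring downstream impact from a notebook, assuming Unity
# Catalog lineage system tables are enabled. The system.access.table_lineage
# schema and the table name in the filter are assumptions to verify against
# your workspace.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

downstream = spark.sql("""
    SELECT DISTINCT target_table_full_name, entity_type
    FROM system.access.table_lineage
    WHERE source_table_full_name = 'main.sales.orders_bronze'
      AND event_time >= date_sub(current_date(), 30)
""")
downstream.show(truncate=False)
```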


ÀLaSpark: Gousto's Recipe for Building Scalable PySpark Pipelines

Find out how Gousto develops its data pipelines at scale in a repeatable manner. At Gousto, we've developed Goustospark, a wrapper around PySpark that allows us to quickly and easily build data pipelines that are deployed into our Databricks environment.

This wrapper abstracts the repetitive components of every data pipeline, such as Spark configuration and metastore interactions. A developer can simply specify the blueprint of a pipeline before turning their attention to more pressing issues, such as data quality and data governance, whilst enjoying a high level of performance and reliability.

In this session we will take a deep dive into the design patterns we followed and some unique approaches we've taken to structuring pipelines, and show a live demo of implementing a new Spark streaming pipeline in Databricks from scratch. We will even share some example Python code and snippets to help you build your own.
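
The snippet below is a hypothetical sketch of the blueprint idea, not Gousto's actual ÀLaSpark code: a small base class owns the SparkSession and metastore I/O so a concrete pipeline only declares its sources, sink, and transform. All class and table names are invented for illustration.

```python
# Hypothetical sketch of the "blueprint" idea, not Gousto's actual ÀLaSpark
# code: a base class owns the SparkSession and metastore I/O so a concrete
# pipeline only declares its sources, sink, and transform. All names invented.
from abc import ABC, abstractmethod
from pyspark.sql import DataFrame, SparkSession


class PipelineBlueprint(ABC):
    sources: list = []   # input tables in the metastore
    sink: str = ""       # output table

    def __init__(self):
        self.spark = SparkSession.builder.getOrCreate()

    @abstractmethod
    def transform(self, inputs: dict) -> DataFrame:
        """Pure business logic: map input DataFrames to one output DataFrame."""

    def run(self) -> None:
        inputs = {name: self.spark.table(name) for name in self.sources}
        self.transform(inputs).write.mode("overwrite").saveAsTable(self.sink)


class OrdersDailySummary(PipelineBlueprint):
    sources = ["raw.orders"]
    sink = "analytics.orders_daily"

    def transform(self, inputs):
        return inputs["raw.orders"].groupBy("order_date").count()


if __name__ == "__main__":
    OrdersDailySummary().run()
```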


Best Practices of Maintaining High-Quality Data

Data sits at the heart of machine learning algorithms, and your model is only as good as your organization's data governance policies. The talk will cover multiple data governance frameworks and then dive deep into one of the key areas of data governance: data quality. The session will cover the significance of data quality, the definition of "good" data, and the key benefits and impact of maintaining high-quality data and processes. Rather than staying theoretical, the talk focuses on practical techniques and guidelines for maintaining data quality.
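
In that practical spirit, here is a minimal, hedged sketch of the kind of checks such guidelines often start with, expressed in PySpark; the table name, columns, and 1% tolerance are assumptions, not recommendations from the talk.

```python
# Illustrative sketch of basic data quality checks in PySpark (null rate,
# duplicate keys, invalid values); the table name, columns, and 1% tolerance
# are assumptions, not recommendations from the talk.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("analytics.customer_orders")

total = df.count()
checks = {
    "null_customer_id_pct": df.filter(F.col("customer_id").isNull()).count() / total,
    "duplicate_order_id_pct": 1 - df.select("order_id").distinct().count() / total,
    "negative_amount_pct": df.filter(F.col("order_amount") < 0).count() / total,
}

failed = {name: pct for name, pct in checks.items() if pct > 0.01}  # 1% tolerance
if failed:
    raise ValueError(f"Data quality checks failed: {failed}")
print("All data quality checks passed")
```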


Computational Data Governance at Scale

This talk is about the implementation of a Data Mesh at Fozzy Group. In our experience, the biggest bottleneck in the transition to Data Mesh is unclear data ownership. This and other issues can be solved with (federated) computational data governance. We will go through the process of building a global data lineage with 200k tables, 40k table replications, and 70k SQL stored procedures. We will also cover our lessons from building a data product culture with explicit and automated tracking of ownership and data quality. Fozzy Group is a holding company that comprises about 40 different businesses with 60k employees in various domains: retail, banking, insurance, logistics, agriculture, HoReCa, e-commerce, etc.


Git for Data Lakes—How lakeFS Scales Data Versioning to Billions of Objects

Modern data lake architectures rely on object storage as the single source of truth. We use it to store an increasing amount of data that is increasingly complex and interconnected. While scalable, these object stores provide few safety guarantees: they lack the semantics for atomicity, rollbacks, and reproducibility that data quality and resiliency require.

lakeFS, an open-source data version control system designed for data lakes, solves these problems by introducing concepts borrowed from Git: branching, committing, merging, and rolling back changes to data.

In this talk you'll learn about the challenges with using object storage for data lakes and how lakeFS enables you to solve them.

By the end of the session you'll understand how lakeFS scales its Git-like data model to petabytes of data across billions of objects, without affecting throughput or performance. We will also demo branching, writing data using Spark, and merging it on a billion-object repository.
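
For a flavor of that workflow, the sketch below writes to an isolated branch through lakeFS's S3 gateway, where object paths take the form s3a://&lt;repository&gt;/&lt;branch&gt;/&lt;path&gt;; the repository, branch, and endpoint values are illustrative, and branch creation and merging are assumed to happen via the lakeFS UI, API, or lakectl.

```python
# Hedged sketch of branch-isolated writes with Spark through lakeFS's S3
# gateway, where object paths take the form s3a://<repository>/<branch>/<path>.
# Repository, branch, and endpoint values are illustrative; credentials config
# and branch creation/merging (UI, API, or lakectl) are omitted for brevity.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")  # lakeFS gateway
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the production data from the main branch...
events = spark.read.parquet("s3a://analytics-repo/main/events/2024-01-01/")

# ...and write the cleaned version to an isolated branch instead of main.
cleaned = events.dropDuplicates(["event_id"]).where("event_ts IS NOT NULL")
cleaned.write.mode("overwrite").parquet("s3a://analytics-repo/ingest-fix/events/2024-01-01/")

# Once validation passes on the branch, merging it into main makes the change
# atomic; if checks fail, the branch is simply discarded.
```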


Data-Centric AI Development: From Big Data to Good Data (Andrew Ng)

Data-centric AI is a growing movement that shifts the engineering focus in AI systems from the model to the data. However, data-centric AI faces many open challenges, including measuring data quality, iterating on and engineering data as part of the ML project workflow, data management tools, crowdsourcing, data augmentation and data synthesis, and responsible AI. This talk names the key pillars of data-centric AI, identifies the trends in the data-centric AI movement, and sets a vision for taking ideas applied intuitively by a handful of experts and synthesizing them into tools that make their application systematic for all.


Summary There are extensive and valuable data sets that are available outside the bounds of your organization. Whether that data is public, paid, or scraped, it requires investment and upkeep to acquire and integrate it with your systems. Crux was built to reduce the total cost of acquisition and ownership for integrating external data, offering a fully managed service for delivering those data assets in the manner that best suits your infrastructure. In this episode Crux CTO Mark Etherington discusses the different costs involved in managing external data, how to think about the total return on investment for your data, and how the Crux platform is architected to reduce the toil involved in managing third-party data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey.

Summary Building a data platform is a journey, not a destination. Beyond the work of assembling a set of technologies and building integrations across them, there is also the work of growing and organizing a team that can support and benefit from that platform. In this episode Inbar Yogev and Lior Winner share the journey that they and their teams at Riskified have been on for their data platform. They also discuss how they have established a guild system for training and supporting data professionals in the organization.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Inbar Yogev and Lior Winner about the data platform that the team at Riskified is building to power their fraud management service.

Interview

Introduction How did

Summary The ecosystem for data tools has been going through rapid and constant evolution over the past several years. These technological shifts have brought about corresponding changes in data and platform architectures for managing data and analytical workflows. In this episode Colleen Tartow shares her insights into the motivating factors and benefits of the most prominent patterns that are in the popular narrative; data mesh and the modern data stack. She also discusses her views on the role of the data lakehouse as a building block for these architectures and the ongoing influence that it will have as the technology matures.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Colleen Tartow about her views on the forces shaping the modern data ecosystem.

Testing is an important part of the DataOps life cycle, giving teams confidence in the integrity of their data as it moves downstream to production systems. But what happens when testing doesn’t catch all of your bad data and “unknown unknown” data quality issues fall through the cracks? Fortunately, data engineers can apply a thing or two from DevOps best practices to tackle data quality at scale with circuit breakers, a novel approach to stopping bad data from actually entering your pipelines in the first place. In this talk, Prateek Chawla, Founding Team Member and Technical Lead at Monte Carlo, will discuss what circuit breakers are, how to integrate them with your Airflow DAGs, and what this looks like in practice. Time permitting, Prateek will also walk through how to build and automate Airflow circuit breakers across multiple cascading pipelines with Python and other common tools.
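
A minimal sketch of the circuit breaker idea, not Monte Carlo's implementation: an Airflow ShortCircuitOperator runs a data quality check and skips downstream tasks when it fails. The DAG name and the row-count check are placeholders.

```python
# Hedged sketch of the circuit breaker idea, not Monte Carlo's implementation:
# a ShortCircuitOperator runs a data quality check and skips downstream tasks
# when it fails. The DAG name and the row-count check are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import ShortCircuitOperator


def data_passes_quality_checks() -> bool:
    # Replace with real checks (freshness, volume, null rates) against your warehouse.
    row_count = 1_250_000  # e.g. fetched with a warehouse query
    return row_count > 1_000_000


with DAG(
    dag_id="orders_pipeline_with_circuit_breaker",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    circuit_breaker = ShortCircuitOperator(
        task_id="circuit_breaker",
        python_callable=data_passes_quality_checks,
    )
    publish_to_prod = EmptyOperator(task_id="publish_to_prod")

    # If the check returns False, publish_to_prod is skipped rather than
    # letting bad data propagate downstream.
    circuit_breaker >> publish_to_prod
```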

Have data quality issues? What about reliability problems? You may be hearing a lot of these terms, along with many others, that describe issues you face with your data. What's the difference, which are you suffering from, and how do you tackle both? Knowing that your Airflow DAGs are green is not enough. It's time to focus on data reliability and quality measurements to build trust in your data platform. Join this session to learn how Databand's proactive data observability platform makes it easy to achieve trusted data in your pipelines. In this session we'll cover:

  • The core differences between data reliability vs data quality
  • What role the data platform team, analysts, and scientists play in guaranteeing data quality
  • Hands-on demo setting up Airflow reliability tracking

Broken data is costly, time-consuming, and nowadays, an all-too-common reality for even the most advanced data teams. In this talk, I’ll introduce this problem, called “data downtime” — periods of time when data is partial, erroneous, missing or otherwise inaccurate — and discuss how to eliminate it in your data ecosystem with end-to-end data observability. Drawing corollaries to application observability in software engineering, data observability is a critical component of the modern DataOps workflow and the key to ensuring data trust at scale. I’ll share why data observability matters when it comes to building a better data quality strategy and highlight tactics you can use to address it today.

We, the Data Engineering Team at WB Games, implemented internal Redshift Loader DAGs on Airflow that allow us to ingest data into Redshift in near real time at scale, accounting for variable load on the database and quickly catching up data loads after DB outages or high-usage scenarios. Highlights:

  • Handles any type of Redshift outage or system delay dynamically between multiple sources (S3) and sinks (Redshift).
  • Auto-tunes data copies for faster backfill in case of delay without overwhelming the commit queue.
  • Supports schema evolution on game data dynamically.
  • Maintains data quality to ensure we do not create data gaps or duplicates.
  • Provides embedded custom metrics for deeper insights and anomaly detection.
  • Airflow config-based declarative DAG implementation.
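
The sketch below is a hypothetical, simplified version of such a config-driven loader using the Amazon provider's S3ToRedshiftOperator; it omits the dynamic throttling, backfill auto-tuning, and schema-evolution logic described above, and all table, bucket, and connection names are assumptions.

```python
# Hypothetical sketch of a config-driven ("declarative") S3-to-Redshift loader
# DAG; the real implementation described above adds dynamic throttling,
# backfill auto-tuning, and schema evolution. All names are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

# In practice this would come from a YAML/JSON config file.
LOAD_CONFIG = [
    {"table": "game_sessions", "s3_key": "events/game_sessions/"},
    {"table": "purchases", "s3_key": "events/purchases/"},
]

with DAG(
    dag_id="redshift_loader",
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as dag:
    for cfg in LOAD_CONFIG:
        S3ToRedshiftOperator(
            task_id=f"load_{cfg['table']}",
            schema="analytics",
            table=cfg["table"],
            s3_bucket="game-telemetry",
            s3_key=cfg["s3_key"],
            redshift_conn_id="redshift_default",
            aws_conn_id="aws_default",
            copy_options=["FORMAT AS PARQUET"],
        )
```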

“Why is my data missing?” “Why didn’t my Airflow job run?” “What happened to this report?” If you’ve been on the receiving end of any of these questions, you’re not alone. As data pipelines become increasingly complex and companies ingest more and more data, data engineers are on the hook for troubleshooting where, why, and how data quality issues occur, and most importantly, fixing them so systems can get up and running again. In this talk, Francisco Alberini, Monte Carlo’s first product hire, discusses the three primary factors that contribute to data quality issues and how data teams can leverage Airflow, dbt, and other solutions in their arsenal to conduct root cause analysis on their data pipelines.

Recently there has been much discussion around data monitoring, particularly in regard to reducing the time to mitigate data quality problems once they've been detected. The problem with reactive or periodic monitoring as the de facto standard for maintaining data quality is that it's expensive. By the time a data problem has been identified, its effects may have been amplified across a myriad of downstream consumers, leaving you (a data engineer) with a big mess to clean up. In this talk, we will present an approach for proactively addressing data quality problems using orchestration based on a central metadata graph. Specifically, we will walk through use cases highlighting how the open-source metadata platform DataHub can enable proactive pipeline circuit-breaking by serving as the source of truth for both the technical and semantic health status of a pipeline's data dependencies. We'll share practical recipes for how three powerful open-source projects can be combined to build reliable data pipelines: Great Expectations for generating technical health signals in the form of assertion results on datasets, DataHub for providing a semantic identity for a dataset, including ownership, compliance, and lineage, and Airflow for orchestration.
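
As a small taste of the Great Expectations piece of that recipe, the sketch below runs two assertions with the older pandas-dataset API (newer GX releases restructure this) and trips when any fail; in the talk's approach the assertion results would be pushed to DataHub, which the Airflow circuit breaker then consults. The file path and columns are illustrative.

```python
# Hedged sketch of the Great Expectations piece, using the older pandas-dataset
# API (newer GX releases restructure this). The file path and columns are
# illustrative; in the talk's recipe the assertion results would be reported to
# DataHub, which the Airflow circuit breaker then consults before proceeding.
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("orders_2024_01_01.parquet")  # placeholder input
ge_df = ge.from_pandas(df)

results = [
    ge_df.expect_column_values_to_not_be_null("order_id"),
    ge_df.expect_column_values_to_be_between("order_amount", min_value=0, max_value=100_000),
]

# Any failed assertion should trip the circuit breaker for downstream tasks.
if not all(r.success for r in results):
    raise ValueError("Data quality assertions failed; halting the pipeline")
```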

It’s tempting to dismiss observability as another overused buzzword. But this emerging discipline offers substantive methods for enterprises to monitor and optimize business metrics, IT operations, data pipelines, machine learning models, and data quality. Published at: https://www.eckerson.com/articles/the-five-shades-of-observability-business-operations-pipelines-models-and-data-quality

Summary The proliferation of sensors and GPS devices has dramatically increased the number of applications for spatial data, and the need for scalable geospatial analytics. In order to reduce the friction involved in aggregating disparate data sets that share geographic similarities, the Unfolded team built a platform that supports working across raster, vector, and tabular data in a single system. In this episode Isaac Brodsky explains how the Unfolded platform is architected, their experience joining the team at Foursquare, and how you can start using it for analyzing your spatial data today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Isaac Brodsky about Foursquare's Unfolded platform for working with geospatial data.