talk-data.com

Speaker

John Joyce

Co-Founder at Acryl Data

Talks & appearances

Data Swamp to Data Swimming Pool: Cleaning up the Mess - Maggie Hays & John Joyce, Acryl Data

This talk was recorded at Crunch Conference 2022. Maggie Hays and John Joyce from Acryl Data present "Data Swamp to Data Swimming Pool: Cleaning up the Mess", a talk on modern data governance.

"In this talk, we will present a practical approach to supercharging your data governance initiative by surfacing and leveraging metadata readily available within your data ecosystem."

The event was organized by Crafthub.

You can watch the rest of the conference talks on our channel.

If you are interested in more speakers, tickets and details of the conference, check out our website: https://crunchconf.com/
If you are interested in more events from our company, see: https://crafthub.events/

Recently there has been much discussion around data monitoring, particularly with regard to reducing the time to mitigate data quality problems once they've been detected. The problem with reactive or periodic monitoring as the de facto standard for maintaining data quality is that it's expensive. By the time a data problem has been identified, its effects may have been amplified across a myriad of downstream consumers, leaving you (a data engineer) with a big mess to clean up. In this talk, we will present an approach for proactively addressing data quality problems using orchestration based on a central metadata graph. Specifically, we will walk through use cases highlighting how the open source metadata platform DataHub can enable proactive pipeline circuit-breaking by serving as the source of truth for both the technical and semantic health status of a pipeline's data dependencies. We'll share practical recipes for how three powerful open source projects can be combined to build reliable data pipelines: Great Expectations for generating technical health signals in the form of assertion results on datasets; DataHub for providing a semantic identity for a dataset, including ownership, compliance, and lineage; and Airflow for orchestration.
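
As a rough illustration of the circuit-breaking idea described above, here is a minimal Python sketch. It is not the speakers' implementation and it does not call the DataHub, Great Expectations, or Airflow APIs. The in-memory `graph` dictionary stands in for a central metadata graph, and `DatasetHealth`, `unhealthy_dependencies`, `run_if_healthy`, and `CircuitOpen` are hypothetical names introduced only for this example. The assumption is that each dataset carries a technical signal (did its assertions pass?), a semantic signal (does it have an owner?), and a list of upstream datasets.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DatasetHealth:
    """Hypothetical health record for one dataset in a metadata graph."""
    assertions_passed: bool                              # technical health signal
    has_owner: bool = True                               # semantic health signal
    upstreams: List[str] = field(default_factory=list)   # lineage edges


class CircuitOpen(Exception):
    """Raised when a pipeline step is skipped because a dependency is unhealthy."""


def unhealthy_dependencies(graph: Dict[str, DatasetHealth], dataset: str) -> List[str]:
    """Walk upstream lineage and return every dependency that fails a health check."""
    bad, seen = [], set()
    stack = list(graph[dataset].upstreams)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        health = graph[name]
        if not (health.assertions_passed and health.has_owner):
            bad.append(name)
        stack.extend(health.upstreams)
    return sorted(bad)


def run_if_healthy(graph: Dict[str, DatasetHealth], dataset: str, task: Callable):
    """Run the task only if all upstream dependencies of the dataset are healthy."""
    bad = unhealthy_dependencies(graph, dataset)
    if bad:
        raise CircuitOpen(f"skipping {dataset}: unhealthy upstreams {bad}")
    return task()


# Example lineage: raw_orders -> clean_orders -> revenue_report
graph = {
    "raw_orders": DatasetHealth(assertions_passed=False),
    "clean_orders": DatasetHealth(assertions_passed=True, upstreams=["raw_orders"]),
    "revenue_report": DatasetHealth(assertions_passed=True, upstreams=["clean_orders"]),
}
```

In this sketch, building `revenue_report` is blocked before it runs because `raw_orders` failed its assertions two hops upstream, so the bad data never reaches downstream consumers. In a real deployment the health lookup would presumably be a query against the metadata platform, and the check would run as a gating step in the orchestrator.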