Ross Turk

Activities: 2 talks

Marketing & Community

Talks & appearances

An Introduction to Data Lineage with Airflow and Marquez

2022-07-01 · Airflow Summit 2022

session

with Michael Robinson, Ross Turk

Airflow

This workshop is sold out.

Data lineage might seem like a complicated and unapproachable topic, but that’s only because data pipelines are complicated. The core concept is straightforward: trace and record the journey of datasets as they travel through a data pipeline. Marquez, a lineage metadata server, is a simple thing designed to watch complex things. It tracks the movement of data through complex pipelines using a straightforward, clear object model of Jobs, Datasets, and Runs. The information it gathers can be used to help you more effectively understand, communicate, and solve problems. The interactive UI allows you to see exactly where any inefficiencies have developed or datasets have become compromised. In this workshop, you will learn how to collect and visualize lineage from a basic Airflow pipeline using Marquez. You will need to understand the basics of Airflow, but no experience with lineage is required.
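To make the Jobs, Datasets, and Runs model mentioned in the abstract a little more concrete, here is a minimal sketch in plain Python. It is not Marquez's API or schema: the class fields, the `record_run` helper, and the example namespaces and names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Dataset:
    """A named collection of data (hypothetical fields, not Marquez's schema)."""
    namespace: str
    name: str


@dataclass
class Job:
    """A process that reads input datasets and writes output datasets."""
    namespace: str
    name: str
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)


@dataclass
class Run:
    """One execution of a job, with an illustrative state label."""
    run_id: str
    job: Job
    state: str = "RUNNING"


def record_run(log: List[Run], job: Job, run_id: str, state: str) -> Run:
    """Append a run to an in-memory log (a stand-in for a metadata server)."""
    run = Run(run_id=run_id, job=job, state=state)
    log.append(run)
    return run


# Example: one job that turns a raw dataset into a cleaned one.
raw = Dataset("warehouse", "raw_orders")
clean = Dataset("warehouse", "clean_orders")
transform = Job("airflow", "orders_dag.transform", inputs=[raw], outputs=[clean])

runs: List[Run] = []
record_run(runs, transform, "run-001", "COMPLETED")
```

In this sketch, lineage is simply the chain of inputs and outputs attached to each job, and each run records when that chain was exercised.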

What is data lineage and why should I care?

2022-07-01 · Airflow Summit 2022

session

If a job fails, how can you learn about downstream datasets that have become out-of-date? Can you be confident that jobs are consuming fresh, high-quality data from their upstream sources? How might you predict the impact of a planned change on distant corners of the pipeline? These questions become easier once you have a complete understanding of data lineage, the complex set of relationships between all of your jobs and datasets. In this talk, Ross Turk from Datakin will provide a quick introduction to the core concepts behind data lineage and an overview of common architectural approaches.
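The first question in the abstract, which downstream datasets are affected when a job fails, can be sketched as a graph traversal. The pipeline below is invented for illustration, and `downstream_datasets` is a hypothetical helper, not part of any lineage tool; it only assumes that each job's input and output datasets are known.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical pipeline: each job lists the datasets it reads and writes.
JOBS: Dict[str, Dict[str, List[str]]] = {
    "extract": {"inputs": [], "outputs": ["raw_orders"]},
    "transform": {"inputs": ["raw_orders"], "outputs": ["clean_orders"]},
    "report": {"inputs": ["clean_orders"], "outputs": ["daily_report"]},
    "audit": {"inputs": ["raw_orders"], "outputs": ["audit_log"]},
}


def downstream_datasets(failed_job: str, jobs: Dict[str, Dict[str, List[str]]]) -> Set[str]:
    """Return every dataset that may be stale because failed_job did not run."""
    affected: Set[str] = set()
    queue = deque(jobs[failed_job]["outputs"])
    while queue:
        dataset = queue.popleft()
        if dataset in affected:
            continue  # already visited
        affected.add(dataset)
        # Any job that consumes this dataset produces outputs that are also suspect.
        for spec in jobs.values():
            if dataset in spec["inputs"]:
                queue.extend(spec["outputs"])
    return affected


stale = downstream_datasets("transform", JOBS)
```

With this example pipeline, a failure in `transform` marks `clean_orders` and `daily_report` as potentially out-of-date, while `audit_log` is unaffected because it depends only on `raw_orders`.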