talk-data.com

Event

Airflow Summit 2021

2021-07-01 Airflow Summit

Activities tracked

4

Airflow Summit 2021 program

Filtering by: Data Quality

Sessions & talks



Building a robust data pipeline with the dAG stack: dbt, Airflow, Great Expectations

2021-07-01
session

Data quality has become a much-discussed topic in data engineering and data science, and it is now clear that data validation is crucial to the reliability of the data products and insights an organization’s data pipelines produce. This session will outline patterns for combining three popular open-source tools in the data ecosystem (dbt, Airflow, and Great Expectations) to build a robust data pipeline with data validation at each critical step.
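The "validation at each critical step" pattern can be illustrated with a minimal sketch in plain Python. It does not use the dbt, Airflow, or Great Expectations APIs; the function names (validate, transform, run_pipeline), the checks, and the column names are all hypothetical stand-ins chosen for illustration. The idea shown is only the gating: data is checked before the transformation and again after it, and a failed check stops the run.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class ValidationError(Exception):
    """Raised when a batch fails a check, halting the pipeline."""


def validate(rows: List[Row], checks: Dict[str, Callable[[Row], bool]], stage: str) -> List[Row]:
    # Hypothetical stand-in for a validation step: collect the names of
    # all checks that at least one row fails, and stop if there are any.
    failures = [name for name, check in checks.items() if not all(check(r) for r in rows)]
    if failures:
        raise ValidationError(f"{stage}: failed checks {failures}")
    return rows


def transform(rows: List[Row]) -> List[Row]:
    # Hypothetical stand-in for a transformation step: derive a new column.
    return [{**r, "total": r["price"] * r["quantity"]} for r in rows]


def run_pipeline(source: List[Row]) -> List[Row]:
    # Checks on the raw input (assumed column names: price, quantity).
    source_checks = {
        "price_not_null": lambda r: r.get("price") is not None,
        "quantity_positive": lambda r: isinstance(r.get("quantity"), int) and r["quantity"] > 0,
    }
    # Checks on the transformed output.
    output_checks = {"total_non_negative": lambda r: r["total"] >= 0}

    validated = validate(source, source_checks, stage="source")
    transformed = transform(validated)
    return validate(transformed, output_checks, stage="transformed")
```

In an orchestrated setting, each of the three calls in run_pipeline would typically be its own task, so that a validation failure blocks the downstream tasks rather than raising inside a single function.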

Data Lineage with Apache Airflow using OpenLineage

2021-07-01
session

If you manage a lot of data, and you’re attending this summit, you likely rely on Apache Airflow to do much of the heavy lifting. Like any powerful tool, Apache Airflow allows you to accomplish what you couldn’t before… but also creates new challenges. As DAGs pile up, complexity layers on top of complexity, and it becomes hard to grasp how a failed or delayed DAG will affect everything downstream. In this session we will provide a crash course on OpenLineage, an open platform for metadata management and data lineage analysis. We’ll show how capturing metadata with OpenLineage can help you maintain inter-DAG dependencies, capture data on historical runs, and minimize data quality issues.
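To make the downstream-impact question concrete, here is a small sketch of how dataset-level lineage metadata can answer "which DAGs are affected if this one fails?". It is not the OpenLineage API or data model; the function name downstream_dags, the two mappings, and the DAG and dataset names in the usage example are hypothetical. It assumes only that, for each DAG, the datasets it reads and writes have been recorded.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_dags(
    failed: str,
    produces: Dict[str, Set[str]],
    consumes: Dict[str, Set[str]],
) -> List[str]:
    """Return DAGs that directly or transitively read data written by `failed`.

    produces maps a DAG name to the datasets it writes;
    consumes maps a DAG name to the datasets it reads.
    """
    affected: List[str] = []
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        outputs = produces.get(current, set())
        # Any unseen DAG reading one of these outputs is downstream.
        for dag in sorted(consumes):
            if dag not in seen and consumes[dag] & outputs:
                seen.add(dag)
                affected.append(dag)
                queue.append(dag)
    return affected


# Usage with made-up DAG and dataset names.
produces = {
    "ingest": {"raw.orders"},
    "clean": {"stg.orders"},
    "report": {"mart.sales"},
    "audit": {"audit.log"},
}
consumes = {
    "clean": {"raw.orders"},
    "report": {"stg.orders"},
    "audit": {"raw.users"},
}
impacted = downstream_dags("ingest", produces, consumes)
```

The breadth-first traversal follows dataset edges rather than explicit DAG-to-DAG links, which is why dependencies between separately defined DAGs become visible once reads and writes are captured.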

Guaranteeing pipeline SLAs and data quality standards with Databand

2021-07-01
session
Josh Benamram (Databand.ai), Vinoo Ganesh (Veraset)

We’ve all heard the phrase “data is the new oil.” But imagine a world where this analogy holds more literally, where problems in the flow of data (delays, low quality, high volatility) could bring down whole economies. When data is the new oil, with people and businesses similarly reliant on it, how do you avoid the fires, spills, and crises? As data products become central to companies’ bottom lines, data engineering teams need to set higher standards for the availability, completeness, and fidelity of their data. In this session we’ll demonstrate how Databand helps organizations guarantee the health of their Airflow pipelines. Databand is a data pipeline observability system that monitors SLAs and data quality issues and proactively alerts users to problems to avoid data downtime. The session will be led by Josh Benamram, CEO and co-founder of Databand.ai. Josh will be joined by Vinoo Ganesh, an experienced software engineer, system architect, and current CTO of Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective. Join to see how Databand.ai can help you create stable, reliable pipelines that your business can depend on!

Productionizing ML Pipelines with Airflow, Kedro, and Great Expectations

2021-07-01
session

Machine learning models can add value and insight to many projects, but they can be challenging to put into production because of problems such as lack of reproducibility, difficulty maintaining integrations, and sneaky data quality issues. Kedro, a framework for creating reproducible, maintainable, and modular data science code, and Great Expectations, a framework for data validation, are two open-source Python tools that can address some of these problems. Both integrate with Airflow for flexible and powerful ML pipeline orchestration. In this talk we’ll discuss how you can use existing Airflow provider packages to integrate these tools and create sustainable, production-ready ML models.