talk-data.com

Topic

Airflow

Apache Airflow

workflow_management data_orchestration etl

Activities

tagged

Activity Trend

157 peak/qtr

2020-Q1 2026-Q1

Top Events

Airflow Summit 2025 139 Data Engineering Podcast 122 Airflow Summit 2024 92 Airflow Summit 2023 81 Airflow Summit 2022 52 Airflow Summit 2021 52 Airflow Summit 2020 39 O'Reilly Data Engineering Books 11 DATA MINER Big Data Europe Conference 2020 5 dbt Coalesce 2022 5 Airflow Monthly Virtual Town Hall- August 4 Airflow Monthly Virtual Town Hall- March 4

Top Speakers

Tobias Macey 122 Jarek Potiuk (Apache Software Foundation) 15 Kaxil Naik 12 Ash Berlin-Taylor (Astronomer) 11 Rafal Biegacz 10 Vikram Koka (Astronomer) 9 John Jackson 9 Brent Bovenzi (Astronomer) 7 Amogh Rajesh Desai 7 Maxime Beauchemin (Preset) 7 Tatiana Al-Chueyr Martins (Astronomer) 6 Jens Scheffler 6

Activities

Showing filtered results

All Video Podcast Book

Filtering by: Airflow Summit 2022 ×

Happy DAGs + Happy Teammates: How a little CI/CD can go a long way

2022-07-01 · Airflow Summit 2022

session

by Leah Cole

CI/CD Cloud Computing Git GitHub

With a small amount of Cloud Build automation and the use of GitHub version control, your Airflow DAGs will always be tested and in sync no matter who is working on them. Leah will walk you through a sample CICD workflow for keeping your Airflow DAGs tested and in sync between environments and teammates.

Hey maintainer (and user), exercise your empathy!

2022-07-01 · Airflow Summit 2022

session

by Jarek Potiuk (Apache Software Foundation)

This talk is a walk through throug a number of ways maintainers of open-source projects (for example Airflow) can improve the communication with their users by exercising empathy. This subject is often overlooked in the cirriculum of average developer and contributor, but one that can make or break the product you developed, simply because it will become more approachable for users. Maintainers often forget or simply do not realize how many assumptions they have in their head. There are a number of techniques maintainers can use to improve it. This talk will walk through a number of examples (from Airflow and other projects), reasoning and ways how communication between maintainers and users can be improved - in the code, documentation, communication but also with involving and engaging the users they are commmunicating with, as more often than not - the users might be of great help when it comes to communication with them - if only asked. This talk is for both - maintainers and users, as I consider communication between users and maintainers two way street.

How DAG Became a Test - Airflow System Tests Redefined

2022-07-01 · Airflow Summit 2022

session

by Mateusz Nojek , Bartłomiej Hirsz

Nothing is perfect, but it doesn’t mean we shouldn’t seek perfection. After some time spent with Airflow system tests, we have recognized numerous places in which we can make significant improvements. We decided to rediscover them. The new design started with the establishment of goals. Tests need to: be easy to write, read, run and maintain, be as close as possible to how Airflow runs in practice, be fast, reliable and verifiable, assure high quality of Airflow Operators. With these principles in mind, we prepared an Airflow Improvement Proposal (AIP-47) and after it’s confirmed, we started the implementation. The results of our work were better than we expected when we started this initiative. This session will walk you through the story of how we struggled to run system tests before, how we came up with the improvements, and how we put them into a working solution.

How to Achieve Reliable Data in your Airflow Pipelines with Databand

2022-07-01 · Airflow Summit 2022

session

by Josh Benamram (Databand.ai)

Data Quality

Have data quality issues? What about reliability problems? You may be hearing a lot of these terms, along with many others, that describe issues you face with your data. What’s the difference, which are you suffering from, and how do you tackle both? Knowing that your Airflow DAGs are green is not enough. It’s time to focus on data reliability and quality measurements to build trust in your data platform. Join this session to learn how Databand’s proactive data observability platform makes it easy to achieve trusted data in your pipelines. In this session we’ll cover: The core differences between data reliability vs data quality What role the data platform team, analysts, and scientists play in guaranteeing data quality Hands-on demo setting up Airflow reliability tracking

How to Deploy Airflow From Dev to Prod Like A BOSS

2022-07-01 · Airflow Summit 2022

session

by Evgeny Shulman

Managing Airflow in large-scale environments is tough. You know this, and I know this. But, what if you had a guide to make development, testing, and production lifecycles more manageable? In this presentation, I will share how we manage Airflow for large-scale environments with friendly deployments at every step. After attending the session, Airflow engineers will: Understand the advantages of each kind of deployment Know the differences between Deployment and Airflow Executor Deploy how to incorporate all kinds of deployments for their day-to-day needs

Implementing Event-Based DAGs with Airflow

2022-07-01 · Airflow Summit 2022

session

by Kenten Danas

API

Needing to trigger DAGs based on external criteria is a common use case for data engineers, data scientists, and data analysts. Most Airflow users are probably aware of the concept of sensors and how they can be used to run your DAGs off of a standard schedule, but sensors are only one of multiple methods available to implement event-based DAGs. In this session, we’ll discuss different ways of implementing event-based DAGs using Airflow 2 features like the API and deferrable operators, with a focus on how to determine which method is the most efficient, scalable, and cost-friendly for your use case.

Ingesting Game Telemetry in near Real-time dynamically into Redshift with Airflow (WB Games)

2022-07-01 · Airflow Summit 2022

session

by Karthik Kadiyam (Warner Bros. Games)

Data Engineering Data Quality Redshift S3

We the Data Engineering Team here at WB Games implemented an internal Redshift Loader DAG(s) on Airflow that allow us to ingest data in near real-time at scale into Redshift, taking into account variable load on the DB and been able to quickly catch up data loads in case of various DB outages or high usage scenarios. Highlights: Handle any type of Redshift outages and system delays dynamically between multiple sources(S3) to sinks(Redshift). Auto tuning data copies for faster data backfill in case of delay without overwhelming commit queue. Supports schema evolution on Game data dynamically. Maintain data quality to ensure we do not create data gaps or dupes. Provide embedded custom metrics for deeper insights and anomaly detection. Airflow config based Declarative Dag implementation.

Introducing Astro Python SDK: The next generation of DAG authoring

2022-07-01 · Airflow Summit 2022

session

by Daniel Imberman

Astronomer BigQuery Python S3 Snowflake SQL postgresql

Imagine if you could chain together SQL models using nothing but python, write functions that treat Snowflake tables like dataframes and dataframes like SQL tables. Imagine if you could write a SQL airflow DAG using only python or without using any python at all. With Astro SDK, we at Astronomer have gone back to the drawing board around fundamental questions of what DAG writing could look like. Our goal is to empower Data Engineers, Data Scientists, and even the Business Analysts to write Airflow DAGs with code that reflects the data movement, instead of the system configuration. Astro will allow each group to focus on producing value in their respective fields with minimal knowledge of Airflow and high amounts of flexibility between SQL or python-based systems. This is way beyond just a new way of writing DAGs. This is a universal agnostic data transfer system. Users can run the exact same code against different databases (snowflake, bigquery, etc.) and datastores (GCS, S3, etc.) with no changes except to the connection IDs. Users will be able to promote a SQL flow from their dev postgres to their prod snowflake with a single variable change. We are ecstatic to reveal over eight months of work around building a new open-source project that will significantly improve your DAG authoring experience!

Keep Calm & Query On: Debugging Broken Data Pipelines with Airflow

2022-07-01 · Airflow Summit 2022

session

by Francisco Alberini

Data Quality dbt Monte Carlo

“Why is my data missing?” “Why didn’t my Airflow job run?” “What happened to this report?” If you’ve been on the receiving end of any of these questions, you’re not alone. As data pipelines become increasingly complex and companies ingest more and more data, data engineers are on the hook for troubleshooting where, why, and how data quality issues occur, and most importantly, fixing them so systems can get up and running again. In this talk, Francisco Alberini, Monte Carlo’s first product hire, discusses the three primary factors that contribute to data quality issues and how data teams can leverage Airflow, dbt, and other solutions in their arsenal to conduct root cause analysis on their data pipelines.

Lets use Airflow differently: let's talk load tests

2022-07-01 · Airflow Summit 2022

session

by Doron Cohen

AI/ML API

Numeric results with bulletproof confidence: this is what companies actually sell when promoting their machine learning product. Yet this seems out of reach when the product is both generic and complex, with much of the inner calculations hidden from the end user. So how can code improvements or changes in core component performance be tested at scale? Implementing API and Load Tests is time-consuming, but thorough: defining parameters, building infrastructure and debugging. The bugs may be real, but they can also be a result of poor infrastructure implementation (who is testing the testers?). In this session we will discuss how Airflow can help scale up testing in a stable and sustainable way.

Manage Dags at scale (Dags versioning & package management)

2022-07-01 · Airflow Summit 2022

session

by Anum Sheraz

Git

This talk is all about how we at Jagex manage DAGs at scale. Focusing of following challenges faced and how we resolved them; Keeping track of airflow state Keeping track of each DAG state DAGs as git submodules Updating airflow with new dags How to seamlessly automate airflow deployment How to avoid package dependency conflicts

Managing Apache Airflow at Scale

2022-07-01 · Airflow Summit 2022

session

by John Jackson

CI/CD

In this session we’ll be discussing the considerations and challenges when running Apache Airflow at scale. We’ll start by defining what it means to run Airflow at scale. Then we’ll dive deep into understanding limitations of the Airflow architecture, Scheduler processes, and configuration options. We’ll then define scaling workloads via containers and leveraging pools and priority, followed by scaling DAGs via dDynamic DAGs/DAG factories, CI/CD, and DAG access control. Finally we’ll get into managing Multiple Airflow Environments, how to split up workloads, and provide central governance for Airflow environment creation and monitoring with an example of Distributing workloads across environments.

Managing Multiple ML Models For Multiple Clients : Steps For Scaling Up

2022-07-01 · Airflow Summit 2022

session

by Ori Peri

AI/ML KPI MLOps SaaS

For most ML-based SaaS companies, the need to fulfill each customer’s KPI will usually be addressed by matching a dedicated model. Along with the benefits of optimizing the model’s performance, a model per customer solution carries a heavy production complexity with it. In this manner, incorporating up-to-date data as well as new features and capabilities as part of a model’s retraining process can become a major production bottleneck. In this talk, we will see how Riskified scaled up modeling operations based on MLOps ideas, and focus on how we used Airflow as our ML pipeline orchestrator. We will dive into how we wrap Airflow as an internal service, the goals we started with, the obstacles along the way and finally - how we solved them. You will receive tools for how to set up your own Airflow-based continuous training ML pipeline, and how we adjusted it such that ML engineers and data scientists would be able to collaborate and work in parallel using the same pipeline.

Modern Data Orchestration managed by Astronomer

2022-07-01 · Airflow Summit 2022

session

by Navid Aghdaie

Astronomer Cloud Computing

At Astronomer we have been longtime supporters and contributors to open source Apache Airflow. In this session we will present Astronomer’s latest journey, Astro, our cloud-native managed service that simplifies data orchestration and reduces operational overhead. We will also discuss the increasing importance of data orchestration in modern enterprise data platforms, industry trends, and practical problems that arise in the ever expanding heterogeneous environments.

Multitenancy is coming

2022-07-01 · Airflow Summit 2022

session

by Jarek Potiuk (Apache Software Foundation)

Kubernetes

This session is about the state and future plans of the multi-tenancy feature of Airflow. Airflow has traditionally been single-tenant product. Mutliple instances could be bound together to provide a multi-tenant implementation and when using a modern infrastructure - Kubernetes - you could even reuse resources between those - but it was not a true “multi-tenant” solution. But Airflow becomes more of a platform now and the needs for multi-tenancy as a feature of the platform are highly expected by a number of users. In 2022 we’ve started to add multi-tenant features and we are aiming to make Airflow Multi-Tenant in the near* future. This talk is about the state of the multi-tenancy now and the future plans we have for Airflow becoming full multi-tenant platform.

On-Demand DAG through the REST API

2022-07-01 · Airflow Summit 2022

session

by Mocheng Guo

API

In this talk we want to present how Airbnb extends the REST api to support on-demand workload. A DAG object is created from a local environment like Jupyter notebook, serialized into binary and transported to the API. The API persists the DAG object into the meta DB and Airflow scheduler and worker are extended to process this new kind of DAG.

OpenLineage & Airflow - data lineage has never been easier

2022-07-01 · Airflow Summit 2022

session

by Maciej Obuchowski (Datadog) , Pawel Leszczynski (GetInData)

Data Governance

OpenLineage is an open standard for metadata and lineage collection designed to instrument jobs as they are running. The standard has become remarkably adept at understanding the lifecycle of data within an organization. Additionally, Airflow lets you make use of OpenLineage with a convenient integration. Gathering data lineage has never been easier. In this talk, we’ll provide an update-to-date report on OpenLineage features and the Airflow integration – essential information for data governance architects & engineers.

Preventative Metadata: Building for data reliability with DataHub, Airflow, & Great Expectations

2022-07-01 · Airflow Summit 2022

session

by Tamás Németh , John Joyce (Acryl Data)

Data Quality

Recently there has been much discussion around data monitoring, particularly in regards to reducing time to mitigate data quality problems once they’ve been detected. The problem with reactive or periodic monitoring as the de-facto standard for maintaining data quality is that it’s expensive. By the time a data problem has been identified, it’s effects may have been amplified across a myriad of downstream consumers, leaving you (a data engineer) with a big mess to clean-up. In this talk, we will present an approach for proactively addressing data quality problems using orchestration based on a central metadata graph. Specifically, we will walk through use cases highlighting how the open source metadata platform DataHub can enable proactive pipeline circuit-breaking by serving as the source of truth for both technical & semantic health status for a pipeline’s data dependencies. We’ll share practical recipes for how three powerful open source projects can be combined to build reliable data pipelines: Great Expectations for generating technical health signals in the form of assertion results on datasets, DataHub for providing a semantic identity for a dataset, including ownership, compliance, & lineage, and Airflow for orchestration.

Running +150 production Airflow on Kubernetes, is that HARD ?

2022-07-01 · Airflow Summit 2022

session

by Prekshi Vyas , Alaeddine Maaoui

Kubernetes

This talk will cover the challenges we can face managing a large number of Airflow instances on private environment. Monitoring and metrics layers for production environment. Collecting and customizing logs. Resource consumption and green IT. Providing support for users and shared responsibility. Pain points

Skip tasks to make your debugging easy

2022-07-01 · Airflow Summit 2022

session

by Howie Wang

In Apple, we are building a self-serve data platform based on Airflow. Self-serve means users can create, deploy and run their DAGs freely. With provided logs and metrics, users are able to test or troubleshot DAGs on their own. Today, a common use case is, users want to test one or a few tasks in their DAG. However, when they trigger the DAG, all tasks instead of just the ones people are interested will run. To save time and resources, lots of users choose to manually mark complete for each tasks to skip. Can we do better than that? Is there an easy-peasy way to skip tasks? In this lightning talk, we would like to share the challenges we had, the solution we came up with, and the lesson we learned.

Page 2 of 3

← Previous

1 2 3