talk-data.com

Topic

Airflow

Apache Airflow

workflow_management data_orchestration etl

682 tagged

Activity Trend: 157 peak/qtr (2020-Q1 to 2026-Q1)

Activities

682 activities · Newest first

Kiwi.com started using Airflow in June 2016 as an orchestrator for a handful of people in the company. The need for the tool grew, and the monolithic instance came to be used by 30+ teams with 500+ active DAGs, resulting in 3.5 million tasks successfully finished per month. At first this single monolithic Airflow environment served us, but our needs quickly changed as we wanted to support a data mesh architecture within kiwi.com. By leveraging Astronomer on GCP, we were able to move from a monolithic Airflow environment to many smaller instances of Airflow. This talk will go into how to handle things like DAG dependencies, observability, and stakeholder management. Furthermore, we’ll talk about security, particularly how GCP’s workload identity helped us achieve a passwordless Airflow experience.

Have you ever added a DAG file and had no clue what happened to it? You’re not alone! With default settings, Airflow can wait up to 5 minutes before processing new DAG files. In this talk, I’ll discuss the implementation of an event-based DAG parser that immediately processes changes in the DAGs folder. As a result, changes are reflected immediately in the Airflow UI. In this talk I will cover: a demonstration of the event-based DAG parser and the fast Airflow UI experience; how the current DAG parser implementation and configuration work; and how an event-based DAG parser is implemented.
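
For illustration, a minimal sketch of the event-based idea (not the talk's actual implementation), assuming the watchdog library; reparse_dag is a hypothetical stand-in for whatever hands the file to Airflow's DAG file processor:

    # Conceptual sketch only: watch the DAGs folder and react to file events
    # instead of polling on a fixed interval. Assumes the `watchdog` package.
    import time

    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    DAGS_FOLDER = "/opt/airflow/dags"  # assumption: typical DAGs path

    def reparse_dag(path: str) -> None:
        # Placeholder: a real implementation would hand the file to Airflow's
        # DAG file processor so the UI reflects the change immediately.
        print(f"re-parsing {path}")

    class DagFolderHandler(FileSystemEventHandler):
        def on_any_event(self, event):
            # Only react to Python files; directories and other files are ignored.
            if not event.is_directory and event.src_path.endswith(".py"):
                reparse_dag(event.src_path)

    if __name__ == "__main__":
        observer = Observer()
        observer.schedule(DagFolderHandler(), DAGS_FOLDER, recursive=True)
        observer.start()
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            observer.stop()
        observer.join()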

As users of Airflow, we often use DagRun.conf attributes to control the content and flow of a DAG run. Previously, the Airflow UI only allowed launching a run by entering JSON. This was technically feasible but not user friendly: a user needs to model, check, and understand the JSON and enter parameters manually, with no option to validate them before triggering. Similar to Jenkins or GitHub/Azure pipelines, we want the option to trigger a run through a UI form where parameters can be specified. With Airflow 2.6.0, DAG.params are now used to render an entry form, and with a few options a user-friendly trigger UI can be implemented. This session shows how the new feature works and provides some examples of how to use it for your own purposes.
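
A minimal sketch of the feature (Airflow 2.6+), with made-up parameter names and defaults, showing how params declared on the DAG render as a validated trigger form in the UI:

    # Minimal sketch: DAG.params drive the trigger form; types, enums and
    # bounds are used for validation before the run is created.
    from datetime import datetime

    from airflow import DAG
    from airflow.models.param import Param
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="trigger_form_example",  # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule=None,  # triggered manually via the UI form
        params={
            "environment": Param("staging", type="string", enum=["staging", "prod"]),
            "batch_size": Param(100, type="integer", minimum=1, maximum=1000),
        },
    ) as dag:

        def use_params(**context):
            # Values entered in the form arrive through params in the context.
            print(context["params"]["environment"], context["params"]["batch_size"])

        PythonOperator(task_id="use_params", python_callable=use_params)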

As a data engineer, I’ve used Airflow extensively over the last 5 years: across 3 jobs, several different roles; for side projects, for critical infrastructure; for manually triggered jobs, for automated workflows; for IT (Ookla/Speedtest.net), for science (Allen Institute for Cell Science), for the commons (Openverse), for liberation (Orca Collective). Authoring a DAG has changed dramatically since 2018, thanks to improvements in both Airflow and the Python language. In this session, we’ll take a trip back in time to see how DAGs looked several years ago, and what the same DAGs might look like now. We’ll appreciate the many improvements that have been made towards simplifying workflow construction. I’ll also discuss the significant advancements that have been made around deploying Airflow. Lastly, I’ll give a brief overview of different use cases and ways I’ve seen Airflow leveraged.
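
As a rough illustration of that evolution (my own toy example, not the speaker's), the same two-step pipeline written in a 2018-era style and in today's TaskFlow style:

    # Toy example of the shift in DAG authoring style; task names are invented.
    from datetime import datetime

    # Circa-2018 style: explicit DAG object, PythonOperator, manual XCom wiring.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract(**context):
        context["ti"].xcom_push(key="value", value=42)

    def load(**context):
        print(context["ti"].xcom_pull(task_ids="extract", key="value"))

    with DAG("old_style", start_date=datetime(2023, 1, 1), schedule=None) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="load", python_callable=load)
        t1 >> t2

    # Modern TaskFlow style (Airflow 2.x): decorators, return values passed
    # between tasks, dependencies inferred from the function calls.
    from airflow.decorators import dag, task

    @dag(start_date=datetime(2023, 1, 1), schedule=None)
    def new_style():
        @task
        def extract_value() -> int:
            return 42

        @task
        def load_value(value: int) -> None:
            print(value)

        load_value(extract_value())

    new_style()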

How to submit an issue for the community to fix. To ensure a quality product, the Airflow community relies on bug reports from Airflow users. Oftentimes bug reports are incomplete or fail to include the steps needed to reproduce the observed bug. This workshop will present an example of the bug-to-issue process: how to rule out non-Airflow issues and, once an Airflow issue is suspected, how to submit an issue for the community to see. It could also cover how the community picks up an issue and eventually fixes it in a future release. Issue reporting is key to improving Airflow, and the community will benefit from an easily digestible guide on how best to do so.

Are you tired of spending hours on Airflow migrations and wondering how to make them more accessible? Would you like to be able to test your code on different Airflow versions? Or are you struggling to set up a reliable local development environment? These are some of the top pain points for data engineers working with Airflow. But fear not – Wix Data Engineering has some best practices to share that will make your life easier. What the audience will learn: how Wix Data Engineering makes Airflow migrations easier and less painful; how to ensure DEs' code is forward-compatible with the latest Airflow version; how to test code on different Airflow versions; how to maintain a stable local environment for DEs while speeding up their dev velocity; and more must-know best practices from the framework team.
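
One common forward-compatibility pattern (my own illustration, not necessarily Wix's practice) is to guard imports and features that only exist in newer Airflow releases, so the same DAG code loads on several versions:

    # Sketch: keep DAG code importable across Airflow versions by guarding
    # newer features. The 2.4 cutoff reflects when Datasets were introduced.
    from packaging.version import Version

    import airflow

    AIRFLOW_VERSION = Version(airflow.__version__)

    if AIRFLOW_VERSION >= Version("2.4"):
        from airflow.datasets import Dataset
        schedule = [Dataset("s3://example-bucket/table")]  # data-aware scheduling
    else:
        Dataset = None
        schedule = "@daily"  # fall back to time-based scheduling on older versions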

We are continuing to modernize the Airflow UI to make it easier to manage all aspects of your DAGs. See a demo of the latest updates and improve your workflows with new tips and tricks, then get a preview of what else will be coming soon. We'll follow up with a Q&A where people can bring their own use cases and explore new ideas for improving the user experience.

New to Airflow or haven’t followed any of the recent DAG authoring enhancements? This talk is for you! We will go through various DAG authoring features like Setup/Teardown tasks (~2.7), Datasets (2.4), Dynamic Tasks (2.3) and Async tasks (2.2). You won’t be an expert after this short talk, but you’ll have a head start when you write your next DAG, no hacks required.
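
A compact sketch (my own example, not from the talk) combining two of those features, dynamic task mapping (2.3+) and setup/teardown tasks (2.7+); the file names and scratch path are made up:

    # Illustrative sketch: dynamic task mapping plus setup/teardown decorators.
    from datetime import datetime

    from airflow.decorators import dag, setup, task, teardown

    @dag(start_date=datetime(2023, 1, 1), schedule=None)
    def authoring_features():
        @setup
        def create_scratch_dir() -> str:
            return "/tmp/scratch"  # hypothetical resource prepared for the run

        @task
        def list_files() -> list[str]:
            return ["a.csv", "b.csv", "c.csv"]  # made-up inputs

        @task
        def process(filename: str) -> None:
            print(f"processing {filename}")

        @teardown
        def cleanup(path: str) -> None:
            print(f"removing {path}")

        scratch = create_scratch_dir()
        processed = process.expand(filename=list_files())  # one task per file
        scratch >> processed >> cleanup(scratch)

    authoring_features()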

Data contracts have been much discussed in the community of late, with a lot of curiosity around how to approach this concept in practice. We believe data contracts need a harmonizing layer to manage data quality in a uniform manner across a fragmented stack. We are calling this harmonizing layer the Control Plane for Data - powered by the common thread across these systems: metadata. For teams already orchestrating pipelines with Airflow, data contracts can be an effective way to process only data that meets preset quality standards. With a control plane as a connecting layer, producers can build data contracts that consumers can rely on, ensuring DAGs only run when a contract is valid. Producers can govern how workflows should behave, and consumers receive the tooling they need to opt into only high-quality data. Learn how to use data contracts and DataHub to make your Airflow pipelines more reliable - as well as other use cases that can help build a simpler, more flexible data stack.
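
As one hedged illustration of the "DAGs only run when a contract is valid" pattern (not DataHub's actual integration), a DAG that short-circuits downstream work when a hypothetical contract check fails:

    # Conceptual sketch: gate a pipeline on a data contract check.
    # `validate_contract` is a hypothetical stand-in for a real check against
    # a metadata/control-plane service such as DataHub.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.operators.python import ShortCircuitOperator

    def validate_contract() -> bool:
        # Hypothetical: query the control plane and return True only when the
        # producer's contract (schema, freshness, quality) is satisfied.
        return True

    with DAG(
        dag_id="contract_gated_pipeline",  # made-up DAG id
        start_date=datetime(2023, 1, 1),
        schedule="@daily",
    ) as dag:
        check = ShortCircuitOperator(
            task_id="contract_is_valid",
            python_callable=validate_contract,  # False skips everything downstream
        )
        transform = EmptyOperator(task_id="transform")  # placeholder for real work
        check >> transform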

New users starting with Airflow frequently encounter several challenges, ranging from the complexities of containers and virtual environments to Python dependency hell. Moreover, their familiarity with tools such as Docker, docker-compose, and Helm might be limited, and those tools can even be overkill. In contrast, seasoned Airflow users encounter their own problems, including configuration conflicts with ongoing Airflow projects, intricacies stemming from Docker and docker-compose configurations, and a lack of visibility into all of their projects. With airflowctl, users can install and set up Airflow using a single command. Existing users can use it to manage multiple Airflow projects with different Airflow versions on the same machine, which allows creating and debugging DAGs in an IDE seamlessly. Agenda: Why airflowctl? Goal, current functionality and demo, vision/roadmap.

Airflow is a popular choice for organizations looking to integrate open-source dbt within their existing data infrastructure. This talk will explore two primary methods of running dbt in Airflow: job-level and model-level. We’ll discuss the tradeoffs associated with each approach, highlighting the simplicity and efficiency of job-level orchestration, contrasted with the enhanced observability and control provided by model-level orchestration. We’ll also explain how the balance has shifted in recent years, with improvements to dbt core making model-level more efficient and innovative Airflow extensions like Cosmos making it easier to implement. Finally, we’ll provide benchmarks to help you determine which paradigm is the best fit for your needs.
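
For reference, a minimal sketch of the job-level approach, where a single Airflow task runs an entire dbt invocation via the CLI (paths, profile location, and schedule are made up); model-level orchestration with an extension like Cosmos would instead map each dbt model to its own Airflow task:

    # Sketch of job-level dbt orchestration: one task, one `dbt run` invocation.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="dbt_job_level",  # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule="@daily",
    ) as dag:
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command=(
                "dbt run --project-dir /opt/dbt/my_project "
                "--profiles-dir /opt/dbt/profiles"
            ),
        )
        # Simple and efficient, but Airflow only sees one task: a single failed
        # model fails (and retries) the whole job, and there is no per-model
        # visibility -- the tradeoff that model-level orchestration addresses.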

Apache Airflow has over 650 Python dependencies. In case you did not know already, dependencies in Python are a difficult subject, and Airflow has its own custom ways of managing them. Airflow has a rather complex system for managing dependencies in its CI system, but this talk is not about that. This talk is directed at users of Airflow who want to keep their dependencies updated, describing the ways they can do it. This presentation will explain how to effectively manage and handle custom dependencies in Airflow. Jarek will guide you through practical solutions and best practices to make your Airflow experience with dependencies - yes, you guessed it - a breeze.
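
One well-known pattern for handling custom dependencies (my example, not necessarily the talk's recommendation) is to isolate them per task in a virtualenv, so they cannot conflict with Airflow's own dependency set; the package pin below is arbitrary:

    # Sketch: isolate a task's custom Python dependencies from Airflow's own
    # ~650 dependencies by running the task in a dedicated virtualenv.
    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(start_date=datetime(2023, 1, 1), schedule=None)
    def custom_dependencies():
        @task.virtualenv(requirements=["pandas==2.1.4"], system_site_packages=False)
        def transform() -> int:
            # Imports happen inside the virtualenv, so they cannot clash with
            # the packages installed in the main Airflow environment.
            import pandas as pd
            return int(pd.Series([1, 2, 3]).sum())

        transform()

    custom_dependencies()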

Introduced in Airflow 2.4, Datasets are a foundational feature for authoring modular data pipelines. As DAGs grow to encompass a larger number of data sources and multiple data transformation steps, they typically become less predictable in the timeliness of execution and less efficient. This talk focuses on leveraging Datasets to enable predictable and more efficient DAGs, by borrowing patterns from microservice architectures. Just as large monolithic applications were decomposed into microservices to deliver more efficient scalability and faster development cycles, micropipelines have the same potential to radically transform data pipeline efficiency and development velocity. Using a simple financial analysis pipeline example, with data aggregation done in Snowflake and prediction analysis in Spark, this talk outlines how to retain the timeliness of data pipelines as data sets expand.
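
As a simple illustration of the decomposition (independent of the Snowflake/Spark example in the talk), a producer DAG and a consumer DAG connected only through a Dataset; the URI is a made-up example:

    # Sketch of a micropipeline split (Airflow 2.4+): the consumer DAG is
    # scheduled by the Dataset rather than by time, so each piece stays small
    # and runs as soon as its input is fresh.
    from datetime import datetime

    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    daily_aggregates = Dataset("s3://example-bucket/aggregates/daily")

    @dag(start_date=datetime(2023, 1, 1), schedule="@daily")
    def producer():
        @task(outlets=[daily_aggregates])
        def aggregate():
            print("writing daily aggregates")  # placeholder for the real work

        aggregate()

    @dag(start_date=datetime(2023, 1, 1), schedule=[daily_aggregates])
    def consumer():
        @task
        def predict():
            print("running predictions on fresh aggregates")

        predict()

    producer()
    consumer()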

session
by Jarek Potiuk (Apache Software Foundation), Vincent Beck

This session is about the current state of the implementation of Airflow's multi-tenancy feature. This is a long-term effort that involves multiple changes and separate AIPs to implement, with the long-term vision of a single Airflow instance supporting multiple independent teams - either from the same company or as part of an Airflow-as-a-Service offering.

Apache Airflow is one of the largest Apache projects by many metrics, but it ranks particularly high in the number of contributors involved in the project. This leads to hundreds of GitHub issues, pull requests, and discussions being submitted to the project every month, so it is critical to have an ample number of committers to support the community. In this talk I will summarize my personal experience working towards, and ultimately achieving, committer status in Apache Airflow. I’ll cover the lessons I learned along the way, as well as provide some advice and best practices to help others achieve committer status themselves.

With native support for OpenLineage in Airflow, users can now observe and manage their data pipelines with ease. This talk will cover the benefits of using OpenLineage, how it is implemented in Airflow, practical examples of how to take advantage of it, and what’s in our roadmap. Whether you’re an Airflow user or provider maintainer, this session will give you the knowledge to make the most of this tool.

Operators form the core of the language of Airflow. In this talk I will argue that while they have served their purpose, they are now holding back the development of Airflow, and that if Airflow wants to stay relevant in the world of the ’new’ data stack (hint: it isn’t currently considered to be part of it) and self-service data mesh, it needs to kill its darling.