talk-data.com

Topic: Airflow (Apache Airflow)

Tags: workflow_management, data_orchestration, etl

81 tagged activities

Activity Trend: peak of 157 activities/quarter (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: Airflow Summit 2023

session
by Niko Oliveira (Amazon | Apache Airflow Committer)

Executors are a core concept in Apache Airflow and an essential piece of DAG execution. They have seen a lot of investment over the years, and there are many exciting advancements that will benefit both users and contributors. This talk will briefly discuss executors, how they work, and what they are responsible for. It will then describe Executor Decoupling (AIP-51) and how this has fully unlocked development of third-party executors. We’ll touch on the migration of “core” executors (such as Celery and Kubernetes) to their own packages, as well as the addition of new third-party executors from providers such as AWS. Finally, we’ll describe and demo Hybrid Executors, a proposed new feature that allows multiple executors to be used natively and seamlessly side by side within a single Airflow environment, which will be a powerful capability in a future full of many new Airflow executors.
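
Per-task executor selection was still only a proposal at the time of this talk, so the sketch below is an assumption of how hybrid executors might be configured rather than a settled API; the `executor` argument and the comma-separated config syntax are illustrative.

```python
# Hypothetical sketch of hybrid executors; syntax is an assumption, not a
# stable API. airflow.cfg would list several executors, e.g.:
#   [core]
#   executor = CeleryExecutor,KubernetesExecutor
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2023, 9, 1), catchup=False)
def hybrid_executor_demo():
    @task(executor="CeleryExecutor")  # light work stays on Celery workers
    def light_task():
        return "small payload"

    @task(executor="KubernetesExecutor")  # isolated pod for the heavy step
    def heavy_task(payload: str):
        print(f"processing {payload}")

    heavy_task(light_task())


hybrid_executor_demo()
```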

Making a contribution to or becoming a committer on Airflow can be a daunting task, even for experienced Python developers and Airflow users. The sheer size and complexity of the code base may discourage potential contributors from taking the first steps. To help alleviate this issue, this session is designed to provide a better understanding of how Airflow works and build confidence in getting started. During the session, we will introduce the main components of Airflow, including the Web Server, Scheduler, and Workers. We will also cover key concepts such as DAGs, DAG-run objects, Tasks, and Task Instances. Additionally, we will explain how tasks communicate with each other using XComs, and discuss the frequency of DAG runs based on the schedule. To showcase changes in the state of various objects, we will dive into the code level and continuously share the state of the database at every important checkpoint.
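
As a minimal illustration of the concepts named above (DAGs, tasks, task instances, XComs, and schedules), here is a small TaskFlow-style DAG in which return values travel between tasks via XCom:

```python
import pendulum
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def xcom_demo():
    @task
    def extract() -> dict:
        # The return value is pushed to XCom for this task instance
        return {"rows": 42}

    @task
    def report(payload: dict):
        # Passing the value between tasks pulls it from XCom automatically
        print(f"extracted {payload['rows']} rows")

    # Each daily DAG run creates one task instance per task
    report(extract())


xcom_demo()
```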

Airflow uses SQLAlchemy under the hood but up to this point has not exploited the tool’s capacity to produce detailed metadata about queries, tables, columns, and more. In fact, SQLAlchemy ships with an event listener that, in conjunction with OpenLineage, offers tantalizing possibilities for enhancing the development process – specifically in the areas of monitoring and debugging. SQLAlchemy’s event system features a Session object and ORMExecuteState mapped class that can be used to intercept statement executions and emit OpenLineage RunEvents as executions occur. In this talk, Michael Robinson from the community team at Astronomer will provide an overview and demo of new SQLAlchemyCollector and OpenLineageAdapter classes for leveraging SQLAlchemy’s event system to emit OpenLineage events as DAGs run.
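
The SQLAlchemyCollector and OpenLineageAdapter classes are the talk’s own new work, so the sketch below only illustrates the underlying mechanism: SQLAlchemy’s do_orm_execute session event intercepting statement executions and emitting an OpenLineage RunEvent. The endpoint URL and job naming are placeholders.

```python
import uuid
from datetime import datetime, timezone

from sqlalchemy import event
from sqlalchemy.orm import Session
from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder endpoint


@event.listens_for(Session, "do_orm_execute")
def emit_lineage(orm_execute_state):
    # ORMExecuteState exposes the statement about to be executed
    client.emit(
        RunEvent(
            eventType=RunState.START,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=Run(runId=str(uuid.uuid4())),
            job=Job(namespace="airflow", name=str(orm_execute_state.statement)[:250]),
            producer="sqlalchemy-listener-demo",  # illustrative producer name
        )
    )
```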

Cluster Policies are an advanced Airflow feature composed of a set of hooks that allow cluster administrators to implement checks and mutations against certain core Airflow constructs (DAGs, Tasks, Task Instances, Pods). In this talk, we will discuss how cluster administrators can leverage these functions in order to better govern the workloads that are running in their environments.
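
A minimal sketch of what such policies can look like: hooks defined in airflow_local_settings.py that mutate and validate tasks and DAGs at parse time (the specific rules here are illustrative).

```python
from airflow.exceptions import AirflowClusterPolicyViolation
from airflow.models.baseoperator import BaseOperator


def task_policy(task: BaseOperator) -> None:
    # Mutation: enforce a cluster-wide retry floor
    if task.retries < 1:
        task.retries = 1
    # Check: reject tasks that keep the default owner
    if not task.owner or task.owner == "airflow":
        raise AirflowClusterPolicyViolation(
            f"Task {task.task_id} must declare a real owner, got {task.owner!r}"
        )


def dag_policy(dag) -> None:
    # Checks can also run against whole DAGs, e.g. requiring tags
    if not dag.tags:
        raise AirflowClusterPolicyViolation(f"DAG {dag.dag_id} must have tags")
```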

OpenTelemetry is a vendor-neutral, open-source (CNCF) observability framework that is supported by many vendors industry-wide. It is used to instrument, generate, collect, and export data within systems; that data is then ingested by analytics tools that provide tracing, metrics, and logs. It has long been the plan to adopt the OTel standard within Airflow, allowing builders and users to take advantage of valuable data that could help improve the efficiency, cost, and performance of their systems. Let us talk about the journey that started a few years ago to bring this dream to reality.
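
For readers unfamiliar with OTel itself, here is a self-contained sketch of its Python tracing API, independent of Airflow’s eventual integration; the tracer name and span attribute are placeholders:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer provider that prints finished spans to stdout
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("airflow.demo")  # placeholder instrumentation name

with tracer.start_as_current_span("scheduler_loop") as span:
    span.set_attribute("dag_id", "example_dag")  # context for analysis tools
```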

ETL data pipelines are the bread and butter of data teams that must design, develop, and author DAGs to accommodate the various business requirements. dbt is becoming one of the most used tools to perform SQL transformations on the data warehouse, allowing teams to harness the power of queries at scale. Airflow users are constantly finding new ways to integrate dbt with the Airflow ecosystem and build a single pane of glass where data engineers can manage and administer their pipelines. Astronomer Cosmos, an open-source product, has been introduced to integrate Airflow with dbt Core seamlessly. Now you can easily see your dbt pipelines fully integrated with Airflow. You will learn:
- How to integrate dbt Core with Airflow
- How to use Cosmos
- How to build data pipelines at scale
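
A hedged sketch of what a Cosmos-based DAG can look like; exact parameter names vary across astronomer-cosmos versions, and the project paths are placeholders:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

dbt_dag = DbtDag(
    dag_id="jaffle_shop",
    project_config=ProjectConfig("/usr/local/airflow/dbt/jaffle_shop"),  # placeholder path
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="prod",
        profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",  # placeholder path
    ),
    schedule="@daily",  # Cosmos renders each dbt model as its own Airflow task
    start_date=datetime(2023, 9, 1),
    catchup=False,
)
```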

Airflow is a household brand in data engineering: It is readily familiar to most data engineers, quick to set up, and, as proven by millions of data pipelines powered by it since 2014, it can keep DAGs running. But with the increasing demands of ML, there is a pressing need for tools that meet data scientists where they are and address two pressing issues - improving the developer experience & minimizing operational overhead. In this talk, we discuss the problem space and the approach to solving it with Metaflow, the open-source framework we developed at Netflix, which now powers thousands of business-critical ML projects at Netflix & other companies. We wanted to provide data scientists with the best possible UX, allowing them to focus on parts they like (e.g., modeling) while providing robust solutions for the foundational infrastructure: data, compute, orchestration (using Airflow), & versioning. In this talk, we will demo our latest work that builds on top of Airflow.
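
For context, a minimal Metaflow flow looks like the following; Metaflow can compile flows like this into Airflow DAGs (treat the exact compile command in the comment as an assumption about the current CLI):

```python
# Compile to an Airflow DAG with something like:
#   python train_flow.py airflow create dag.py
from metaflow import FlowSpec, step


class TrainFlow(FlowSpec):
    @step
    def start(self):
        self.alpha = 0.1  # artifacts like this are versioned automatically
        self.next(self.train)

    @step
    def train(self):
        print(f"training with alpha={self.alpha}")
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    TrainFlow()
```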

Airflow’s KubernetesExecutor has supported multi_namespace_mode for a long time. This feature is great at allowing Airflow jobs to run in different namespaces on the same Kubernetes cluster for better isolation and easier management. However, it requires a cluster role for the Airflow scheduler, which can create security problems or be a blocker for some users. PR https://github.com/apache/airflow/pull/28047 , which will become available in Airflow 2.6.0, resolves this issue by allowing Airflow users to specify multi_namespace_mode_namespace_list when using multi_namespace_mode, so that no cluster role is needed and users only need to ensure the scheduler has permissions in the listed namespaces rather than in all namespaces on the Kubernetes cluster. This talk aims to help you better understand KubernetesExecutor and how to set it up in a more secure manner.
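
A configuration sketch under the assumptions above (Airflow 2.6+; the config section name has changed across releases, so verify against your version):

```python
# Typically set in airflow.cfg:
#   [kubernetes_executor]
#   multi_namespace_mode = True
#   multi_namespace_mode_namespace_list = team-a,team-b
# or via environment variables in the scheduler deployment, shown here in
# Python only for illustration:
import os

os.environ["AIRFLOW__KUBERNETES_EXECUTOR__MULTI_NAMESPACE_MODE"] = "True"
os.environ["AIRFLOW__KUBERNETES_EXECUTOR__MULTI_NAMESPACE_MODE_NAMESPACE_LIST"] = (
    "team-a,team-b"  # scheduler then needs Roles only in these namespaces
)
```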

Much of the world sees Airflow as a hammer and ETL tasks as nails, but in reality, Airflow is much more of a sophisticated multitool, capable of orchestrating a wide variety of complex workflows. Astronomer’s Customer Reliability Engineering (CRE) team is leveraging this potential in its development of Airline, a tool powered by Airflow that monitors Airflow deployments and sends alerts proactively when issues arise. In this talk, Ryan Hatter from Astronomer will give an overview of Airline. He’ll explain how it integrates with ZenDesk, Kubernetes, and other services to resolve customers’ problems more quickly, and in many cases, even before customers realize there’s an issue. Join us for a practical exploration of Airflow’s capabilities beyond ETL, and learn how proactive, automated monitoring can enhance your operations.

Amazon Managed Workflows for Apache Airflow (MWAA) was released in November 2020. Throughout MWAA’s design we held the tenets that this service would be open-source first, not forking or deviating from the project, and that the MWAA team would focus on improving Airflow for everyone—whether they run Airflow on MWAA, on AWS, or anywhere else. This talk will cover some of the design choices made to facilitate those tenets, how the organization was set up to contribute back to the community, what those contributions look like today, how we’re getting those contributions in the hands of users, and our vision for future engagement with the community.

Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integrations and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions. This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and add additional Airflow providers to make it easier to interact with LLMs such as those from OpenAI (e.g., GPT-4) and those on HuggingFace, while working with both structured and unstructured data. In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
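
As a hedged sketch of this pattern, the DAG below pulls documents, calls an LLM, and loads the result; the OpenAI call follows the openai-python v1 client, while the document source, model choice, and sink are placeholders:

```python
import pendulum
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=pendulum.datetime(2023, 9, 1), catchup=False)
def llm_enrichment():
    @task
    def extract_docs() -> list[str]:
        return ["quarterly report text ..."]  # stand-in for a real source

    @task
    def summarize(docs: list[str]) -> list[str]:
        from openai import OpenAI  # in-task import keeps DAG parsing fast

        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        return [
            client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": f"Summarize: {doc}"}],
            ).choices[0].message.content
            for doc in docs
        ]

    @task
    def load(summaries: list[str]):
        print(summaries)  # stand-in for a warehouse write

    load(summarize(extract_docs()))


llm_enrichment()
```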

Volunteers in Saint Louis are using Airflow to build an open source data warehouse of real estate data (permits, assessments, violations, etc), with an eye towards creating a national open data standard. This talk will focus on the unique challenges of running an open source data warehouse, and what it looks like to work with volunteers to create data pipelines.

In 2022, cloud data centres accounted for up to 3.7% of global greenhouse gas emissions, exceeding those of aviation and shipping. Yet in the same year, Britain wasted 4 terawatt-hours of renewable energy because it couldn’t be transported from where it was generated to where it was needed. So why not move the cloud to the clean energy? VertFlow is an Airflow operator that deploys workloads to the greenest Google Cloud data centre, based on the real-time carbon intensity of electricity grids worldwide. At Ovo Energy, many of our batch workloads, like generation forecasts, don’t have latency or data residency requirements, so they can run anywhere. We use VertFlow to let them chase the sun to wherever energy is greenest, helping us save carbon on our mission to save carbon. VertFlow is available on PyPI: https://pypi.org/project/VertFlow/ Find out more at https://cloud.google.com/blog/topics/sustainability/ovo-energy-builds-greener-software-with-google-cloud
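
A sketch of the idea rather than VertFlow’s actual API; the import path, operator name, and parameters below are assumptions, so check the PyPI project linked above for the real signature:

```python
import pendulum
from airflow.decorators import dag

# Hypothetical import: VertFlow ships an Airflow operator that picks the
# Google Cloud region with the lowest real-time carbon intensity.
from vertflow.operator import VertFlowOperator  # assumed module path


@dag(schedule="@daily", start_date=pendulum.datetime(2023, 9, 1), catchup=False)
def green_forecast():
    VertFlowOperator(
        task_id="generation_forecast",
        image="europe-docker.pkg.dev/my-project/forecast:latest",  # placeholder
        # No latency or residency constraint, so several regions are allowed:
        allowed_regions=["europe-west1", "europe-north1"],  # assumed parameter
    )


green_forecast()
```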

A steady rise in users and business-critical workflows poses challenges to development and production workflows. The solution: enable multi-tenancy on our single Airflow instance. We needed to let teams manage their own Python requirements and ensure DAGs were insulated from each other. To achieve this we divided our monolithic setup into three parts: infrastructure (with common code packaging), workspace creation, and CI/CD to manage deployments. Backstage templates enable teams to create isolated development environments that resemble our production environment, ensuring consistency. Distributing common code via a private PyPI gives teams more control over what code their DAGs run. And a PythonOperator shim in production utilizes virtualenv to run Python code with each team’s defined requirements for their DAG. In doing these things we enable effective multi-tenancy and facilitate easier development and production workflows for Airflow.
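
A simplified sketch of the shim idea using Airflow’s stock PythonVirtualenvOperator (the Backstage templating and private PyPI wiring are out of scope; the package pin is a placeholder):

```python
import pendulum
from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def team_a_job():
    import pandas as pd  # resolved inside the task's own virtualenv

    print(pd.__version__)


with DAG(
    dag_id="team_a_pipeline",
    schedule=None,
    start_date=pendulum.datetime(2023, 9, 1),
    catchup=False,
):
    PythonVirtualenvOperator(
        task_id="team_a_task",
        python_callable=team_a_job,
        requirements=["pandas==2.0.3"],  # per-team pins, e.g. from a lockfile
        system_site_packages=False,  # insulate from the shared environment
    )
```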

DAG Authoring: learn how to go beyond the basics and apply best practices when implementing Airflow DAGs. This will be a survival guide for Airflow DAG developers who need to cope with hundreds of Airflow operators. The session goes beyond a 101 or “for dummies” introduction and will be of interest both to those who are just starting to develop Airflow DAGs and to Airflow experts, as it will help them improve their productivity.

As big Airflow users grow their usage to hundreds of DAGs, parsing them can become a performance bottleneck in the scheduler. In this talk, we’ll explore how this situation was improved by using caching techniques and pre-processing of DAGs to minimize the overhead of parsing them at runtime. We’ll also touch on how the performance of the existing code was analyzed to find points of improvement. We may include a section on how to configure Airflow to benefit from those recent changes, and some tips on how to make DAGs that are quick to parse, but this will not be the core of the talk. The talk is intended for contributors and anyone interested in working on performance improvement in general.
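
One of the parsing tips alluded to can be shown in a few lines: keep expensive imports and I/O out of top-level DAG code so the DAG file processor stays fast. The config-service URL below is a placeholder:

```python
import pendulum
from airflow.decorators import dag, task

# SLOW to parse: top-level code runs on every DAG file processor cycle
# import requests
# CONFIG = requests.get("https://config-service/dags").json()


@dag(schedule="@hourly", start_date=pendulum.datetime(2023, 9, 1), catchup=False)
def fast_to_parse():
    @task
    def fetch_config() -> dict:
        import requests  # heavy imports and I/O deferred to task runtime

        return requests.get("https://config-service/dags", timeout=10).json()

    fetch_config()


fast_to_parse()
```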

This talk is speculative: orchestration tools like Airflow have made it very easy to pull and push data from anywhere to everywhere. But we don’t know what data we are pushing around. What if we had a schema language that we could use to describe this data? Not in terms of data types, but in terms of sensitivity and instructions on how to handle it? This talk is about the headaches companies face day to day, and about whether there’s an opportunity for the Airflow community to help solve this problem.
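
A purely hypothetical illustration of what such annotations could look like; every name below is invented for the example:

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    PII = "pii"


@dataclass
class Field:
    name: str
    dtype: str
    sensitivity: Sensitivity
    handling: str  # an instruction a pipeline could enforce, not just a type


customer_schema = [
    Field("customer_id", "string", Sensitivity.INTERNAL, "ok to join on"),
    Field("email", "string", Sensitivity.PII, "mask before leaving the EU region"),
]
```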

This talk will cover, at a high level, the architecture of a data product DAG, its benefits in a data mesh world, and how to implement it easily. Airflow is the de-facto orchestrator we use at Astrafy for all our data engineering projects. Over the years we have developed deep expertise in orchestrating data jobs, and recently we have adopted the “data mesh” paradigm of having one Airflow DAG per data product. Our standard data product DAGs contain the following stages:
- Data contract: check the integrity of data before transforming it
- Data transformation: applies dbt transformations via a Kubernetes pod operator
- Data distribution: mainly informing downstream applications that new data is available to be consumed
For use cases where different data products need to finish before triggering another data product, we have a mechanism with an engine in between that keeps track of finished DAGs and triggers DAGs based on a mapping table containing data product dependencies.
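
A skeleton of such a three-stage data product DAG; task bodies, the image, and the dbt selector are placeholders, and the exact KubernetesPodOperator import path depends on your cncf-kubernetes provider version:

```python
import pendulum
from airflow.decorators import dag, task
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator


@dag(schedule="@daily", start_date=pendulum.datetime(2023, 9, 1), catchup=False)
def customer_data_product():
    @task
    def data_contract():
        ...  # validate incoming data against the contract before transforming

    transform = KubernetesPodOperator(
        task_id="dbt_transform",
        name="dbt-transform",
        image="my-registry/dbt-runner:latest",  # placeholder image
        cmds=["dbt", "run", "--select", "customer"],  # placeholder selector
    )

    @task
    def data_distribution():
        ...  # notify downstream consumers that fresh data is available

    data_contract() >> transform >> data_distribution()


customer_data_product()
```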

In large organizations, data workflows can be complex and interconnected, with multiple dependencies and varied runtime requirements. To ensure efficient and timely execution of workflows, it is important to understand the factors that affect the performance of the system, such as network congestion, resource availability, and DAG structure. In this talk, we will explore how delay modeling and DAG connectivity analysis can be used to optimize Airflow performance in large organizations. We will present a network analysis of an Airflow instance with multiple interconnected DAGs, and demonstrate how delay modeling can be used to estimate maximum delay and identify bottlenecks in the system. We will also discuss how the delay model can be used to optimize runtime and improve overall system performance.
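
A toy version of this kind of analysis, using networkx as an assumed tool (not necessarily the speakers’): model tasks as a weighted DAG and take the longest path as the maximum end-to-end delay.

```python
import networkx as nx

g = nx.DiGraph()
# Edges weighted by the downstream task's expected runtime in minutes
g.add_weighted_edges_from([
    ("ingest", "clean", 5),
    ("clean", "join", 12),
    ("other_source", "join", 3),
    ("join", "report", 7),
])

# The longest weighted path is the critical path; its length bounds the delay
critical_path = nx.dag_longest_path(g, weight="weight")
max_delay = nx.dag_longest_path_length(g, weight="weight")
print(critical_path, max_delay)  # ['ingest', 'clean', 'join', 'report'] 24
```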