talk-data.com

Event

Airflow Summit 2023

2023-07-01 Airflow Summit

Activities tracked

90

Airflow Summit 2023 program

Sessions & talks

Showing 26–50 of 90 · Newest first


A Single Pane of Glass on Airflow using Astro Python SDK, Snowflake, dbt, and Cosmos

2023-07-01
session

ETL data pipelines are the bread and butter of data teams that must design, develop, and author DAGs to accommodate various business requirements. dbt is becoming one of the most used tools for SQL transformations on the data warehouse, allowing teams to harness the power of queries at scale. Airflow users are constantly finding new ways to integrate dbt with the Airflow ecosystem and build a single pane of glass where data engineers can manage and administer their pipelines. Astronomer Cosmos, an open-source product, has been introduced to integrate Airflow with dbt Core seamlessly, so you can easily see your dbt pipelines fully integrated in Airflow. You will learn: how to integrate dbt Core with Airflow, how to use Cosmos, and how to build data pipelines at scale.

Better Airflow with Metaflow: A modern human-centric ML infrastructure stack

2023-07-01
session

Airflow is a household brand in data engineering: it is readily familiar to most data engineers, quick to set up, and, as proven by the millions of data pipelines powered by it since 2014, it can keep DAGs running. But with the increasing demands of ML, there is a growing need for tools that meet data scientists where they are and address two pressing issues: improving the developer experience and minimizing operational overhead. In this talk, we discuss the problem space and our approach to solving it with Metaflow, the open-source framework we developed at Netflix, which now powers thousands of business-critical ML projects at Netflix and other companies. We wanted to provide data scientists with the best possible UX, allowing them to focus on the parts they like (e.g., modeling) while providing robust solutions for the foundational infrastructure: data, compute, orchestration (using Airflow), and versioning. In this talk, we will demo our latest work that builds on top of Airflow.

Better Support for Using Multiple Namespaces with KubernetesExecutor

2023-07-01
session

Airflow’s KubernetesExecutor has supported multi_namespace_mode for a long time. This feature is great at allowing Airflow jobs to run in different namespaces on the same Kubernetes cluster for better isolation and easier management. However, it requires a cluster role for the Airflow scheduler, which can create security problems or be a blocker for some users. PR https://github.com/apache/airflow/pull/28047 , which will become available in Airflow 2.6.0, resolves this issue by allowing Airflow users to specify multi_namespace_mode_namespace_list when using multi_namespace_mode, so that no cluster role is needed and users only need to ensure the scheduler has permissions on certain namespaces rather than on all namespaces in the Kubernetes cluster. This talk aims to help you better understand KubernetesExecutor and how to set it up in a more secure manner.
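As a rough sketch, the more secure setup the talk describes might look like this in airflow.cfg (namespace names are placeholders; these options live under [kubernetes] in Airflow 2.6 and were moved to a [kubernetes_executor] section in later releases):

```ini
[kubernetes]
# Allow the KubernetesExecutor to launch pods in more than one namespace
multi_namespace_mode = True
# With an explicit list, the scheduler only needs Role/RoleBinding permissions
# in these namespaces instead of a cluster-wide ClusterRole
multi_namespace_mode_namespace_list = team-a,team-b,team-c
```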

Beyond Data Engineering: Airflow for Operations

2023-07-01
session

Much of the world sees Airflow as a hammer and ETL tasks as nails, but in reality, Airflow is much more of a sophisticated multitool, capable of orchestrating a wide variety of complex workflows. Astronomer’s Customer Reliability Engineering (CRE) team is leveraging this potential in its development of Airline, a tool powered by Airflow that monitors Airflow deployments and sends alerts proactively when issues arise. In this talk, Ryan Hatter from Astronomer will give an overview of Airline. He’ll explain how it integrates with Zendesk, Kubernetes, and other services to resolve customers’ problems more quickly, and in many cases, even before customers realize there’s an issue. Join us for a practical exploration of Airflow’s capabilities beyond ETL, and learn how proactive, automated monitoring can enhance your operations.

Building a Commercial Service with Open-Source Community Focus, presented by AWS

2023-07-01
session

Amazon Managed Workflows for Apache Airflow (MWAA) was released in November 2020. Throughout MWAA’s design we held the tenets that this service would be open-source first, not forking or deviating from the project, and that the MWAA team would focus on improving Airflow for everyone—whether they run Airflow on MWAA, on AWS, or anywhere else. This talk will cover some of the design choices made to facilitate those tenets, how the organization was set up to contribute back to the community, what those contributions look like today, how we’re getting those contributions in the hands of users, and our vision for future engagement with the community.

Building and deploying LLM applications with Apache Airflow

2023-07-01
session

Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integration and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions. This talk focuses on design patterns for using Apache Airflow to support LLM applications built on private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and add Airflow providers that make it easier to interact with LLMs such as those from OpenAI (e.g., GPT-4) and those on Hugging Face, while working with both structured and unstructured data. In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.

Building an Open Source Data Warehouse

2023-07-01
session

Volunteers in Saint Louis are using Airflow to build an open source data warehouse of real estate data (permits, assessments, violations, etc), with an eye towards creating a national open data standard. This talk will focus on the unique challenges of running an open source data warehouse, and what it looks like to work with volunteers to create data pipelines.

Change Management Done Right Across Environments and Tools: DAGs, datasets and visualizations

2023-07-01
session

Change management in data teams can be challenging, to say the least. Not only do you have to evolve your data pipelines, data structures, and the datasets themselves across environments, you also have to keep data exploration and visualization tools in sync. In this talk, we’ll explore how to do this best across environments (i.e., dev, staging, and prod), talking about how CI/CD can help, implementing good DataOps practices, and cranking up the level of rigor where it matters. We’ll also talk about rigor-vs-speed tradeoffs, where clearly not all data pipelines are born equal, and how to think about evolving the level of rigor over time in the places where it matters most.

Chase The Sun: Build greener DAGs with VertFlow

2023-07-01
session

In 2022, cloud data centres accounted for up to 3.7% of global greenhouse gas emissions, exceeding those of aviation and shipping. Yet in the same year, Britain wasted 4 terawatt-hours of renewable energy because it couldn’t be transported from where it was generated to where it was needed. So why not move the cloud to the clean energy? VertFlow is an Airflow operator that deploys workloads to the greenest Google Cloud data centre, based on the real-time carbon intensity of electricity grids worldwide. At OVO Energy, many of our batch workloads, like generation forecasts, don’t have latency or data residency requirements, so they can run anywhere. We use VertFlow to let them chase the sun to wherever energy is greenest, helping us save carbon on our mission to save carbon. VertFlow is available on PyPI: https://pypi.org/project/VertFlow/ Find out more at https://cloud.google.com/blog/topics/sustainability/ovo-energy-builds-greener-software-with-google-cloud

Circumventing Airflow's Limitations around Multitenancy

2023-07-01
session
Akshay Battaje (Wealthsimple Inc.) , Anthony Kalsatos (Wealthsimple Inc)

A steady rise in users and business-critical workflows poses challenges to development and production workflows. The solution: enable multi-tenancy on our single Airflow instance. We needed to enable teams to manage their own Python requirements and ensure DAGs were insulated from each other. To achieve this, we divided our monolithic setup into three parts: infrastructure (with common code packaging), workspace creation, and CI/CD to manage deployments. Backstage templates enable teams to create isolated development environments that resemble our production environment, ensuring consistency. Distributing common code via a private PyPI gives teams more control over what code their DAGs run. And a PythonOperator shim in production utilizes virtualenv to run Python code with each team’s defined requirements for their DAG. In doing these things, we enable effective multi-tenancy and facilitate easier development and production workflows for Airflow.

Cross Environment Event Based Triggers with Airflow

2023-07-01
session

In an environment with multiple Airflow instances, how we built custom operators and a framework to share events across the instances and trigger DAGs based on those events.
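One common building block for this kind of cross-instance triggering is Airflow's stable REST API endpoint `POST /api/v1/dags/{dag_id}/dagRuns`. The helper below only builds the request and is a hypothetical sketch, not the framework from the talk; authentication headers (basic auth or a bearer token, depending on the deployment) are omitted.

```python
import json
from urllib import request


def build_trigger_request(base_url: str, dag_id: str, conf: dict) -> request.Request:
    """Build a POST request asking a remote Airflow instance to start a DAG run.

    Targets Airflow's stable REST API; the `conf` dict is passed through to
    the triggered DAG run, which is one way to carry event payloads across
    instances.
    """
    url = f"{base_url}/api/v1/dags/{dag_id}/dagRuns"
    payload = json.dumps({"conf": conf}).encode()
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A custom operator would send this request (e.g. via `urllib.request.urlopen`) from the upstream instance whenever the event of interest occurs.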

DAG Authoring without PhD, presented by Google Cloud

2023-07-01
session

DAG Authoring: learn how to go beyond the basics and best practices when implementing Airflow DAGs. This session will be a survival guide for Airflow DAG developers who need to cope with hundreds of Airflow operators. It will go beyond a 101 or “for dummies” session and will be of interest both to those who are just starting to develop Airflow DAGs and to Airflow experts, as it will help them improve their productivity.

DAG Parsing Optimizations

2023-07-01
session

As big Airflow users grow their usage to hundreds of DAGs, parsing them can become a performance bottleneck in the scheduler. In this talk, we’ll explore how this situation was improved by using caching techniques and pre-processing of DAGs to minimize the overhead of parsing them at runtime. We’ll also touch on how the performance of the existing code was analyzed to find points of improvement. We may include a section on how to configure Airflow to benefit from those recent changes, and some tips on how to make DAGs that are quick to parse, but this will not be the core of the talk. The talk is intended for contributors and anyone interested in working on performance improvement in general.

Data about data: the case for a privacy centric schema language

2023-07-01
session

This talk is speculative: orchestration tools like Airflow have made it very easy to pull and push data from anywhere to everywhere. But we don’t know what data we are pushing around. What if we had a schema language that we could use to describe this data, not in terms of data types but in terms of sensitivity, with instructions on how to handle it? This talk is about the headaches companies face day to day, and how there may be an opportunity for the Airflow community to help solve this problem.

Data at Rest: Bringing granular quality into flowing pipelines

2023-07-01
session

You’ve got your pipelines flowing … how much do you know about the data inside? Most teams have some coverage with unit/contract/expectations tests, and you might have other quality checks, but it can be very ad hoc and disorganized. You want to do more to beef up data quality and observability … does that mean you just need to write more tests and assertions? Come learn about the best way to see your data’s quality alongside DAGs in a familiar context. We’ll review three common tools to get a handle on quality in a cohesive way across all your DAGs: Great Expectations, Monte Carlo, and Databand.

Data Product DAGs

2023-07-01
session

This talk will give a high-level overview of the architecture of a data product DAG, its benefits in a data mesh world, and how to implement it easily. Airflow is the de facto orchestrator we use at Astrafy for all our data engineering projects. Over the years we have developed deep expertise in orchestrating data jobs, and recently we have adopted the “data mesh” paradigm of having one Airflow DAG per data product. Our standard data product DAGs contain the following stages: a data contract stage that checks the integrity of data before transforming it; a data transformation stage that applies dbt transformations via a Kubernetes pod operator; and a data distribution stage that mainly informs downstream applications that new data is available to be consumed. For use cases where different data products need to finish before triggering another data product, we have a mechanism with an engine in between that keeps track of finished DAGs and triggers DAGs based on a mapping table containing data product dependencies.
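The mapping-table mechanism described above can be sketched as a small pure-Python engine. The function and field names here are illustrative assumptions, not Astrafy's implementation:

```python
def products_ready_to_trigger(
    finished: set[str],
    already_triggered: set[str],
    dependencies: dict[str, list[str]],
) -> list[str]:
    """Return data products whose upstream products have all finished.

    `dependencies` plays the role of the mapping table from the talk:
    each data product is mapped to the products it depends on. Products
    already triggered are skipped so the engine can be run repeatedly.
    """
    return sorted(
        product
        for product, upstreams in dependencies.items()
        if product not in already_triggered and set(upstreams) <= finished
    )
```

In a real deployment the engine would persist the `finished` set (e.g. in a database updated by each data product DAG's final task) and call the Airflow API to trigger each returned DAG.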

Deferrable Operators

2023-07-01
session
AWS

Deep dive into how AWS is developing Deferrable Operators for the Amazon Provider Package to help users realize the potential cost savings these operators provide, and to promote their usage.
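The cost saving rests on one idea: while a deferrable operator waits, it hands the wait off to a trigger running on a shared asyncio event loop, freeing its worker slot. A toy sketch of that multiplexing (not the actual AWS provider or triggerer code):

```python
import asyncio


async def wait_for_external_event(name: str, delay: float) -> str:
    # Stand-in for a trigger that awaits an external system (e.g. an AWS
    # job finishing) asynchronously instead of blocking a worker process.
    await asyncio.sleep(delay)
    return name


async def triggerer(waits: list[tuple[str, float]]) -> list[str]:
    # One event loop multiplexes every deferred wait concurrently, which
    # is why deferred tasks do not each occupy a worker slot while idle.
    return await asyncio.gather(
        *(wait_for_external_event(name, delay) for name, delay in waits)
    )
```

A single triggerer process can hold thousands of such coroutines, whereas a non-deferrable sensor would tie up one worker slot per wait.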

Delay Modeling and DAG Connectivity: Optimizing Airflow performance in large organizations

2023-07-01
session

In large organizations, data workflows can be complex and interconnected, with multiple dependencies and varied runtime requirements. To ensure efficient and timely execution of workflows, it is important to understand the factors that affect the performance of the system, such as network congestion, resource availability, and DAG structure. In this talk, we will explore how delay modeling and DAG connectivity analysis can be used to optimize Airflow performance in large organizations. We will present a network analysis of an airflow instance with multiple interconnected DAGs, and demonstrate how delay modeling can be used to estimate maximum delay and identify bottlenecks in the system. We will also discuss how the delay model can be used to optimize runtime and improve overall system performance.

Demystifying Apache Airflow: Separating facts from fiction

2023-07-01
session
Uma Ramadoss , Shubham Mehta (AWS Analytics)

Apache Airflow is a popular workflow platform, but it often faces critiques that may not paint the whole picture. In this talk, we will unpack the critiques of Apache Airflow and provide a balanced analysis. We will highlight the areas where these critiques correctly point out Airflow’s weaknesses, debunk common myths, and showcase where competitors like Dagster and Prefect are excelling. By understanding the pros and cons of Apache Airflow, attendees will be better equipped to make informed decisions about whether Airflow is the right choice for their use cases. This talk will provide a comprehensive and objective assessment of Apache Airflow and its place in the workflow management ecosystem. Notes: What Critics Get Right About Airflow’s Weaknesses; Debunking Myths and Misconceptions About Airflow; Competitor Analysis; Real-World Use Cases: When Airflow Shines; Making Informed Decisions: Choosing the Right Workflow Platform.

Eat, Sleep, Test, Repeat: How King ensures always-on data

2023-07-01
session

At King, data is fundamental in helping us deliver the best possible experiences for the players of our games while continually bringing them new, innovative, and evolving gameplay features. Data has to be “always-on”: downtime and accuracy are treated with the same level of diligence as any of our games, and success is measured against internal SLAs. How is King using ‘data reliability engineering as code’ tools such as Soda Core within Airflow pipelines to detect, diagnose, and inform about data issues, creating coverage, improving quality and accuracy, and helping eliminate data downtime?

Elevating Data Quality: Great Expectations and Airflow at PepsiCo

2023-07-01
session

Discover PepsiCo’s dynamic data quality strategy in a multi-cloud landscape. Join me, the Director of Data Engineering, as I unveil our Airflow utilization, custom operator integration, and the power of Great Expectations. Learn how we’ve harmonized Data Mesh into our decentralized development for seamless data integration. Explore our journey to maintain quality and enhance data as a strategic asset at PepsiCo.

Empowering Collaborative Data Workflows with Airflow and Cloud Services

2023-07-01
session

Productive cross-team collaboration between data engineers and analysts is the goal of all data teams; however, fulfilling that mission can be challenging given the diverse set of skills each group brings. In this talk, we present an example of how one team tackled this by creating a flexible, dynamic, and extensible framework using Airflow and cloud services that allowed engineers and analysts to jointly create data-centric micro-services to serve up projections and other robust analyses for use in the organization. The framework, which utilized dynamic DAG generation configured via YAML files, Kubernetes jobs, and dbt transformations, abstracted away many of the details of workflow orchestration, allowing analysts to focus on their Python or R code and data processing logic while enabling data engineers to monitor the pipelines and ensure their scalability.
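The dynamic-generation pattern can be sketched in plain Python. Here a parsed config (shown as a dict; per the talk, the real framework reads YAML files) is expanded into ordered task specs. All names and fields are illustrative assumptions, not the team's actual schema:

```python
def build_task_specs(config: dict) -> list[dict]:
    """Expand one pipeline config into an ordered list of task specs.

    In a real framework each spec would become an Airflow task (e.g. a
    Kubernetes job running analyst code, or a dbt transformation); here
    we just produce plain dicts chained in declaration order.
    """
    specs = []
    previous = None
    for step in config["steps"]:
        spec = {
            "task_id": f"{config['pipeline']}__{step['name']}",
            "image": step.get("image", "default-runner:latest"),
            "upstream": previous,  # linear dependency on the prior step
        }
        specs.append(spec)
        previous = spec["task_id"]
    return specs
```

An analyst only edits the YAML (pipeline name, step names, container images); the factory code that turns specs into actual DAG objects is owned by the data engineers.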

Enabling Data Mesh by Moving from a Monolithic Airflow to Several Smaller Environments

2023-07-01
session

Kiwi.com started using Airflow in June 2016 as an orchestrator for several people in the company. The need for the tool grew, and the monolithic instance came to be used by 30+ teams with 500+ active DAGs, resulting in 3.5 million tasks/month successfully finished. At first, a monolithic Airflow environment served us well, but our needs quickly changed as we wanted to support a data mesh architecture within Kiwi.com. By leveraging Astronomer on GCP, we were able to move from a monolithic Airflow environment to many smaller Airflow instances. This talk will go into how to handle things like DAG dependencies, observability, and stakeholder management. Furthermore, we’ll talk about security, particularly how GCP’s workload identity helped us achieve a passwordless Airflow experience.

Event-based DAG Parsing: No more F5ing in the UI

2023-07-01
session

Have you ever added a DAG file and had no clue what happened to it? You’re not alone! With default settings, Airflow can wait up to 5 minutes before processing new DAG files. In this talk, I’ll discuss the implementation of an event-based DAG parser that immediately processes changes in the DAGs folder, so changes show up in the Airflow UI right away. In this talk I will cover: a demonstration of the event-based DAG parser and the fast Airflow UI experience; how the current DAG parser implementation and configuration work; and how an event-based DAG parser is implemented.
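For context, the "up to 5 minutes" figure comes from the scheduler's default polling settings in airflow.cfg (Airflow 2.x defaults shown; exact names can vary by version):

```ini
[scheduler]
# How often (seconds) the DAGs folder is scanned for *new* files -- 5 minutes by default
dag_dir_list_interval = 300
# Minimum seconds between re-parses of an already-known DAG file
min_file_process_interval = 30
```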

Featured Panels

2023-07-01
session