talk-data.com

Event

Airflow Summit 2023

2023-07-01 · Airflow Summit

Activities tracked

90

Airflow Summit 2023 program

Sessions & talks


Accelerating Data Delivery: How the FT automated its ETL pipelines with Airflow

2023-07-01
session

Inside the Financial Times, we’ve been gradually moving our batch data processing from a custom solution to Airflow. To enable teams across the company to use Airflow more effectively, we’ve been extending the system’s self-service capabilities, including giving teams ownership of their DAGs and separating resources such as connections. The batch data ingestion processes are the main ETL-like jobs we run on Airflow. Creating a new job used to be a manual, repetitive task: receiving the data specification, creating the requisite tables in our data warehouse, and writing the DAG that would move the data there. Airflow allowed us to automate this process to a degree that surprised us, completely removing the need to write DAG code. We will use this talk to describe what the current process of creating a new ETL workflow looks like and our plans for further improvements.

A Hypervisor for Airflow, presented by Astronomer

2023-07-01
session
Vikram Koka (Astronomer), Viraj Parekh

Over the last few years, we’ve spent countless hours talking to data engineers everywhere from Fortune 500s to seed-stage startups. In doing so, we’ve learned what it takes to deliver a world-class Airflow service. We’ve packaged all of that up into the Astro Hypervisor, a new part of our platform that gives users a whole new level of control in Airflow. We’ll talk through how we built this hypervisor and how our customers will be able to use it for autoscaling, tracking the health of Airflow environments, and much more.

AI/ML is Changing Orchestration: How Kubernetes can accelerate Airflow

2023-07-01
session

It should be no surprise to the Airflow community that the hype around generative large language models (LLMs) and their wildly inventive chat front ends has brought significant attention to growing these models and feeding them a steady diet of data. For many communities in the infrastructure, orchestration, and data landscape, this is an opportunity to think big, help our users scale, and make the right foundational investments to sustain that growth over the long term. In this keynote I’ll talk about my own community, Kubernetes, and how we’re using the surge in AI/ML excitement to address long-standing gaps and unlock new capabilities, not just for the workloads using GPUs and the platform teams supporting them, but also for Airflow users and other key automators of workflows. We’re all in this together, and the future of orchestration is moving mountains of data at the speed of light!

Airflow as a Data Hybrid Cloud Orchestrator

2023-07-01
session

Apache Airflow is scalable, dynamic, extensible, and elegant. Can it be a lot more? We have taken Airflow to the next level, using it as a hybrid cloud data service to accelerate our transformation. In this talk we will present the implementation of Airflow as an orchestration solution spanning legacy, private, and public cloud (AWS/Azure): a comparison of public and private offerings; harnessing the power of a hybrid cloud orchestrator to meet regulatory requirements (European financial institutions); and real production use cases.

Airflow at Asurion: Simplified orchestration at petabyte scale

2023-07-01
session

Workload orchestration is at the heart of a successful data lakehouse implementation, especially for the “house” part: the data warehouse workloads, which are often complex because warehouse data carries intricate dependency chains that must be orchestrated. At Asurion we have spent years perfecting our Airflow solution to make it a superpower for our data engineers. We have innovated in key areas such as a single operator for all use cases, automatic DAG code generation, custom UI components for data engineers, and monitoring tools. With a few million job runs per year on a platform with over three nines of availability, we have condensed years of learnings into valuable ideas that can inspire and help other data enthusiasts. This session walks the audience through blind spots and pain points across Airflow architecture, scaling, and engineering culture.

Airflow at Bloomberg: Leveraging dynamic DAGs for data ingestion

2023-07-01
session

Bloomberg’s Data Platform Engineering team powers some of the most valuable business and financial data on which Bloomberg clients rely. We recently built a configuration-driven system that allows non-engineers to onboard alternative datasets into the company’s ecosystem. This system uses Apache Airflow to orchestrate the data flow across different applications and Bloomberg Terminal functions. We run over 1,500 dynamic DAGs, each tailored to a dataset’s needs, a scale very few Airflow users operate at. In this talk, we will review our high-level Airflow architecture, how we leverage dynamic DAGs in our ETL pipeline, and some of the challenges we faced.
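The configuration-driven pattern described above can be sketched roughly as follows. The dataset names, config fields, and create_dag helper are hypothetical illustrations, not Bloomberg’s actual code; in a real Airflow deployment each entry would build an airflow.DAG object and register it in the module’s globals() so the scheduler discovers it.

```python
# Hypothetical sketch of config-driven dynamic DAG generation: one pipeline
# definition is produced per dataset config entry. In real Airflow code,
# create_dag would return an airflow.DAG and each result would be assigned
# into globals() for the scheduler to pick up.

DATASET_CONFIGS = [
    {"name": "alt_data_a", "schedule": "@daily", "source": "s3"},
    {"name": "alt_data_b", "schedule": "@hourly", "source": "sftp"},
]

def create_dag(cfg):
    """Build one ETL pipeline definition from a dataset config entry."""
    return {
        "dag_id": f"ingest_{cfg['name']}",
        "schedule": cfg["schedule"],
        "tasks": ["extract", "transform", "load"],  # same shape for every dataset
    }

# Scaling to 1,500+ datasets means 1,500+ config entries, not 1,500 DAG files.
generated_dags = {d["dag_id"]: d for d in map(create_dag, DATASET_CONFIGS)}
```

The design choice here is that non-engineers only ever touch the config entries; the DAG-building code stays centralized and reviewed once.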

Airflow at Coinbase: How we supercharged the productivity of our users

2023-07-01
session

At Coinbase, Airflow is adopted by a wide range of applications and used by nearly all engineering and data science teams. In this session, we will share our journey improving the productivity of Airflow users at Coinbase. The presentation focuses on three topics. Monorepo-based architecture: our approach of using a monorepo to simplify DAG development and enable developers across the company to work more efficiently and collaboratively. Tailored testing environments: Airflow testing environments that cater to users of different profiles, helping them test their code more efficiently and with greater confidence. AirAgent: our in-house solution for Airflow continuous deployment, which puts Airflow deployment in self-driving mode and supports deploying any Airflow-related code change (DAGs, plugins, configurations, dependency changes, etc.) without downtime.

Airflow at Delivery Hero: Running a data mesh with ~500 Airflow instances

2023-07-01
session

Ever wondered how Airflow could play a pivotal role in a data mesh architecture, hosting thousands of DAGs and hundreds of thousands of daily task runs? Let’s find out! Delivery Hero delivers food in 70 countries across 12 different brands and platforms, with thousands of engineers, analysts, and data scientists spread across many countries running analytics and ML services on the data from all these delivered orders. Serving the workflow orchestration needs of such a massive group is a challenge. This is where Airflow and data mesh come to the rescue: we run more than 500 Airflow instances to empower different teams to own and curate their data products. This presentation will explain how to efficiently set up and monitor Airflow at massive scale, cover a new feature for launching dynamic Airflow staging and development environments dedicated to each developer, and demo our “Workspace” concept for multi-tenancy management.

Airflow at Faire: Democratizing ML feature store framework at scale

2023-07-01
session
Victoria Varney (Astronomer), Rafay Aleem (Faire)

Data science and machine learning are at the heart of Faire’s industry-celebrated marketplace (an a16z top-ranked marketplace) and drive powerful search, navigation, and risk functions, all powered by ML models trained on 3,000+ features defined by our data scientists. Previously, defining, backfilling, and maintaining the feature lifecycle was error-prone. A framework built on top of Airflow has empowered data scientists to maintain and deploy their changes independently. We will explore how to leverage Airflow as a tool that can power ML training and extend it with a framework that powers a feature store, and how to enable data scientists to define new features and backfill them (a common problem in the ML world) using dynamic DAGs. The talk will provide valuable insights into how Faire constructed a framework that builds datasets to train models, and show how empowering end users with tools isn’t something to fear: it frees up engineering teams to focus on strategic initiatives.

Airflow at GoDaddy: From on-prem to cloud to PaaS

2023-07-01
session

Discover the transformation of Airflow at GoDaddy: from its initial on-prem deployment to its migration to the cloud, and finally to a single-pane orchestration model. This evolution has streamlined our data platform and improved governance, and our experience will benefit anyone seeking to optimize their Airflow implementation and simplify their orchestration processes. Topics include: history and use cases; design, organizational decisions, and governance; migrating Airflow from on-premises to the cloud; the data processing engines we use with Airflow; the obstacles faced during and after migration and how they were overcome; integrating Airflow with a central Glue Catalog and data lake mesh model; and the benefits of a single-pane orchestration (PaaS) model with custom reusable GitHub Actions and monitoring.

Airflow at Gojek: Streamlining data processing for Tableau dashboards

2023-07-01
session

With millions of orders per day, Gojek needs a data processing solution that can handle a high volume of data. Airflow is a scalable tool that can handle large data volumes and complex workflows, making it an ideal fit for Gojek’s needs. With Airflow, we create automated data pipelines that extract data from various sources, transform it, and load it into dashboards such as Tableau for analysis and visualization. This eliminates manual data transfers and reduces the risk of errors. Airflow can also be integrated with other systems such as Google BigQuery, allowing us to easily connect to different data sources and transform the data before loading it into dashboards, creating seamless data pipelines. Using Airflow also reduces query volume in Google BigQuery, lowering the overall cost of our data infrastructure.

Airflow at Monzo: Evolving our data platform as the bank scales

2023-07-01
session
Jonathan Rainer, Ed Sparkes (Monzo)

As a bank, Monzo has seen exponential growth in active users, from 1.6 million in 2019 to 5.8 million in 2022. At the same time, the number of data users and analysts has expanded from an initial team of 4 to 132. Alongside this growth, our infrastructure and tooling have had to evolve to deliver the same value at a new scale. From an Airflow installation deployed on a single monolithic instance, we now deploy atop Kubernetes and have integrated our Airflow setup into the bank’s backend systems. This talk charts the story of that expansion and the growing pains we’ve faced, as well as looking to the future of our use of Airflow. We’ll first discuss how data at Monzo works, from event capture to arrival in our data warehouse, before assessing the challenges of our Airflow setup. We’ll then dive into the re-platforming required to meet our growing data needs, and some of the unique challenges that come with serving an ever-growing user base and its need for analysis and insight.

Airflow at Reddit: How we migrated from Airflow 1 to Airflow 2

2023-07-01
session

We would love to speak about our experience upgrading our old Airflow 1 infrastructure to Airflow 2 on Kubernetes, and how we orchestrated the migration of approximately 1,500 DAGs owned by multiple teams in our organization. We hit some interesting challenges along the way and can speak to our solutions. Topics include: our old Airflow 1 infrastructure and why we decided to move to Kubernetes for Airflow 2; the migration paths we considered and why we chose the route we did; things we did to make the migration easier, such as implementing DAG factories (using some neat programmatic approaches to build a great factory interface for our users), a custom solution for DAG dependencies across Airflow instances, and DAG audits (programmatically determining which DAGs were actually still in use to reduce migration load); and problems we faced around DAG ownership, backfilling in Airflow 2, and DAG dependencies on Kubernetes.
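The DAG audit mentioned above can be illustrated with a small sketch. The cutoff, data shapes, and DAG names here are hypothetical, not Reddit’s actual tooling; the idea is simply to compare each DAG’s most recent run against a staleness cutoff and only migrate DAGs still in use.

```python
from datetime import datetime, timedelta

def stale_dags(last_run_by_dag, now, cutoff_days=90):
    """Return DAG ids whose most recent run is older than the cutoff.

    last_run_by_dag maps dag_id -> datetime of its latest run; real audit
    tooling would query this from Airflow's metadata database.
    """
    cutoff = now - timedelta(days=cutoff_days)
    return sorted(dag_id for dag_id, last in last_run_by_dag.items() if last < cutoff)

now = datetime(2023, 7, 1)
history = {
    "daily_report": now - timedelta(days=2),       # active: keep and migrate
    "legacy_backfill": now - timedelta(days=400),  # stale: drop from migration
}
to_drop = stale_dags(history, now)
```

Dropping stale DAGs before a migration shrinks the work to only what teams still depend on.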

Airflow at Salesforce: Building a fully managed workflow orchestration system

2023-07-01
session

In this presentation, we discuss how we built a fully managed workflow orchestration system at Salesforce using Apache Airflow to support dependable data lake infrastructure on the public cloud. We touch on how we utilized Kubernetes for increased scalability and resilience, as well as the most effective approaches for managing and scaling data pipelines. We also talk about how we addressed data security and privacy, multitenancy, and interoperability with other internal systems. We discuss how this system empowers users to effortlessly build reliable pipelines that incorporate failure detection, alerting, and monitoring for deep insights, removing the undifferentiated heavy lifting of running and managing their own orchestration engines. Lastly, we elaborate on how we integrated our in-house CI/CD pipelines to enable effective DAG and dependency management, further enhancing the system’s capabilities.

Airflow at Snap: Managing permissions, migrations and internal tools

2023-07-01
session

We will cover how Snap (the parent company of Snapchat) has been using Airflow since 2016: how we built a secure deployment on GCP that integrates with internal tools for workload authorization, RBAC, and more, and how we made DAG permissions easy for customers to use via k8s workload identity binding and tight UI integration. We will also discuss how we are migrating 2,500+ DAGs from Airflow 1 on Python 2 to Airflow 2 on Python 3 using tools and automation. Code and DAG migration requires a significant time investment, so our team created several tools that convert or rewrite DAGs in the new format. Finally, we will cover some of the other self-service tools we built internally.

Airflow at StyleSeat: Our journey, challenges & results

2023-07-01
session

We will share the case study of Airflow at StyleSeat, where within a year our data grew from 2 million data points per day to 200 million. Our original solution for orchestrating this data was not enough, so we migrated to an Airflow-based solution. In our previous implementation, tasks were orchestrated with hourly triggers on AWS CloudWatch rules in their own log groups. Each task was an individually defined Lambda that executed Python code from a Docker image. As complexity increased, there were frequent downtimes and manual re-executions of failed tasks and their downstream dependencies. With every downtime, our business stakeholders lost trust in the data, and recovery times grew longer. We needed a modern orchestration platform that would enable our team to define and instrument complex pipelines as code, provide visibility into executions, and define retry criteria on failures. Airflow was identified as a critical piece in modernizing our orchestration, which would also help us onboard dbt. We wanted a managed solution and a partner who could guide us to a successful migration.

Airflow at The Home Depot Canada: Observable orchestration platform for data integration and ML

2023-07-01
session

This session shows how we leverage Airflow in a federated way across all our business units to deliver a cost-effective platform that accommodates different patterns of data integration, replication, and ML tasks, with flexible DevOps tuning of DAGs across environments and integration into our open-source observability strategy, giving our SREs consistent metrics, monitoring, and alerting for data tasks. We will share our opinionated DAG setup, including naming and folder-structure conventions, coding expectations such as specific XCom entries to report processed elements, support for state in DAGs that require it, and configurable task capabilities such as the choice of runner for Apache Beam tasks. We will also describe the “DevOps DAGs” we deploy in all our environments to handle platform maintenance and support.

Airflow at Twitch: Our recommendation system starring Airflow

2023-07-01
session

Twitch, the world’s leading live streaming platform, has a massive user base of over 140 million active users and an incredibly complex recommendation system to deliver a personalized and engaging experience to its users. In this talk, we will dive into how Twitch leverages the power of Apache Airflow to manage and orchestrate the training and deployment of its recommendation models. You will learn about the scale of Twitch’s reach and the challenges we faced in building a scalable, reliable, and developer-friendly recommendation system. We will also highlight the custom tooling built internally to make it easier for Twitch’s applied scientists to iterate and develop confidently with Airflow. These customizations have helped Twitch streamline its processes, control costs, improve collaboration between teams, and ensure a seamless experience for internal users of Airflow.

Airflow at UniCredit: Our journey from mainframe scheduling to modern data processing

2023-07-01
session

Representing the Murex Reporting team at UniCredit, we would like to present our journey with Airflow and how, over the past two years, it has enabled us to automate and simplify our batch workflows. Compared to our previous rigid mainframe scheduling approach, we have created a robust and scalable framework complete with a CI/CD process, bringing our time to market for scheduling changes down from 3 days to 1. Basing our solution on DAG networks joined by ResumeDagRunOperators and an array of custom-built plugins (such as static time predecessors), we were able to replicate the scheduling of our overnight ETL processes (approx. 8,000 tasks with many-to-many dependencies) in Airflow, satisfying our bank reporting SLAs without performance regression and gaining massively improved process visibility and control. Our presentation will illustrate this journey and explore some of these customizations, which venture outside Airflow’s core functionalities.

Airflow Driven Data Lineage In Public Cloud

2023-07-01
session
Michał Modras (Google)

The session will cover capabilities of data lineage in Apache Airflow, how to use them, and motivations for it. It will present the technical know-how of integrating data lineage solutions with Apache Airflow, and provisioning DAGs metadata to fuel lineage functionalities in a way transparent to the user, limiting the setup friction. It will include Google’s Cloud Composer lineage integration implemented through the current Airflow’s data lineage architecture, and our approach to the lineage evolution strategy.

Airflow Executors: Past, present and future

2023-07-01
session
Niko Oliveira (Amazon | Apache Airflow Committer)

Executors are a core concept in Apache Airflow and an essential piece of DAG execution. They have seen a lot of investment over the years, and there are many exciting advancements that will benefit both users and contributors. This talk will briefly discuss executors, how they work, and what they are responsible for. It will then describe executor decoupling (AIP-51) and how it has fully unlocked the development of third-party executors. We’ll touch on the migration of “core” executors (such as Celery and Kubernetes) to their own packages, as well as the addition of new third-party executors from providers such as AWS. Finally, we’ll describe and demo hybrid executors, a proposed new feature that allows multiple executors to be used natively and seamlessly side by side within a single Airflow environment, which will be a powerful capability in a future full of new Airflow executors.
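For context, the executor is chosen via Airflow configuration (the `[core] executor` setting in airflow.cfg, or the `AIRFLOW__CORE__EXECUTOR` environment variable), and at the time of this talk only one executor per environment was supported, which is exactly the limitation the proposed hybrid executors feature targets. A typical single-executor configuration looks like:

```ini
[core]
# One executor per Airflow environment (the pre-hybrid model):
executor = CeleryExecutor
```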

Airflow: Under the hood

2023-07-01
session

Making a contribution to or becoming a committer on Airflow can be a daunting task, even for experienced Python developers and Airflow users. The sheer size and complexity of the code base may discourage potential contributors from taking the first steps. To help alleviate this issue, this session is designed to provide a better understanding of how Airflow works and build confidence in getting started. During the session, we will introduce the main components of Airflow, including the Web Server, Scheduler, and Workers. We will also cover key concepts such as DAGs, DAG-run objects, Tasks, and Task Instances. Additionally, we will explain how tasks communicate with each other using XComs, and discuss the frequency of DAG runs based on the schedule. To showcase changes in the state of various objects, we will dive into the code level and continuously share the state of the database at every important checkpoint.
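The XCom mechanism mentioned above can be shown with a toy version. This stand-in store is purely illustrative (real XComs are rows in Airflow’s metadata database, accessed through `ti.xcom_push` / `ti.xcom_pull` on a task instance), but it captures the keyed push/pull contract between tasks.

```python
class MiniXCom:
    """Toy stand-in for Airflow's XCom table: values are keyed by
    (dag_id, task_id, key), and downstream tasks pull by the same key."""

    def __init__(self):
        self._rows = {}

    def push(self, dag_id, task_id, value, key="return_value"):
        self._rows[(dag_id, task_id, key)] = value

    def pull(self, dag_id, task_id, key="return_value"):
        return self._rows.get((dag_id, task_id, key))

xcom = MiniXCom()
# An "extract" task publishes a row count under the default key...
xcom.push("etl_dag", "extract", 1234)
# ...and a downstream "load" task pulls it by dag/task/key.
count = xcom.pull("etl_dag", "extract")
```

Because the store is keyed rather than a direct channel, any task that knows the dag_id/task_id pair can read the value, which is also why XComs are meant for small metadata, not bulk data.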

A New SQLAlchemyCollector and OpenLineageAdapter for Emitting Airflow Lineage Metadata as DAGs Run

2023-07-01
session

Airflow uses SQLAlchemy under the hood but up to this point has not exploited the tool’s capacity to produce detailed metadata about queries, tables, columns, and more. In fact, SQLAlchemy ships with an event listener that, in conjunction with OpenLineage, offers tantalizing possibilities for enhancing the development process – specifically in the areas of monitoring and debugging. SQLAlchemy’s event system features a Session object and ORMExecuteState mapped class that can be used to intercept statement executions and emit OpenLineage RunEvents as executions occur. In this talk, Michael Robinson from the community team at Astronomer will provide an overview and demo of new SQLAlchemyCollector and OpenLineageAdapter classes for leveraging SQLAlchemy’s event system to emit OpenLineage events as DAGs run.
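The event-system mechanics the talk builds on can be shown with a minimal, self-contained sketch (requires SQLAlchemy 1.4+). The table and the shape of the emitted event are illustrative stand-ins, not the actual SQLAlchemyCollector/OpenLineageAdapter classes: a listener on the Session's `do_orm_execute` event receives an `ORMExecuteState` for each statement execution and could emit an OpenLineage RunEvent at that point.

```python
from sqlalchemy import Column, Integer, String, create_engine, event, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class DagRun(Base):
    __tablename__ = "dag_run"
    id = Column(Integer, primary_key=True)
    dag_id = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

lineage_events = []

@event.listens_for(Session, "do_orm_execute")
def emit_lineage(state):
    # state is a SQLAlchemy ORMExecuteState; a real collector would build an
    # OpenLineage RunEvent here and ship it to a lineage backend instead of
    # appending a plain dict.
    lineage_events.append({
        "is_select": state.is_select,
        "sql": str(state.statement),
    })

with Session(engine) as session:
    session.execute(select(DagRun))  # intercepted by the listener above
```

The appeal of this hook is that interception is transparent: application code issues ordinary ORM queries and the listener observes every execution without changes to the queries themselves.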

An Introduction to Airflow Cluster Policies

2023-07-01
session

Cluster Policies are an advanced Airflow feature composed of a set of hooks that allow cluster administrators to implement checks and mutations against certain core Airflow constructs (DAGs, Tasks, Task Instances, Pods). In this talk, we will discuss how cluster administrators can leverage these functions in order to better govern the workloads that are running in their environments.
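As a rough illustration of the hook shapes described in the Airflow docs (functions such as `task_policy` and `dag_policy` placed in `airflow_local_settings.py`), here is a minimal sketch. The stand-in objects below are only for demonstration; real hooks receive BaseOperator and DAG instances, and violations are conventionally raised as `AirflowClusterPolicyViolation` rather than the plain ValueError used here.

```python
from datetime import timedelta
from types import SimpleNamespace

def task_policy(task):
    """Mutation hook: enforce a retry floor and a default execution timeout."""
    if task.retries < 2:
        task.retries = 2
    if task.execution_timeout is None:
        task.execution_timeout = timedelta(hours=4)

def dag_policy(dag):
    """Check hook: reject DAGs that declare no tags (e.g. no team ownership)."""
    if not dag.tags:
        raise ValueError(f"DAG {dag.dag_id!r} must declare at least one tag")

# Stand-ins for a BaseOperator and a DAG, purely for illustration.
task = SimpleNamespace(retries=0, execution_timeout=None)
task_policy(task)  # mutates the task in place

dag = SimpleNamespace(dag_id="untagged_dag", tags=[])
try:
    dag_policy(dag)
    rejected = False
except ValueError:
    rejected = True
```

Because the hooks run centrally at DAG parse time, administrators can enforce such rules across every team’s workloads without touching individual DAG files.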

Apache Airflow and OpenTelemetry

2023-07-01
session

OpenTelemetry is a vendor-neutral, open-source (CNCF) observability framework supported by many vendors industry-wide. It is used for the instrumentation, generation, collection, and export of data within systems, which is then ingested by analytics tools that provide tracing, metrics, and logs. It has long been the plan to adopt the OTel standard within Airflow, allowing builders and users to take advantage of valuable data that can help improve the efficiency, cost, and performance of their systems. Let’s talk about the journey, started a few years ago, to bring this dream to reality.