talk-data.com

Event

Airflow Summit 2022

2022-07-01 Airflow Summit

Activities tracked

56

Airflow Summit 2022 program

Sessions & talks


Airflow and _____: A discussion around utilizing Airflow with other data tools

2022-07-01
session

Come hang with Airflow practitioners from around the world using Airflow AND other data tools to power their data practice. From Databricks to Glue to Azure Data Factory, smart businesses make the right decision to standardize on Airflow for what it’s best at while using the other systems for what they are best at.

Airflow at high scale for Autonomous Driving

2022-07-01
session

This talk highlights a large-scale use case of Airflow to orchestrate workflows for an Autonomous Driving project based in Germany. To support our customer's aim of producing their first Level-3 autonomous driving vehicle in Germany, we are using Airflow as a state-of-the-art tool to orchestrate workloads running on a large-scale HPC platform. We will describe our Airflow setup, deployed on OpenShift, which is capable of running thousands of tasks in parallel and contains various custom improvements optimised for our use case. We will discuss in detail how we integrated multiple components with Airflow, such as a PostgreSQL database, a highly available RabbitMQ message broker, and a fully integrated IAM solution. In particular, we will describe the bottlenecks we encountered on our journey towards scaling up Airflow, and how we mitigated them. We will also detail the custom improvements we made to both the Airflow code base and our deployment setup, such as an enhanced Airflow logging framework, improvements to the Spark submit operator, and a feature to re-deploy Airflow without business downtime. The talk concludes with an outlook on future improvements we anticipate, as well as features we believe will benefit the community.

Airflow at Pinterest

2022-07-01
session
Dinghang Yu (Pinterest) , Yulei Li (Pinterest) , Ace Haidrey (Pinterest)

Pinterest has been part of the Airflow community for two years and has built many custom solutions to address usability, scalability, and efficiency constraints. This session discusses how Pinterest has expanded on those earlier solutions. We will cover how we further reduced system latencies, improved the user development experience through added search features and support for cross-cluster operations, built improved debuggability tooling, and made system-level efficiency improvements such as automatically retrying failed tasks that meet certain criteria.

Airflow at Shopify: Keeping Users Happy while Running Airflow at Scale

2022-07-01
session

Two years after starting our Airflow adoption, we’re running over 10,000 DAGs in production. On this journey we’ve learned a lot about Airflow management and stewardship and developed some unique tools to help us scale. We’re excited to share our experience and some of the lessons we’ve picked up along the way. In this talk we’ll cover:
- The history of Airflow at Shopify
- Our infrastructure and architecture
- Custom tools and procedures we’ve adopted to keep Airflow running smoothly and our users happy

Airflow extensions for governing a self-serviced data mesh

2022-07-01
session

While many companies set up isolated data teams, Adyen is a strong believer in the data mesh approach, with all our data living in a central place. While our tooling teams provide and operate the on-premise cluster, the product teams take full ownership of their data pipelines. Our 100+ users, spread across 10+ teams, own more than 200 DAGs and 4,000 tasks in total. We use a single Airflow instance with many cross-DAG and cross-stream dependencies within these 200 DAGs. As it’s impossible to keep track of all 4,000 tasks as a single entity, these pipelines can only be managed by the teams themselves. In this presentation we will take a deep dive into how we structured DAG-level ACLs using a governance library. We will then showcase how we enabled our product teams to manage their tables with automated retention-based removals, automated compaction, and table schema setup. Finally, we will show how letting teams manage the resources of their own pipelines helped us iterate and saved us patching work.

Airflow in the Cloud: Lessons from the Field

2022-07-01
session

Airflow users love to run Airflow in public clouds and on distributed infrastructure like Kubernetes. Running Airflow environments is easier than ever: the community offers a Helm-based installation for self-managed Airflow, and there are many Airflow-based managed services. The commoditization of Airflow and its broader user base bring new challenges. This talk presents the observations of an Airflow service provider delivering “Airflow as a Service” to cloud users (very technical, less technical, and not technical at all). It is directed at Apache Airflow committers and contributors, in the hope of influencing Airflow’s future roadmap so that Apache Airflow becomes easy to use.

Airflow / Kubernetes: Running on and using k8s

2022-07-01
session

Apache Airflow and Kubernetes work well together. Not only does Airflow have native support for running tasks on Kubernetes, there is also an official Helm chart that makes it easy to run Airflow itself on Kubernetes! Confused about the differences between the KubernetesExecutor and the KubernetesPodOperator? What about the CeleryKubernetesExecutor? Or the new LocalKubernetesExecutor? After this talk you will understand how they all fit into the ecosystem. We will talk about the ways you can run Airflow on Kubernetes, run tasks on Kubernetes, or do both. We will also cover things you may want to do to keep your Airflow instance reliable.

Airflow & Zeppelin: Better together

2022-07-01
session

Airflow is almost the de facto standard job orchestration tool for the production stage. But moving a job from the development stage in other tools to the production stage in Airflow is usually a big pain for many users. A major reason is the inconsistency between the development environment and the production environment. Apache Zeppelin is a web-based notebook that integrates seamlessly with many popular big data engines, such as Spark, Flink, Hive, and Presto, so it is very well suited to the development stage. In this talk, I will cover the seamless integration between Airflow and Zeppelin, so that you can develop your big data jobs efficiently in Zeppelin and move them to Airflow easily, without worrying too much about issues caused by environment inconsistency.

All About Deferrables

2022-07-01
session

Airflow 2.2 introduced Deferrable Tasks (sometimes called “async operators”), a new mechanism to efficiently run tasks that depend on external activity. But when should you use them, how do they work, what do you need to do to make Operators support them, and what else could we do in Airflow with this model?
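
The efficiency win behind deferrable tasks can be sketched in plain Python with asyncio. This is a conceptual model only, not the real Airflow API: the `DateTimeTrigger` name is loosely borrowed from Airflow's trigger classes, and the "triggerer" below is simply one event loop multiplexing many waits, instead of one blocked worker slot per waiting task.

```python
import asyncio

class DateTimeTrigger:
    """Stands in for waiting on an external event (a time, a file, an API)."""
    def __init__(self, delay: float):
        self.delay = delay

    async def run(self):
        # While this coroutine sleeps, no worker slot is consumed.
        await asyncio.sleep(self.delay)
        return "trigger fired"

async def triggerer(triggers):
    # A single event loop multiplexes many waiting tasks cheaply.
    return await asyncio.gather(*(t.run() for t in triggers))

results = asyncio.run(triggerer([DateTimeTrigger(0.01) for _ in range(100)]))
print(len(results))  # 100 waits handled concurrently by one loop
```

In real Airflow, an operator defers by raising `TaskDeferred` with a trigger instance; the separate triggerer process awaits it and resumes the task when the condition fires.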

A look under the hood of the Airflow logging subsystem

2022-07-01
session

The task logging subsystem is one of the most flexible, yet complex and misunderstood, components of Airflow. In this talk, we will look at the various task log handlers that ship with the core Airflow distribution, dig a bit deeper into the interfaces they implement, and discuss how those interfaces can be used to roll your own logging implementation.
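
As a plain-Python illustration of the kind of interface involved, here is a minimal custom handler built only on the standard library's `logging` module. Airflow's real task handlers add hooks on top of this (such as `set_context` and a `read` method used by the webserver); the class below is a hypothetical stand-in showing the `logging.Handler` core they extend.

```python
import logging

class InMemoryTaskHandler(logging.Handler):
    """Buffers formatted records so they can be served back later,
    analogous to how Airflow handlers expose task logs to the UI."""
    def __init__(self):
        super().__init__()
        self.buffer = []

    def emit(self, record: logging.LogRecord) -> None:
        # Called once per log record; we store the formatted line.
        self.buffer.append(self.format(record))

    def read(self) -> str:
        return "\n".join(self.buffer)

logger = logging.getLogger("airflow.task.demo")
handler = InMemoryTaskHandler()
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("task started")
logger.info("task finished")
print(handler.read())  # INFO - task started / INFO - task finished
```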

An Introduction to Data Lineage with Airflow and Marquez

2022-07-01
session

This workshop is sold out.

Data lineage might seem like a complicated and unapproachable topic, but that’s only because data pipelines are complicated. The core concept is straightforward: trace and record the journey of datasets as they travel through a data pipeline. Marquez, a lineage metadata server, is a simple thing designed to watch complex things. It tracks the movement of data through complex pipelines using a straightforward, clear object model of Jobs, Datasets, and Runs. The information it gathers can help you understand, communicate, and solve problems more effectively. The interactive UI lets you see exactly where inefficiencies have developed or datasets have become compromised. In this workshop, you will learn how to collect and visualize lineage from a basic Airflow pipeline using Marquez. You will need to understand the basics of Airflow, but no experience with lineage is required.

Automatic Speech Recognition at Scale Using Tensorflow, Kubernetes and Airflow

2022-07-01
session

Automatic Speech Recognition is quite a compute-intensive task that depends on complex deep learning models. To do this at scale, we leveraged the power of TensorFlow, Kubernetes, and Airflow. In this session, you will learn about our journey tackling this problem, the main challenges, and how Airflow made it possible to create a solution that is powerful, yet simple and flexible.

Automating Airflow Backfills with Marquez

2022-07-01
session
Willy Lulciuc (WeWork)

As a data engineer, backfilling data is an important part of your day-to-day work. But backfilling interdependent DAGs is time-consuming and often an unpleasant experience. For example, let’s say you were tasked with backfilling a few months’ worth of data. You’re given the start and end dates for the backfill, which will be used to run an ad-hoc backfilling script you painstakingly crafted locally on your machine. As you sip your morning coffee, you kick off the backfilling script, hoping it’ll work, and think to yourself: there must be a better way. Yes, there is, and collecting DAG lineage metadata would be a great start! In this talk, Willy Lulciuc will briefly introduce how backfills are handled in Airflow, then discuss how DAG lineage metadata stored in Marquez can be used to automate backfilling DAGs with complex upstream and downstream dependencies.
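
At its core, a backfill over a date range boils down to enumerating the logical run dates between the start and end dates, the bookkeeping any ad-hoc backfill script (or `airflow dags backfill` with `--start-date`/`--end-date`) has to get right. A minimal sketch, assuming a daily schedule; the function name is illustrative, not part of any real API.

```python
from datetime import date, timedelta

def backfill_dates(start: date, end: date) -> list[date]:
    """Inclusive list of daily logical dates between start and end."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]

# A "few months' worth of data": every daily run from January through March.
runs = backfill_dates(date(2022, 1, 1), date(2022, 3, 31))
print(len(runs))  # 90 runs: 31 (Jan) + 28 (Feb) + 31 (Mar)
```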

Beyond Testing: How to Build Circuit Breakers with Airflow

2022-07-01
session

Testing is an important part of the DataOps life cycle, giving teams confidence in the integrity of their data as it moves downstream to production systems. But what happens when testing doesn’t catch all of your bad data and “unknown unknown” data quality issues fall through the cracks? Fortunately, data engineers can apply a thing or two from DevOps best practices to tackle data quality at scale with circuit breakers, a novel approach to stopping bad data from actually entering your pipelines in the first place. In this talk, Prateek Chawla, Founding Team Member and Technical Lead at Monte Carlo, will discuss what circuit breakers are, how to integrate them with your Airflow DAGs, and what this looks like in practice. Time permitting, Prateek will also walk through how to build and automate Airflow circuit breakers across multiple cascading pipelines with Python and other common tools.
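
A circuit breaker is ultimately a predicate that a batch of data must pass before it is allowed to flow downstream. The sketch below is a minimal, hypothetical version (the thresholds and names are illustrative, not Monte Carlo's actual API); in an Airflow DAG, such a callable could back a `ShortCircuitOperator` so that downstream tasks are skipped when the circuit opens.

```python
def circuit_breaker(row_count: int, null_fraction: float,
                    min_rows: int = 1000, max_nulls: float = 0.05) -> bool:
    """True -> data may flow downstream; False -> open the circuit."""
    if row_count < min_rows:
        return False  # volume anomaly: upstream likely broke or under-delivered
    if null_fraction > max_nulls:
        return False  # quality anomaly: too many missing values
    return True

print(circuit_breaker(50_000, 0.01))  # True: healthy batch proceeds
print(circuit_breaker(120, 0.01))     # False: suspiciously small batch is stopped
```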

Choosing Apache Airflow over other Proprietary Tools for your Orchestration needs

2022-07-01
session
Parnab Basak (Amazon Web Services)

Organizations need to effectively manage large volumes of complex, business-critical workloads across multiple applications and platforms. Choosing the right workflow orchestration tool is important, as it can help teams automate the configuration, coordination, integration, and data management processes across several applications and systems. There are currently many tools (both open source and proprietary) for orchestrating tasks and data workflows with automation features, each claiming to deliver centralized, repeatable, reproducible, and efficient workflow coordination. Choosing among them is an arduous task, as it requires an in-depth understanding of how the capabilities these tools offer translate to your specific engineering needs. Apache Airflow is a powerful and widely used open-source workflow management system (WMS) designed to programmatically author, schedule, orchestrate, and monitor data pipelines and workflows. In this talk, learn how Apache Airflow compares with other popular orchestration tools in terms of architecture, scalability, management, observability, automation, native features, cost, available integrations, and more. Get a head-to-head comparison as we dissect the capabilities of each tool against the others. This comparative analysis will help you in your decision-making process, whether you are planning to migrate an existing system or evaluating your first enterprise orchestration platform.

Data Lineage with Apache Airflow and Apache Spark

2022-07-01
session

Data within today’s organizations has become increasingly distributed and heterogeneous. It can’t be contained within a single brain, a single team, or a single platform…but it still needs to be comprehensible, especially when something unexpected happens. Data lineage can help by tracing the relationships between datasets and providing a cohesive graph that places them in context. OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow and Apache Spark. In this session, Michael Collado from Datakin will show how to trace data lineage and useful operational metadata in Apache Spark and Airflow pipelines, and talk about how OpenLineage fits in the context of data pipeline operations and provides insight into the larger data ecosystem.

Data Science Platform at PlayStation and Apache Airflow

2022-07-01
session

In this talk, we explain how Apache Airflow sits at the center of our Kubernetes-based Data Science Platform at PlayStation. We talk about how we built a flexible development environment for data scientists to interact with Apache Airflow, and explain the tools and processes we built to help data scientists promote their DAGs from development to production. We will also talk about the impact of containerization, the usage of the KubernetesPodOperator and the new SparkKubernetesOperator, and the benefits of deploying Airflow on Kubernetes using the KubernetesExecutor across multiple environments.

Dynamic Dags -- The New Horizon

2022-07-01
session

In Airflow 2.3 the ability to change the number of tasks dynamically opens up some exciting new ways of building DAGs and lets us create new patterns that just weren’t possible before. In this session I will cover a little bit about AIP-42 and the interface for Dynamic Task Mapping, and cover some common use cases and patterns.
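
The fan-out that AIP-42 enables can be modeled in plain Python. The `@task`/`expand` names below mimic the shape of Airflow's Dynamic Task Mapping API but are a toy model, not the real thing; the point is that the number of "task instances" is decided by runtime data rather than fixed when the DAG file is parsed.

```python
def task(fn):
    """Toy decorator: wraps a function so it can be expanded over inputs."""
    class Mapped:
        def expand(self, args):
            # One "task instance" per input element, resolved at runtime --
            # the count is no longer fixed at DAG-parse time.
            return [fn(arg) for arg in args]
    return Mapped()

@task
def process(filename: str) -> str:
    return filename.upper()

# An upstream task's output, discovered at runtime, drives the expansion:
files = ["a.csv", "b.csv", "c.csv"]
print(process.expand(files))  # ['A.CSV', 'B.CSV', 'C.CSV']
```

In real Airflow the call is keyword-based (e.g. `process.expand(filename=files)`) and the scheduler, not your code, materializes the mapped task instances.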

Future of the Airflow UI

2022-07-01
session

Sneak peek at the future of the Airflow UI. In Airflow 2.3, with the Tree -> Grid view changes, we began to swap out parts of the Flask app with React. This was one step towards AIP-38, building a fully modern UI for Airflow. Come check out what is in store after Grid view in the current UI, and discuss the possibilities of rethinking Airflow with a brand new UI down the line, such as:
- Integrating all DAG visualizations into each other and removing constant page reloads
- More live data
- Greater cross-DAG visualizations (i.e. the DAG Dependencies view from 2.1)
- Improved user settings (dark mode, color-blind support, language, date format)
- And more!

git push your data stack with Airbyte, Airflow and dbt

2022-07-01
session
Evan Tahler (Airbyte) , Marcos Marx (Airbyte)

The use of version control and continuous deployment in a data pipeline is one of the biggest features unlocked by the modern data stack. In this talk, I’ll demonstrate how to use Airbyte to pull data into your data warehouse, dbt to generate insights from your data, and Airflow to orchestrate every step of the pipeline. The complete project will be managed by version control and continuously deployed via GitHub. This talk will show how to achieve a more secure, scalable, and manageable workflow for your data projects.

Happy DAGs + Happy Teammates: How a little CI/CD can go a long way

2022-07-01
session

With a small amount of Cloud Build automation and the use of GitHub version control, your Airflow DAGs will always be tested and in sync no matter who is working on them. Leah will walk you through a sample CI/CD workflow for keeping your Airflow DAGs tested and in sync between environments and teammates.
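
A minimal sketch of the first gate such a CI/CD workflow typically adds: failing the build when a DAG file does not even parse. Real pipelines usually go further (loading a `DagBag` to catch import errors and cycles), but a syntax pass like this needs no Airflow installation; all names below are illustrative.

```python
import ast
import pathlib
import tempfile

def broken_dag_files(dags_dir: str) -> list[str]:
    """Return the paths of .py files under dags_dir that fail to parse."""
    broken = []
    for path in pathlib.Path(dags_dir).rglob("*.py"):
        try:
            ast.parse(path.read_text(), filename=str(path))
        except SyntaxError:
            broken.append(str(path))
    return broken

# Demo: one valid file and one with a syntax error.
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "good_dag.py").write_text("x = 1\n")
    (pathlib.Path(d) / "bad_dag.py").write_text("def oops(:\n")
    names = [pathlib.Path(p).name for p in broken_dag_files(d)]
print(names)  # ['bad_dag.py']
```

In a CI step, a non-empty result would simply exit non-zero and block the merge.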

Hey maintainer (and user), exercise your empathy!

2022-07-01
session

This talk is a walk through a number of ways maintainers of open-source projects (for example, Airflow) can improve communication with their users by exercising empathy. This subject is often overlooked in the curriculum of the average developer and contributor, but it can make or break the product you develop, simply because empathy makes it more approachable for users. Maintainers often forget, or simply do not realize, how many assumptions they carry in their heads. There are a number of techniques maintainers can use to improve this. This talk will walk through examples (from Airflow and other projects), the reasoning behind them, and ways communication between maintainers and users can be improved - in code, documentation, and conversation - but also by involving and engaging the users themselves; more often than not, users can be of great help with communication, if only they are asked. This talk is for both maintainers and users, as I consider communication between them a two-way street.

How DAG Became a Test - Airflow System Tests Redefined

2022-07-01
session

Nothing is perfect, but that doesn’t mean we shouldn’t seek perfection. After some time spent with Airflow system tests, we recognized numerous places where we could make significant improvements, and we decided to redesign them. The new design started with establishing goals. Tests need to: be easy to write, read, run, and maintain; be as close as possible to how Airflow runs in practice; be fast, reliable, and verifiable; and assure the high quality of Airflow Operators. With these principles in mind, we prepared an Airflow Improvement Proposal (AIP-47) and, once it was approved, started the implementation. The results of our work were better than we expected when we started this initiative. This session will walk you through the story of how we struggled to run system tests before, how we came up with the improvements, and how we turned them into a working solution.

How to Achieve Reliable Data in your Airflow Pipelines with Databand

2022-07-01
session
Josh Benamram (Databand.ai)

Have data quality issues? What about reliability problems? You may be hearing a lot of these terms, along with many others, that describe issues you face with your data. What’s the difference, which are you suffering from, and how do you tackle both? Knowing that your Airflow DAGs are green is not enough. It’s time to focus on data reliability and quality measurements to build trust in your data platform. Join this session to learn how Databand’s proactive data observability platform makes it easy to achieve trusted data in your pipelines. In this session we’ll cover:
- The core differences between data reliability and data quality
- What role the data platform team, analysts, and scientists play in guaranteeing data quality
- A hands-on demo setting up Airflow reliability tracking

How to Deploy Airflow From Dev to Prod Like A BOSS

2022-07-01
session

Managing Airflow in large-scale environments is tough. You know this, and I know this. But what if you had a guide to make the development, testing, and production lifecycles more manageable? In this presentation, I will share how we manage Airflow for large-scale environments with friendly deployments at every step. After attending the session, Airflow engineers will:
- Understand the advantages of each kind of deployment
- Know the differences between deployment types and Airflow executors
- Learn how to incorporate all kinds of deployments into their day-to-day needs