talk-data.com

Event

Airflow Summit 2020

2020-07-01 Airflow Summit

Activities tracked: 42

Airflow Summit 2020 program

Sessions & talks


How do we reason about the reliability of our data pipeline in Wrike

2020-07-01
session

In this talk we will share some of the lessons we have learned after using Airflow for a couple of years and growing from 2 users to 8 teams. We cover: establishing a reliable review process for Airflow, managing multiple Airflow configurations, and data versioning.

Improving Airflow's user experience

2020-07-01
session
Ry Walker (Astronomer), Maxime Beauchemin (Preset), Viraj Parekh

Astronomer is focused on improving Airflow’s user experience through the entire lifecycle — from authoring and testing DAGs, to building containers and deploying the DAGs, to running and monitoring both the DAGs and the infrastructure they operate within — with an eye towards increased security and governance as well. In this talk we walk you through some current UX challenges, give an overview of how the Astronomer platform addresses the major ones, and provide a sneak peek at what we’re working on in the coming months to improve Airflow’s user experience. This is a sponsored talk, presented by Astronomer.

Keynote: How large companies use Airflow for ML and ETL pipelines

2020-07-01
session
Tao Feng (Lyft | Airflow PMC member), Dan Davydov (Twitter | Airflow PMC member), Kevin Yang (Airbnb | Airflow PMC member)

In this talk, colleagues from Airbnb, Twitter and Lyft share details about how they are using Apache Airflow to power their data pipelines. Slides: Tao Feng (Lyft), Dan Davydov (Twitter).

Keynote: Making Airflow a sustainable project through D&I

2020-07-01
session
Aizhamal Nurmamat kyzy, Griselda Cuevas (Apache Software Foundation)

Gris Cuevas shares some statistics about the state of D&I at the Apache Software Foundation and the initiative the foundation is taking to make projects more diverse and inclusive. Then, Aizhamal shares her own journey of becoming an open source contributor and dives into project-specific initiatives that help Apache Airflow be one of the most sustainable projects in open source.

Machine Learning with Apache Airflow

2020-07-01
session

As the field of data science grows in popularity, companies find themselves in need of a single common language that can connect their data science teams and data infrastructure teams. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports. This talk will discuss how to build an Airflow-based data platform that can take advantage of popular ML tools (Jupyter, Tensorflow, Spark) while creating an easy-to-manage and easy-to-monitor ecosystem for data infrastructure and support teams. In this talk, we will take an idea from a single-machine Jupyter Notebook to a cross-service Spark + Tensorflow pipeline, to a canary-tested, production-ready model served on Google Cloud Functions. We will show how Apache Airflow can connect all layers of a data team to deliver rapid results.
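
The flow described above maps naturally onto a linear DAG. Below is a minimal, hypothetical sketch of such a pipeline using the stock PythonOperator (Airflow 1.10-era import path); the task callables and DAG id are placeholders, not code from the talk.

```python
# Hypothetical sketch (not from the talk) of a notebook-to-production training DAG;
# the task callables are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def preprocess():
    # e.g. pull raw data and materialize features for the trainer
    pass


def train():
    # e.g. kick off a Spark/TensorFlow training job and record the model artifact
    pass


def deploy():
    # e.g. promote the model artifact to the serving layer (Cloud Functions in the talk)
    pass


with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t_train = PythonOperator(task_id="train", python_callable=train)
    t_deploy = PythonOperator(task_id="deploy", python_callable=deploy)

    t_preprocess >> t_train >> t_deploy
```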

Migrating Airflow-based Spark jobs to Kubernetes - the native way

2020-07-01
session
Roi Teveth (Nielsen Identity Engine), Itai Yaffe (Nielsen Identity Engine)

At Nielsen Identity Engine, we use Spark to process tens of terabytes of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day. In this talk, we’ll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.
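
The native integration mentioned here is the SparkKubernetesOperator (and matching sensor) in the cncf.kubernetes provider. A rough usage sketch follows; the namespace, connection id, and SparkApplication manifest are placeholders, not taken from the talk.

```python
# Hypothetical sketch of submitting a SparkApplication CRD from a DAG and waiting for
# it to finish; the manifest path, namespace and connection id are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import (
    SparkKubernetesSensor,
)

with DAG(
    dag_id="spark_on_k8s_example",
    start_date=datetime(2020, 7, 1),
    schedule_interval=None,
) as dag:
    submit = SparkKubernetesOperator(
        task_id="submit_spark_app",
        namespace="spark-jobs",
        application_file="spark_app.yaml",  # SparkApplication manifest for the Spark-on-K8s operator
        kubernetes_conn_id="kubernetes_default",
        do_xcom_push=True,
    )

    wait = SparkKubernetesSensor(
        task_id="wait_for_spark_app",
        namespace="spark-jobs",
        application_name="{{ task_instance.xcom_pull(task_ids='submit_spark_app')['metadata']['name'] }}",
        kubernetes_conn_id="kubernetes_default",
    )

    submit >> wait
```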

Migration to Airflow backport providers

2020-07-01
session

In this talk Anita showcases how to use the newly released Airflow Backport Providers. Some of the topics covered: how to install them in Airflow 1.10.x, how to install them in Composer, how to migrate one or more DAGs from the legacy to the new providers, and known bugs and fixes.
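
As a small illustration (not from the talk), migrating a single import to a backport provider might look roughly like this; the Google provider and BigQuery operator are just one example.

```python
# Hypothetical example of migrating one import; the package and operator names are
# illustrative (here the Google provider).
#
#   pip install apache-airflow-backport-providers-google
#
# Legacy Airflow 1.10 contrib import:
#   from airflow.contrib.operators.bigquery_operator import BigQueryOperator
#
# New provider import, available after installing the backport package:
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator
```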

Pipelines on pipelines: Agile CI/CD workflows for Airflow DAGs

2020-07-01
session
Victor Shafran (Databand)

How do you create fast and painless delivery of new DAGs into production? When running Airflow at scale, it becomes a big challenge to manage the full lifecycle around your pipelines; making sure that DAGs are easy to develop, test, and ship into prod. In this talk, we will cover our suggested approach to building a proper CI/CD cycle that ensures the quality and fast delivery of production pipelines. CI/CD is the practice of delivering software from dev to prod, optimized for fast iteration and quality control. In the data engineering context, DAGs are just another piece of software that require some form of lifecycle management. Traditionally, DAGs have been thought of as relatively static, but the new wave of analytics and machine learning efforts require more agile DAG development, in line with how agile software engineering teams build and ship code. In this session, we will dive into the challenges of building CI/CD cycles for Airflow DAGs. We will focus on a pipeline that involves Apache Spark as an extra dimension of real-world complexity, walking through a typical flow of DAG authoring, debugging, and testing, from local to staging to prod environments. We will offer best practices and discuss open-source tools you can use to easily build your own smooth cycle for Airflow CI/CD.
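
One common building block of such a cycle, shown here as a hedged sketch rather than the speaker's actual setup, is a DAG integrity test that CI runs on every commit, failing the build if any DAG no longer imports cleanly.

```python
# Minimal DAG-integrity checks a CI job could run with pytest; the dags/ path is an
# assumed repository layout, not taken from the talk.
from airflow.models import DagBag


def test_dags_import_without_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"


def test_dags_define_at_least_one_task():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tasks, f"{dag_id} has no tasks"
```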

Production Docker image for Apache Airflow

2020-07-01
session

This talk will guide you through the internals of the official production Docker image for Apache Airflow. It will show you the foreseen use cases for the image and how to use it in conjunction with the official Helm chart to make your own deployments.

Run Airflow DAGs in a secure way

2020-07-01
session

Security is more important than ever, and Airflow installations are no exception. Google Cloud Platform and Cloud Composer offer useful security options for running your DAGs and tasks so that you can effectively manage the risk of data exfiltration and keep access to the system limited. This is a sponsored talk, presented by Google Cloud.

Scheduler as a service - Apache Airflow at EA Digital Platform

2020-07-01
session
Xiaoqin Zhu (Electronic Arts), Nitish Victor, Preethi Ganeshan (Electronic Arts)

In this talk, we share the lessons learned while building a scheduler-as-a-service leveraging Apache Airflow to achieve improved stability and security for one of the largest gaming companies. The platform integrates with different data sources and meets varied SLAs across workflows owned by multiple game studios. In particular, we present a comprehensive self-serve Airflow architecture with multi-tenancy, auto-DAG generation, and SSO integration, with improved ease of deployment. Within Electronic Arts, to provide scheduler-as-a-service and to support hundreds of thousands of execution workflows, each team requires an isolated environment with access to a central data lake containing several petabytes of anonymized player and game metrics. Leveraging Airflow, each team is provided a private code repository and namespace with which they can deploy their DAGs as they see fit. To support agile development cycles, a private testing sandbox and auto-deployment to an isolated multi-tenant Airflow platform have been made available to game studios. In production, a single Dockerized Airflow deployment on Kubernetes is used to ensure high availability and single-step deployment. Custom SSO integration and RBAC-based operator and sensor whitelisting allow for secure logical isolation. In addition, dynamic DAG instantiation capability helps address varied SLAs during game launch seasons that are staggered through the financial year.
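
The auto-DAG generation mentioned above is typically implemented by creating DAG objects in a loop from per-team configuration and registering them in the module namespace. A generic sketch follows; the configuration shape is invented for illustration and is not EA's actual code.

```python
# Generic dynamic-DAG-generation sketch; TEAM_CONFIGS is a made-up stand-in for
# whatever per-tenant configuration the platform actually reads.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

TEAM_CONFIGS = {
    "studio_a": {"schedule": "@hourly"},
    "studio_b": {"schedule": "@daily"},
}

for team, cfg in TEAM_CONFIGS.items():
    dag = DAG(
        dag_id=f"{team}_workflow",
        start_date=datetime(2020, 7, 1),
        schedule_interval=cfg["schedule"],
        catchup=False,
    )
    with dag:
        DummyOperator(task_id="placeholder")
    # Registering the DAG in the module's global namespace lets the scheduler discover it.
    globals()[dag.dag_id] = dag
```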

Teaching an old DAG new tricks

2020-07-01
session

Scribd is migrating its data pipeline from an in-house system to Airflow. It’s one giant data pipeline consisting of more than 1,500 tasks. In this talk, I would like to share a couple of best practices for setting up a cloud-native Airflow deployment in AWS. For those who are interested in migrating a non-trivial data pipeline to Airflow, I will also share how Scribd plans and executes the migration. Here are some of the topics that will be covered: How to set up a highly available Airflow cluster in AWS using both ECS and EKS with Terraform. How to manage Airflow DAGs across multiple git repositories. How we manage Airflow variables using a custom Airflow Terraform provider. Best practices on monitoring multiple Airflow clusters with Datadog and PagerDuty. How we extended Airflow to reach feature parity with Scribd’s in-house orchestration system. How to plan and execute non-trivial data pipeline migrations: we transcompiled our internal DSL to Airflow DAGs to simulate what a real run would look like and surface performance issues early in the process. How we fixed an Airflow performance bottleneck so our giant DAG can be properly rendered in the web UI. For detailed deep dives on some of the topics mentioned above, please check out our blog post series at https://tech.scribd.com/tag/airflow-series/ Slides: https://docs.google.com/presentation/d/e/2PACX-1vRb-iH5NX2d7m-rQ7WGc6XlRvRCADwXq2hdjRjRuJ5h7e9ybfoUA13ytxpHgx7JG815fIKEE-QKuRUV/pub?start=false&loop=false&delayms=3000

Testing Airflow workflows - ensuring your DAGs work before going into production

2020-07-01
session

How do you ensure your workflows work before deploying to production? In this talk I’ll go over various ways to ensure your code works as intended, both at the task and the DAG level. Topics covered: how to test and debug tasks locally, how to test with and without task instance context, how to test against external systems (e.g., how to test a PostgresOperator), and how to test the integration of multiple tasks to ensure they work nicely together.
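
For the task-level cases, one simple approach (a sketch under an assumed operator and import path, not the speaker's code) is to call an operator's execute() directly from a unit test, so no scheduler or metadata database is needed.

```python
# Minimal task-level test sketch: call execute() directly so the command runs locally
# without a scheduler. The operator and command are illustrative.
from airflow.operators.bash_operator import BashOperator


def test_bash_task_runs_locally():
    task = BashOperator(task_id="say_hello", bash_command="echo hello")
    # execute() runs the command synchronously and raises if it fails.
    task.execute(context={})
```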

Using Airflow to speed up development of data intensive tools

2020-07-01
session
Blaine Elliot (One Medical)

In this talk we review how Airflow helped create a tool to detect data anomalies. Leveraging Airflow for process management, database interoperability, and authentication created an easy path to achieving scale, decreasing development time, and passing security audits. While Airflow is generally looked at as a solution for managing data pipelines, integrating tools with Airflow can also speed up development of those tools. The Data Anomaly Detector was created at One Medical to scan thousands of metrics per day for data anomalies. It’s a complicated tool, and much of that complexity was outsourced to Airflow. Because the data infrastructure at One Medical was already built around Airflow, and Airflow had many desirable features, it made sense to build the tool to integrate closely with Airflow. The end result was that more time could be spent building features for statistical analysis, and less effort had to be spent on database authentication, interoperability, or process management. It’s an interesting example of how Airflow can be leveraged to build data-intensive tools.

What open source taught us about business

2020-07-01
session
Karolina Rosol (Snowflake), Maciej Oczko (Snowflake)

This talk shares Polidea’s journey from a mobile app development studio to an OSS-oriented business partner. We will tell you the story of our path to code leadership over the years. We are also going to share the challenges and practical insights from managing open source projects in our company. After this talk, you will know how we approached combining open source, business, and team management without forgetting the human aspect. This is a sponsored talk, presented by Polidea.