Learn how Devoted Health went from cron jobs to an Airflow deployment on Kubernetes using a combination of open source and internal tooling. Devoted Health, a Medicare Advantage startup, went from cron jobs to Airflow on Kubernetes in a short period of time. This journey is a common one, but it still has a steep learning curve for new Airflow users. This talk will give you a blueprint to follow by covering the tools we use, best practices, and lessons learned. We'll share Devoted's approach to managing our deployment, monitoring the platform, and developing, testing, and deploying DAGs. This includes internal tooling we've written that allows Data Scientists to work with Airflow without worrying about Airflow itself.
BigQuery is GCP's serverless, highly scalable and cost-effective cloud data warehouse that can analyze petabytes of data at super-fast speeds. Amazon S3 is one of the oldest and most popular cloud storage offerings. Folks with data in S3 often want to use BigQuery to gain insights into their data. Using Apache Airflow, they can build pipelines to seamlessly orchestrate that connection. In this talk, Leah walks through how they created an easily configurable pipeline to extract that data. When a team at work mentioned wanting to set up a repeatable process for migrating data stored in S3 to BigQuery, Leah knew using Cloud Composer (GCP-hosted Airflow) was the right tool for the job, but she didn't have much experience with the proprietary file types the data used. Luckily, one of her colleagues did have experience with those file types, though they hadn't worked with Airflow. Leah and her colleague teamed up to build a reusable, easily configurable solution for the team. She will walk you through their problem, the solution, and the process they took to arrive at that solution, highlighting resources that were especially useful to a first-time Airflow user.
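A minimal sketch of the S3-to-BigQuery orchestration pattern this abstract describes, not the presenters' actual pipeline: the bucket names, dataset names, and the assumption that the files land in a BigQuery-readable format such as CSV are all hypothetical (the proprietary-format handling from the talk is not shown).

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_gcs import S3ToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="s3_to_bigquery_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Stage the S3 objects into a GCS bucket that BigQuery can load from.
    stage_to_gcs = S3ToGCSOperator(
        task_id="stage_to_gcs",
        bucket="example-s3-bucket",                  # hypothetical S3 bucket
        prefix="exports/",
        dest_gcs="gs://example-gcs-bucket/exports/",
        replace=True,
    )

    # Load the staged files into a BigQuery table.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="example-gcs-bucket",
        source_objects=["exports/*.csv"],
        destination_project_dataset_table="example_project.analytics.exports",
        source_format="CSV",
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )

    stage_to_gcs >> load_to_bq
```

On Cloud Composer a DAG like this should run largely unchanged; the main extra step is adding an AWS connection alongside the environment's default GCP credentials.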
At Bluevine we use Airflow to drive our ML platform. In this talk, Noam presents the challenges and gains we experienced in transitioning from a single server running Python scripts with cron to a full-blown Airflow setup. This includes supporting multiple Python versions, event-driven DAGs, performance issues, and more! Some of the points that I'll cover are:
- Supporting multiple Python versions
- Event-driven DAGs
- Airflow performance issues and how we circumvented them
- Building Airflow plugins to enhance observability
- Monitoring Airflow using Grafana
- CI for Airflow DAGs (super useful! see the test sketch below)
- Patching the Airflow scheduler
Slides
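To make the "CI for Airflow DAGs" point concrete, here is a minimal DAG-integrity test of the kind such a pipeline typically runs; the dags/ path and the specific checks are assumptions, not Bluevine's actual suite.

```python
# Run by CI on every commit: fail the build if any DAG file cannot be imported
# or if a DAG ends up with no tasks.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    return DagBag(dag_folder="dags/", include_examples=False)


def test_no_import_errors(dag_bag):
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"


def test_dags_have_tasks(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tasks, f"{dag_id} contains no tasks"
```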
In this talk we will share some of the lessons we have learned after using Airflow for a couple of years and growing from 2 users to 8 teams. We cover: establishing a reliable review process on Airflow, managing multiple Airflow configurations, and data versioning.
Astronomer is focused on improving Airflow's user experience through the entire lifecycle — from authoring and testing DAGs, to building containers and deploying the DAGs, to running and monitoring both the DAGs and the infrastructure that they are operating within — with an eye towards increased security and governance as well. In this talk we walk you through some current UX challenges, give an overview of how the Astronomer platform addresses the major ones, and provide a sneak peek of what we're working on in the coming months to improve Airflow's user experience. This is a sponsored talk, presented by Astronomer.
Bolke and Maxime tell us about the past and present of Apache Airflow.
A team of core committers explain what is coming to Airflow 2.0.
In this talk, colleagues from Airbnb, Twitter and Lyft share details about how they are using Apache Airflow to power their data pipelines. Slides: Tao Feng (Lyft), Dan Davydov (Twitter)
Gris Cuevas shares some statistics about the state of D&I at the Apache Software Foundation and the initiatives the foundation is taking to make projects more diverse and inclusive. Then, Aizhamal shares her own journey of becoming an open source contributor, and dives into project-specific initiatives that help make Apache Airflow one of the most sustainable projects in open source.
This talk discusses how to build an Airflow-based data platform that can take advantage of popular ML tools (Jupyter, Tensorflow, Spark) while creating an easy-to-manage/monitor ecosystem. As the field of data science grows in popularity, companies find themselves in need of a single common language that can connect their data science teams and data infrastructure teams. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports. This talk will discuss how to build an Airflow-based data platform that can take advantage of popular ML tools (Jupyter, Tensorflow, Spark) while creating an easy-to-manage/monitor ecosystem for data infrastructure and support teams. In this talk, we will take an idea from a single-machine Jupyter Notebook to a cross-service Spark + Tensorflow pipeline, to a canary-tested, production-ready model served on Google Cloud Functions. We will show how Apache Airflow can connect all layers of a data team to deliver rapid results.
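As a rough illustration only, here is a compressed sketch of the notebook-to-production progression the abstract describes; every path, connection id, and command below is a hypothetical stand-in rather than the presenters' pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.papermill.operators.papermill import PapermillOperator


def canary_check(**_):
    # Placeholder for comparing the new model's metrics against the live model.
    print("canary metrics look acceptable")


with DAG(
    dag_id="notebook_to_production",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    explore = PapermillOperator(
        task_id="run_feature_notebook",
        input_nb="notebooks/features.ipynb",            # hypothetical notebook
        output_nb="output/features-{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},
    )

    train = SparkSubmitOperator(
        task_id="train_model",
        application="jobs/train_tf_model.py",            # hypothetical Spark + TF job
        conn_id="spark_default",
    )

    canary = PythonOperator(task_id="canary_check", python_callable=canary_check)

    deploy = BashOperator(
        task_id="deploy_to_cloud_functions",
        bash_command="gcloud functions deploy serve_model --runtime python39 --trigger-http --source ./serving",
    )

    explore >> train >> canary >> deploy
```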
At Nielsen Identity Engine, we use Spark to process tens of TBs of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day. In this talk, we'll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.
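A minimal sketch of the native Spark-on-K8s integration mentioned above (operator and sensor from the cncf.kubernetes provider as shipped around the time of this talk), assuming a SparkApplication manifest named spark_pi.yaml in the DAG folder, a hypothetical spark-jobs namespace, and a configured kubernetes_default connection.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

with DAG(
    dag_id="spark_on_k8s_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit the SparkApplication CRD to the cluster.
    submit = SparkKubernetesOperator(
        task_id="submit_spark_app",
        namespace="spark-jobs",
        application_file="spark_pi.yaml",
        kubernetes_conn_id="kubernetes_default",
        do_xcom_push=True,
    )

    # Wait for the submitted application to reach a terminal state.
    wait = SparkKubernetesSensor(
        task_id="wait_for_spark_app",
        namespace="spark-jobs",
        application_name="{{ task_instance.xcom_pull(task_ids='submit_spark_app')['metadata']['name'] }}",
        kubernetes_conn_id="kubernetes_default",
    )

    submit >> wait
```

Because the submit/wait pair replaces an EMR submit/poll pair, the surrounding DAG structure barely changes when a job moves to Kubernetes.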
In this talk Anita showcases how to use the newly released Airflow Backport Providers. Some of the topics we will cover are:
- How to install them in Airflow 1.10.x
- How to install them in Composer
- How to migrate one or more DAGs from legacy to new providers (see the sketch below)
- Known bugs and fixes
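A small sketch of what the legacy-to-provider migration looks like in DAG code, using the Google backport provider as one example; the DAG, task, and query are hypothetical.

```python
# Backport providers are installed as separate packages on Airflow 1.10.x, e.g.:
#   pip install apache-airflow-backport-providers-google
from datetime import datetime

from airflow import DAG

# Legacy contrib import on Airflow 1.10.x:
#   from airflow.contrib.operators.bigquery_operator import BigQueryOperator
# The same functionality via the backported provider (also the Airflow 2.0 path):
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

with DAG(
    dag_id="provider_migration_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    run_query = BigQueryExecuteQueryOperator(
        task_id="run_query",
        sql="SELECT 1",        # hypothetical query
        use_legacy_sql=False,
    )
```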
How do you create fast and painless delivery of new DAGs into production? When running Airflow at scale, it becomes a big challenge to manage the full lifecycle around your pipelines; making sure that DAGs are easy to develop, test, and ship into prod. In this talk, we will cover our suggested approach to building a proper CI/CD cycle that ensures the quality and fast delivery of production pipelines. CI/CD is the practice of delivering software from dev to prod, optimized for fast iteration and quality control. In the data engineering context, DAGs are just another piece of software that require some form of lifecycle management. Traditionally, DAGs have been thought of as relatively static, but the new wave of analytics and machine learning efforts require more agile DAG development, in line with how agile software engineering teams build and ship code. In this session, we will dive into the challenges of building CI/CD cycles for Airflow DAGs. We will focus on a pipeline that involves Apache Spark as an extra dimension of real-world complexity, walking through a typical flow of DAG authoring, debugging, and testing, from local to staging to prod environments. We will offer best practices and discuss open-source tools you can use to easily build your own smooth cycle for Airflow CI/CD.
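One building block of such a CI stage is a structural test that pins the expected shape of a pipeline DAG so accidental changes surface in review; a minimal sketch, with a hypothetical DAG id and task ids.

```python
from airflow.models import DagBag


def test_spark_pipeline_structure():
    dag = DagBag(dag_folder="dags/", include_examples=False).get_dag("spark_etl")
    assert dag is not None, "spark_etl DAG failed to load"

    # The pipeline must contain its three core stages...
    assert {"extract", "spark_transform", "load"} <= set(dag.task_ids)
    # ...and the Spark transform must run strictly after extraction.
    assert "spark_transform" in dag.get_task("extract").downstream_task_ids
```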
This talk will guide you through the internals of the official Production Docker Image of Airflow. It will show you the use cases it is designed for and how to use it in conjunction with the Official Helm Chart to make your own deployments.
In the contemporary world, security is more important than ever, and Airflow installations are no exception. Google Cloud Platform and Cloud Composer offer useful security options for running your DAGs and tasks in a way that lets you effectively manage the risk of data exfiltration and limit access to the system. This is a sponsored talk, presented by Google Cloud.
In this talk, we share the lessons learned while building a scheduler-as-a-service leveraging Apache Airflow to achieve improved stability and security for one of the largest gaming companies. The platform integrates with different data sources and meets varied SLAs across workflows owned by multiple game studios. In particular, we present a comprehensive self-serve Airflow architecture with multi-tenancy, auto-DAG generation, SSO integration, and improved ease of deployment. Within Electronic Arts, to provide scheduler-as-a-service and to support hundreds of thousands of execution workflows, each team requires an isolated environment with access to a central data lake containing several petabytes of anonymized player and game metrics. Leveraging Airflow, each team is provided a private code repository and namespace with which they can deploy their DAGs at their own behest. To support agile development cycles, a private testing sandbox and auto-deployment to an isolated multi-tenant Airflow platform have been made available to game studios. In production, a single dockerized Airflow deployment on Kubernetes is utilized to ensure high availability and single-step deployment. Custom SSO integration and RBAC-based operator and sensor whitelisting allow for secure logical isolation. In addition, dynamic DAG instantiation capability helps address varied SLAs during game launch seasons that are staggered through a financial year.
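A minimal sketch of the auto-DAG-generation idea described above: one template stamped out per tenant from declarative config. The team names, schedules, and task body are hypothetical, not EA's implementation.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical per-tenant configuration; in practice this might come from
# files in each team's private repository.
TENANT_CONFIGS = {
    "studio_a": {"schedule": "@hourly"},
    "studio_b": {"schedule": "@daily"},
}


def build_dag(tenant: str, schedule: str) -> DAG:
    with DAG(
        dag_id=f"{tenant}_metrics_load",
        start_date=datetime(2021, 1, 1),
        schedule_interval=schedule,
        catchup=False,
        tags=[tenant],
    ) as dag:
        PythonOperator(
            task_id="load_metrics",
            python_callable=lambda: print(f"loading metrics for {tenant}"),
        )
    return dag


# Register one DAG per tenant so the scheduler can discover them.
for tenant, cfg in TENANT_CONFIGS.items():
    globals()[f"{tenant}_metrics_load"] = build_dag(tenant, cfg["schedule"])
```

In a sketch like this, per-studio schedules and SLAs are adjusted by editing config rather than DAG code, which is the appeal of the pattern.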
Scribd is migrating its data pipeline from an in-house system to Airflow. It's one big, giant data pipeline consisting of more than 1,500 tasks. In this talk, I would like to share a couple of best practices on setting up a cloud-native Airflow deployment in AWS. For those who are interested in migrating a non-trivial data pipeline to Airflow, I will also share how Scribd plans and executes the migration. Here are some of the topics that will be covered:
- How to set up a highly available Airflow cluster in AWS using both ECS and EKS with Terraform.
- How to manage Airflow DAGs across multiple git repositories.
- How we manage Airflow variables using a custom Airflow Terraform provider.
- Best practices on monitoring multiple Airflow clusters with Datadog and Pagerduty.
- How we extended Airflow to reach feature parity with Scribd's in-house orchestration system.
- How to plan and execute non-trivial data pipeline migrations. We transcompiled our internal DSL to Airflow DAGs to simulate what a real run would look like, surfacing performance issues early in the process.
- How we fixed an Airflow performance bottleneck so our giant DAG could be properly rendered in the Web UI.
For detailed deep dives on some of the topics mentioned above, please check out our blog post series at https://tech.scribd.com/tag/airflow-series/
Slides: https://docs.google.com/presentation/d/e/2PACX-1vRb-iH5NX2d7m-rQ7WGc6XlRvRCADwXq2hdjRjRuJ5h7e9ybfoUA13ytxpHgx7JG815fIKEE-QKuRUV/pub?start=false&loop=false&delayms=3000
How do you ensure your workflows work before deploying to production? In this talk I'll go over various ways to ensure your code works as intended, both on a task and a DAG level. In this talk I cover:
- How to test and debug tasks locally (see the sketch below)
- How to test with and without task instance context
- How to test against external systems, e.g. how to test a PostgresOperator
- How to test the integration of multiple tasks to ensure they work nicely together
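A minimal sketch of the task-level testing styles listed above: a plain unit test of the business logic, and a test that drives an operator by calling execute() directly with a hand-built (here empty) context. The multiplier function is a hypothetical stand-in for real task code.

```python
from airflow.operators.bash import BashOperator


def multiply(value: int) -> int:
    return value * 10


def test_callable_without_airflow_context():
    # Plain unit test of the business logic, no Airflow machinery involved.
    assert multiply(4) == 40


def test_bash_operator_execute():
    # Run a single task outside the scheduler by calling execute() directly.
    task = BashOperator(task_id="say_hello", bash_command="echo hello")
    result = task.execute(context={})
    # On Airflow 2.x, execute() returns the last line of stdout.
    assert result == "hello"
```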
In this talk we review how Airflow helped create a tool to detect data anomalies. Leveraging Airflow for process management, database interoperability, and authentication created an easy path forward to achieve scale, decrease the development time and pass security audits. While Airflow is generally looked at as a solution to manage data pipelines, integrating tools with Airflow can also speed up development of those tools. The Data Anomaly Detector was created at One Medical to scan thousands of metrics per day for data anomalies. It's a complicated tool and much of that complexity was outsourced to Airflow. Because the data infrastructure at One Medical was already built around Airflow, and Airflow had many desirable features, it made sense to build the tool to integrate closely with Airflow. The end result was that more time could be spent on building features to do statistical analysis, and less effort had to be spent on database authentication, interoperability or process management. It's an interesting example of how Airflow can be leveraged to build data-intensive tools.
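A minimal sketch of the pattern the abstract points at: Airflow's Hooks and Connections handle database authentication and interoperability, so the task body can focus on the anomaly check itself. The connection id, SQL, and threshold are hypothetical, and the naive threshold stands in for the real statistical checks.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_daily_signups(**_):
    # Credentials and host details live in the Airflow connection, not in code.
    hook = PostgresHook(postgres_conn_id="warehouse")
    (count,) = hook.get_first(
        "SELECT count(*) FROM signups WHERE created_at::date = current_date"
    )
    if count < 100:  # naive threshold standing in for real statistical analysis
        raise ValueError(f"Anomalously low signup count: {count}")


with DAG(
    dag_id="signup_anomaly_check",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="check_daily_signups", python_callable=check_daily_signups)
```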