Keynote: Airflow then and now
Bolke and Maxime tell us about the past and present of Apache Airflow.
Airflow Summit 2020 program
Sessions & talks
A team of core committers explains what is coming in Airflow 2.0.
In this talk, colleagues from Airbnb, Twitter and Lyft share details about how they are using Apache Airflow to power their data pipelines. Slides: Tao Feng (Lyft), Dan Davydov (Twitter)
Gris Cuevas shares some statistics about the state of D&I at the Apache Software Foundation, as well as the initiatives the foundation is taking to make projects more diverse and inclusive. Then, Aizhamal shares her own journey of becoming an open source contributor and dives into project-specific initiatives that help Apache Airflow be one of the most sustainable projects in open source.
As the field of data science grows in popularity, companies find themselves in need of a single common language that can connect their data science teams and data infrastructure teams. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports. This talk discusses how to build an Airflow-based data platform that can take advantage of popular ML tools (Jupyter, Tensorflow, Spark) while creating an easy-to-manage and easy-to-monitor ecosystem for data infrastructure and support teams. In this talk, we will take an idea from a single-machine Jupyter Notebook to a cross-service Spark + Tensorflow pipeline, to a canary-tested, production-ready model served on Google Cloud Functions. We will show how Apache Airflow can connect all layers of a data team to deliver rapid results.
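As a rough sketch of the kind of pipeline this talk describes, a DAG could chain notebook execution, model training and a canary deployment. The task names and plain PythonOperator callables below are placeholders rather than the tooling actually used in the talk:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


# Hypothetical callables standing in for the real notebook, training,
# and deployment logic discussed in the talk.
def run_notebook(**_):
    print("execute a parameterized Jupyter notebook")


def train_model(**_):
    print("launch a Spark + Tensorflow training job")


def canary_deploy(**_):
    print("deploy the model to Google Cloud Functions behind a canary")


with DAG(
    dag_id="ml_platform_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    notebook = PythonOperator(task_id="run_notebook", python_callable=run_notebook)
    training = PythonOperator(task_id="train_model", python_callable=train_model)
    deploy = PythonOperator(task_id="canary_deploy", python_callable=canary_deploy)

    notebook >> training >> deploy
```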
At Nielsen Identity Engine, we use Spark to process tens of terabytes of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day. In this talk, we'll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.
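For context, a minimal sketch of that native integration as it is exposed in the cncf.kubernetes provider package: a task submits a SparkApplication manifest for the GCP Spark-on-K8s operator running in the cluster, and a sensor waits for it to complete. The namespace and manifest file name here are placeholders, not details from the talk:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import (
    SparkKubernetesSensor,
)

with DAG(
    dag_id="spark_on_k8s_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:
    # Submits a SparkApplication custom resource defined in a YAML manifest.
    submit = SparkKubernetesOperator(
        task_id="spark_submit",
        namespace="spark-jobs",             # placeholder namespace
        application_file="spark_etl.yaml",  # placeholder SparkApplication manifest
        do_xcom_push=True,
    )

    # Waits until the SparkApplication reaches a terminal state.
    monitor = SparkKubernetesSensor(
        task_id="spark_monitor",
        namespace="spark-jobs",
        application_name="{{ task_instance.xcom_pull(task_ids='spark_submit')['metadata']['name'] }}",
    )

    submit >> monitor
```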
In this talk, Anita showcases how to use the newly released Airflow Backport Providers. Some of the topics covered are: how to install them in Airflow 1.10.x, how to install them in Composer, how to migrate one or more DAGs from the legacy to the new providers, and known bugs and fixes.
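As a quick illustration of the migration the talk describes (the Google backport package is used as the example; the DAG, task and query are placeholders), installing a backport package on Airflow 1.10.x lets a DAG switch from the legacy contrib import path to the Airflow 2.0-style providers path:

```python
# On an Airflow 1.10.x environment (or via Cloud Composer's PyPI packages list):
#   pip install apache-airflow-backport-providers-google

from datetime import datetime

from airflow import DAG
# Legacy import, shown only as a comment for comparison:
#   from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryExecuteQueryOperator,
)

with DAG(
    dag_id="backport_providers_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:
    # Same task as before, now using the provider operator.
    run_query = BigQueryExecuteQueryOperator(
        task_id="run_query",
        sql="SELECT 1",  # placeholder query
        use_legacy_sql=False,
    )
```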
How do you create fast and painless delivery of new DAGs into production? When running Airflow at scale, it becomes a big challenge to manage the full lifecycle around your pipelines; making sure that DAGs are easy to develop, test, and ship into prod. In this talk, we will cover our suggested approach to building a proper CI/CD cycle that ensures the quality and fast delivery of production pipelines. CI/CD is the practice of delivering software from dev to prod, optimized for fast iteration and quality control. In the data engineering context, DAGs are just another piece of software that require some form of lifecycle management. Traditionally, DAGs have been thought of as relatively static, but the new wave of analytics and machine learning efforts require more agile DAG development, in line with how agile software engineering teams build and ship code. In this session, we will dive into the challenges of building CI/CD cycles for Airflow DAGs. We will focus on a pipeline that involves Apache Spark as an extra dimension of real-world complexity, walking through a typical flow of DAG authoring, debugging, and testing, from local to staging to prod environments. We will offer best practices and discuss open-source tools you can use to easily build your own smooth cycle for Airflow CI/CD.
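One common building block for such a CI/CD cycle is a DAG integrity test that runs in CI before DAGs are shipped. A minimal sketch, assuming the DAGs live in a dags/ folder and pytest is the test runner (both assumptions, not details from the talk):

```python
# test_dag_integrity.py (hypothetical file name) - fails the CI build if any
# DAG in the repository cannot be imported.
from airflow.models import DagBag


def test_dags_load_without_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    # Syntax errors, missing imports and broken dependencies show up here.
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"


def test_every_dag_has_an_owner():
    # Example of an additional policy check a team might enforce in CI.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get("owner"), f"{dag_id} has no owner set"
```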
This talk will guide you through the internals of the official Production Docker Image of Airflow. It will show you the foreseen use cases for it and how to use it in conjunction with the Official Helm Chart to make your own deployments.
In the contemporary world, security is more important than ever - and Airflow installations are no exception. Google Cloud Platform and Cloud Composer offer useful security options for running your DAGs and tasks in a way that lets you effectively manage the risk of data exfiltration and limit access to the system. This is a sponsored talk, presented by Google Cloud.
In this talk, we share the lessons learned while building a scheduler-as-a-service leveraging Apache Airflow to achieve improved stability and security for one of the largest gaming companies. The platform integrates with different data sources and meets varied SLAs across workflows owned by multiple game studios. In particular, we present a comprehensive self-serve Airflow architecture with multi-tenancy, auto-DAG generation, and SSO integration, with improved ease of deployment. Within Electronic Arts, to provide scheduler-as-a-service and to support hundreds of thousands of execution workflows, each team requires an isolated environment with access to a central data lake containing several petabytes of anonymized player and game metrics. Leveraging Airflow, each team is provided a private code repository and namespace with which they can deploy their DAGs at their own pace. To support agile development cycles, a private testing sandbox and auto-deployment to an isolated multi-tenant Airflow platform have been made available to game studios. In production, a single Dockerized Airflow deployment on Kubernetes is used to ensure high availability and single-step deployment. Custom SSO integration and RBAC-based operator and sensor whitelisting allow for secure logical isolation. In addition, dynamic DAG instantiation capability helps address varied SLAs during game launch seasons that are staggered throughout the financial year.
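The auto-DAG generation mentioned above is commonly implemented with a DAG factory registered into the module namespace; a minimal sketch, with hypothetical per-studio configuration and a placeholder task (not the actual Electronic Arts implementation), could look like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical per-studio configuration; in a real setup this would likely be
# loaded from a config store or each team's own repository.
STUDIO_CONFIGS = {
    "studio_a": {"schedule": "@hourly"},
    "studio_b": {"schedule": "@daily"},
}


def create_dag(studio, config):
    dag = DAG(
        dag_id=f"{studio}_pipeline",
        start_date=datetime(2020, 1, 1),
        schedule_interval=config["schedule"],
        catchup=False,
    )
    with dag:
        DummyOperator(task_id="placeholder_task")
    return dag


# Register each generated DAG in the module namespace so the scheduler's
# DAG parser picks it up.
for studio, config in STUDIO_CONFIGS.items():
    globals()[f"{studio}_pipeline"] = create_dag(studio, config)
```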
Scribd is migrating its data pipeline from an in-house system to Airflow. It is one giant data pipeline consisting of more than 1,500 tasks. In this talk, I would like to share a couple of best practices on setting up a cloud-native Airflow deployment in AWS. For those who are interested in migrating a non-trivial data pipeline to Airflow, I will also share how Scribd plans and executes the migration. Here are some of the topics that will be covered: how to set up a highly available Airflow cluster in AWS using both ECS and EKS with Terraform; how to manage Airflow DAGs across multiple git repositories; how we manage Airflow variables using a custom Airflow Terraform provider; best practices on monitoring multiple Airflow clusters with Datadog and PagerDuty; how we extended Airflow to reach feature parity with Scribd's in-house orchestration system; how to plan and execute non-trivial data pipeline migrations (we transcompiled our internal DSL to Airflow DAGs to simulate what a real run would look like and surface performance issues early in the process); and how we fixed an Airflow performance bottleneck so our giant DAG can be properly rendered in the web UI. For detailed deep dives on some of the topics mentioned above, please check out our blog post series at https://tech.scribd.com/tag/airflow-series/ Slides: https://docs.google.com/presentation/d/e/2PACX-1vRb-iH5NX2d7m-rQ7WGc6XlRvRCADwXq2hdjRjRuJ5h7e9ybfoUA13ytxpHgx7JG815fIKEE-QKuRUV/pub?start=false&loop=false&delayms=3000
How do you ensure your workflows work before deploying to production? In this talk I'll go over various ways to ensure your code works as intended - both on a task and a DAG level. I cover: how to test and debug tasks locally, how to test with and without task instance context, how to test against external systems (e.g. how to test a PostgresOperator), and how to test the integration of multiple tasks to ensure they work nicely together.
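As a small example of the task-level testing this talk covers, an operator can be executed directly in a pytest test without a scheduler or metadata database; the BashOperator and file-writing command below are only an illustration, not taken from the talk:

```python
from airflow.operators.bash_operator import BashOperator


def test_bash_operator_writes_file(tmp_path):
    """Task-level test: run a single operator's execute() in isolation."""
    out_file = tmp_path / "out.txt"
    task = BashOperator(
        task_id="write_file",
        bash_command=f"echo hello > {out_file}",
    )
    # Operators that don't rely on the task instance context can be
    # executed with an empty context dict.
    task.execute(context={})
    assert out_file.read_text().strip() == "hello"
```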
In this talk we review how Airflow helped create a tool to detect data anomalies. Leveraging Airflow for process management, database interoperability, and authentication created an easy path forward to achieve scale, decrease development time, and pass security audits. While Airflow is generally looked at as a solution to manage data pipelines, integrating tools with Airflow can also speed up development of those tools. The Data Anomaly Detector was created at One Medical to scan thousands of metrics per day for data anomalies. It's a complicated tool, and much of that complexity was outsourced to Airflow. Because the data infrastructure at One Medical was already built around Airflow, and Airflow had many desirable features, it made sense to build the tool to integrate closely with Airflow. The end result was that more time could be spent on building features for statistical analysis, and less effort had to be spent on database authentication, interoperability, or process management. It's an interesting example of how Airflow can be leveraged to build data-intensive tools.