talk-data.com

Event

Airflow Summit 2020

2020-07-01 · Airflow Summit

Activities tracked

5

Airflow Summit 2020 program

Filtering by: AI/ML

Sessions & talks

Showing 1–5 of 5 · Newest first


Airflow as an elastic ETL tool

2020-07-01
session
Vicente Rubén Del Pino Ruiz (UnitedHealth Group), Hendrik Kleine (Optum)

In search of a better, modern, simple way of managing ETL processes and merging them with various AI and ML tasks, we landed on Airflow. We envisioned a new user-friendly interface that leverages dynamic DAGs and reusable components to build an ETL tool that requires virtually no training. We built several template DAGs and connectors from Airflow to typical data sources, such as SQL Server, then built a modern interface on top that provides ETL build, scheduling, and execution capabilities. Acknowledging that Airflow is designed for task orchestration, we expanded our infrastructure to use Kubernetes and Docker for elastic computing. Key to our solution is the ability to create ETLs using only open-source tools, while executing on par with or faster than commercial solutions, with an interface so simple that ETLs can be created in seconds.
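The template-plus-dynamic-generation pattern the abstract describes is commonly implemented by instantiating one DAG per source config in a loop. A minimal sketch of that pattern follows; the `DagTemplate` stand-in class, the source configs, and the step names are all hypothetical, chosen so the example runs without an Airflow install, and are not the speakers' actual code.

```python
# Sketch of config-driven, templated DAG generation. With real Airflow,
# DagTemplate would be airflow.DAG and each step a concrete operator.
from dataclasses import dataclass, field


@dataclass
class DagTemplate:
    dag_id: str
    tasks: list = field(default_factory=list)


# Hypothetical per-source configs (e.g. loaded from YAML by a UI layer).
SOURCES = [
    {"name": "sql_server_sales"},
    {"name": "sql_server_hr"},
]


def build_etl_dag(cfg: dict) -> DagTemplate:
    """Instantiate one ETL DAG from a reusable extract->load->validate template."""
    dag = DagTemplate(dag_id=f"etl_{cfg['name']}")
    for step in ("extract", "load", "validate"):
        dag.tasks.append(f"{step}_{cfg['name']}")
    return dag


# With real Airflow, each generated DAG is typically registered via
# globals()[dag.dag_id] = dag so the scheduler discovers it.
DAGS = {d.dag_id: d for d in (build_etl_dag(c) for c in SOURCES)}
```

Adding a new source then means adding one config entry rather than writing a new DAG file, which is what makes "ETLs created in seconds" plausible.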

From Zero to Airflow: bootstrapping a ML platform

2020-07-01
session
Noam Elfanbaum (Bluevine)

At Bluevine we use Airflow to drive our ML platform. In this talk, Noam presents the challenges and gains Bluevine encountered in transitioning from a single server running Python scripts with cron to a full-blown Airflow setup. Points covered include:

- Supporting multiple Python versions
- Event-driven DAGs
- Airflow performance issues and how they were circumvented
- Building Airflow plugins to enhance observability
- Monitoring Airflow using Grafana
- CI for Airflow DAGs (super useful!)
- Patching the Airflow scheduler

Slides are available.
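Event-driven DAGs, one of the topics mentioned above, rest on the sensor pattern: poll an external condition and only proceed once it holds. A minimal sketch of that polling loop, with a made-up `check` callable standing in for "upstream data landed" (Airflow's built-in sensors wrap the same idea with richer timeout and rescheduling behavior):

```python
# Sketch of the poke-style sensor loop behind event-driven DAGs.
import time


def wait_for_event(check, timeout_s=5.0, poke_interval_s=0.1):
    """Poll check() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True  # event observed: downstream tasks may run
        time.sleep(poke_interval_s)
    return False  # timed out: in Airflow this would fail or reschedule


# Example: a flag standing in for an upstream system signaling readiness.
events = {"data_landed": True}
triggered = wait_for_event(lambda: events["data_landed"])
```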

Keynote: How large companies use Airflow for ML and ETL pipelines

2020-07-01
session
Tao Feng (Lyft | Airflow PMC member), Dan Davydov (Twitter | Airflow PMC member), Kevin Yang (Airbnb | Airflow PMC member)

In this talk, colleagues from Airbnb, Twitter, and Lyft share details about how they use Apache Airflow to power their data pipelines. Slides are available from Tao Feng (Lyft) and Dan Davydov (Twitter).

Machine Learning with Apache Airflow

2020-07-01
session

As the field of data science grows in popularity, companies find themselves in need of a single common language that can connect their data science teams and data infrastructure teams. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports. This talk discusses how to build an Airflow-based data platform that takes advantage of popular ML tools (Jupyter, TensorFlow, Spark) while creating an easy-to-manage, easy-to-monitor ecosystem for data infrastructure and support teams. The talk takes an idea from a single-machine Jupyter notebook to a cross-service Spark + TensorFlow pipeline, and then to a canary-tested, production-ready model served on Google Cloud Functions, showing how Apache Airflow can connect all layers of a data team to deliver rapid results.
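The promotion flow the abstract outlines (train, evaluate, canary-test, deploy only on success) can be sketched as a chain of stages, each of which would be one Airflow task in practice. Everything here is illustrative: the constant-mean "model", the error metric, and the `max_error` threshold are stand-ins, not the speaker's actual pipeline.

```python
# Hedged sketch of a train -> evaluate -> canary -> deploy gate.
def train(data):
    # Stand-in "model": predict the mean of the training labels.
    return sum(data) / len(data)


def evaluate(model, holdout):
    # Mean absolute error of the constant predictor.
    return sum(abs(model - y) for y in holdout) / len(holdout)


def canary_passes(model, live_sample, max_error=2.0):
    # Gate on error against a small slice of live traffic.
    return evaluate(model, live_sample) <= max_error


def pipeline(train_data, holdout, live_sample):
    model = train(train_data)
    offline_err = evaluate(model, holdout)
    deployed = canary_passes(model, live_sample)  # deploy step would go here
    return {"deployed": deployed, "offline_error": offline_err}
```

In an Airflow deployment, the canary gate maps naturally onto branching: a passing canary routes to the deploy task, a failing one to an alerting task.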

Pipelines on pipelines: Agile CI/CD workflows for Airflow DAGs

2020-07-01
session
Victor Shafran (Databand)

How do you create fast and painless delivery of new DAGs into production? When running Airflow at scale, it becomes a big challenge to manage the full lifecycle around your pipelines: making sure that DAGs are easy to develop, test, and ship into prod. In this talk, we cover our suggested approach to building a proper CI/CD cycle that ensures both quality and fast delivery of production pipelines. CI/CD is the practice of delivering software from dev to prod, optimized for fast iteration and quality control. In the data engineering context, DAGs are just another piece of software that requires some form of lifecycle management. Traditionally, DAGs have been thought of as relatively static, but the new wave of analytics and machine learning efforts requires more agile DAG development, in line with how agile software engineering teams build and ship code. In this session, we dive into the challenges of building CI/CD cycles for Airflow DAGs. We focus on a pipeline that involves Apache Spark as an extra dimension of real-world complexity, walking through a typical flow of DAG authoring, debugging, and testing, from local to staging to prod environments. We offer best practices and discuss open-source tools you can use to easily build your own smooth cycle for Airflow CI/CD.
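One concrete check that CI pipelines for DAG repos commonly run is a structural test: verify every task graph is actually acyclic before it ships. A minimal sketch of that check follows; real setups would also import each DAG file and unit-test individual operators, and the dependency dicts here are illustrative, not from the talk.

```python
# Sketch of a CI-time acyclicity check on a task dependency graph.
def has_cycle(deps):
    """deps maps task -> list of downstream tasks; return True on a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    color = {t: WHITE for t in deps}

    def visit(task):
        color[task] = GRAY
        for nxt in deps.get(task, []):
            if color.get(nxt, WHITE) == GRAY:
                return True  # back edge: cycle found
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[task] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in deps)


# Example graphs a CI job might assert on:
good = {"extract": ["transform"], "transform": ["load"], "load": []}
bad = {"a": ["b"], "b": ["a"]}
```

Wired into CI (e.g. as a pytest), this fails the build before a broken DAG ever reaches the scheduler, which is the kind of fast feedback the talk advocates.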