talk-data.com

Event

Airflow Summit 2020
2020-07-01

Activities tracked: 3

Airflow Summit 2020 program

Filtering by: AWS

Sessions & talks


Airflow: A beast character in the gaming world

2020-07-01
session
Naresh Yegireddi (PlayStation), Patricio Garza (PlayStation)

A pioneer for the past 25 years, Sony PlayStation has played a vital role in the interactive gaming industry. With over 100 million monthly active users, more than 100 million PS4 consoles sold, and thousands of game development partners across the globe, big-data problems are inevitable. This presentation covers how we scaled Airflow horizontally, which helped us build a stable, scalable, and efficient data processing infrastructure powered by Apache Spark, AWS ECS, EC2, and Docker. To process large volumes of data and meet the organization's growing analytics demands, the data team at PlayStation took the initiative to build an open-source big-data processing infrastructure, with Apache Spark in Python as the core ETL engine and Apache Airflow as the core workflow management tool for the entire ecosystem. We started with an Airflow application running on a single AWS EC2 instance, supporting a parallelism of 16 with 1 scheduler and 1 worker, and eventually scaled it to a bigger scheduler along with 4 workers to support a parallelism of 96, DAG concurrency of 96, and a worker task concurrency of 24. Containerizing all the services on AWS ECS gave us the ability to scale Airflow horizontally.
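The scaled-out settings the abstract mentions correspond to a handful of values in `airflow.cfg`; a sketch, assuming the Airflow 1.10-era Celery deployment described in the talk:

```ini
[core]
# Maximum number of task instances that may run concurrently across the cluster
parallelism = 96
# Maximum number of task instances allowed to run within a single DAG
dag_concurrency = 96

[celery]
# Number of tasks each Celery worker process can execute at once
worker_concurrency = 24
```

With 4 workers at a concurrency of 24 each, the worker pool matches the cluster-wide parallelism of 96.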

Migrating Airflow-based Spark jobs to Kubernetes - the native way

2020-07-01
session
Roi Teveth (Nielsen Identity Engine), Itai Yaffe (Nielsen Identity Engine)

At Nielsen Identity Engine, we use Spark to process tens of TBs of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day. In this talk, we'll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.
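The GCP Spark-on-K8s operator drives Spark jobs through a `SparkApplication` custom resource, which the native Airflow integration submits to the cluster. A minimal sketch of such a manifest; the job name, namespace, image, entry point, and sizing below are illustrative assumptions, not values from the talk:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-etl        # hypothetical job name
  namespace: spark-jobs    # hypothetical namespace
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v2.4.5             # example image from the operator's docs
  mainClass: com.example.Etl                            # hypothetical entry point
  mainApplicationFile: local:///opt/spark/jars/etl.jar  # hypothetical jar path
  sparkVersion: "2.4.5"
  driver:
    cores: 1
    memory: 2g
  executor:
    instances: 4
    cores: 2
    memory: 4g
```

Because the manifest fully describes the Spark job, an Airflow DAG only needs a task that submits it, which is what keeps the DAG changes minimal.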

Teaching an old DAG new tricks

2020-07-01
session

Scribd is migrating its data pipeline from an in-house system to Airflow. It is one big data pipeline consisting of more than 1,500 tasks. In this talk, I would like to share a couple of best practices for setting up a cloud-native Airflow deployment in AWS. For those interested in migrating a non-trivial data pipeline to Airflow, I will also share how Scribd plans and executes the migration. Some of the topics that will be covered:

- How to set up a highly available Airflow cluster in AWS using both ECS and EKS with Terraform.
- How to manage Airflow DAGs across multiple git repositories.
- How we manage Airflow variables using a custom Airflow Terraform provider.
- Best practices for monitoring multiple Airflow clusters with Datadog and PagerDuty.
- How we extended Airflow to achieve feature parity with Scribd's in-house orchestration system.
- How to plan and execute non-trivial data pipeline migrations. We transcompiled our internal DSL to Airflow DAGs to simulate what a real run would look like, surfacing performance issues early in the process.
- How we fixed an Airflow performance bottleneck so our giant DAG can be properly rendered in the web UI.

For detailed deep dives on some of the topics mentioned above, please check out our blog post series at https://tech.scribd.com/tag/airflow-series/

[Slides](https://docs.google.com/presentation/d/e/2PACX-1vRb-iH5NX2d7m-rQ7WGc6XlRvRCADwXq2hdjRjRuJ5h7e9ybfoUA13ytxpHgx7JG815fIKEE-QKuRUV/pub?start=false&loop=false&delayms=3000)
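The transcompilation step described above amounts to source generation: translate each task in the internal DSL into an Airflow operator and wire up its dependencies. A hypothetical sketch; the DSL schema (`name`/`command`/`depends_on`) and task names are assumptions for illustration, not Scribd's actual format:

```python
# Hypothetical sketch of DSL-to-DAG transcompilation: render Airflow DAG
# Python source from a small internal-DSL-like task list, so a simulated
# run can surface performance issues before the real migration.

def transcompile(dag_id, tasks):
    """Render Airflow DAG source from a list of task dicts."""
    lines = [
        "from airflow import DAG",
        "from airflow.operators.bash_operator import BashOperator",
        "",
        f"dag = DAG('{dag_id}', schedule_interval='@daily')",
        "",
    ]
    # One operator per DSL task.
    for t in tasks:
        lines.append(
            f"{t['name']} = BashOperator(task_id='{t['name']}', "
            f"bash_command='{t['command']}', dag=dag)"
        )
    # Dependencies become Airflow's >> edges.
    for t in tasks:
        for dep in t.get("depends_on", []):
            lines.append(f"{dep} >> {t['name']}")
    return "\n".join(lines)

dsl_tasks = [
    {"name": "extract", "command": "run_extract.sh"},
    {"name": "load", "command": "run_load.sh", "depends_on": ["extract"]},
]
source = transcompile("example_pipeline", dsl_tasks)
print(source)
```

Generating plain DAG source like this keeps the simulated pipeline loadable by a stock Airflow scheduler, which is what makes an early dry run of all 1,500+ tasks possible.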