talk-data.com

Event

Airflow Summit 2020
2020-07-01

Activities tracked: 3

Airflow Summit 2020 program

Filtering by: AWS

Sessions & talks


Airflow: A beast character in the gaming world

2020-07-01
session
Naresh Yegireddi (PlayStation), Patricio Garza (PlayStation)

A pioneer for the past 25 years, Sony PlayStation has played a vital role in the interactive gaming industry. With over 100 million monthly active users, more than 100 million PS4 consoles sold, and thousands of game development partners across the globe, big-data problems are inevitable. This presentation covers how we scaled Airflow horizontally, which helped us build a stable, scalable, and efficient data processing infrastructure powered by Apache Spark, AWS ECS, EC2, and Docker. To process large volumes of data and meet the organization's growing analytics demands, the data team at PlayStation took the initiative to build an open-source big-data processing infrastructure, with Apache Spark in Python as the core ETL engine and Apache Airflow as the core workflow management tool for the entire ecosystem. We started with an Airflow application running on a single AWS EC2 instance, supporting a parallelism of 16 with 1 scheduler and 1 worker, and eventually scaled it to a bigger scheduler along with 4 workers to support a parallelism of 96, DAG concurrency of 96, and a worker task concurrency of 24. Containerizing all the services on AWS ECS gave us the ability to scale Airflow horizontally.
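The scaled-out settings the abstract mentions correspond to a handful of values in `airflow.cfg`; a sketch, assuming the Airflow 1.10-era Celery deployment described in the talk:

```ini
[core]
# Maximum number of task instances that may run concurrently across the cluster
parallelism = 96
# Maximum number of task instances allowed to run within a single DAG
dag_concurrency = 96

[celery]
# Number of tasks each Celery worker process can execute at once
worker_concurrency = 24
```

With 4 workers at a concurrency of 24 each, the worker pool matches the cluster-wide parallelism of 96.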

Migrating Airflow-based Spark jobs to Kubernetes - the native way

2020-07-01
session
Roi Teveth (Nielsen Identity Engine), Itai Yaffe (Nielsen Identity Engine)

At Nielsen Identity Engine, we use Spark to process tens of TBs of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day. In this talk, we'll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.
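The GCP Spark-on-K8s operator drives Spark jobs through a `SparkApplication` custom resource, which the native Airflow integration submits to the cluster. A minimal sketch of such a manifest; the job name, namespace, image, entry point, and sizing below are illustrative assumptions, not values from the talk:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-etl        # hypothetical job name
  namespace: spark-jobs    # hypothetical namespace
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v2.4.5             # example image from the operator's docs
  mainClass: com.example.Etl                            # hypothetical entry point
  mainApplicationFile: local:///opt/spark/jars/etl.jar  # hypothetical jar path
  sparkVersion: "2.4.5"
  driver:
    cores: 1
    memory: 2g
  executor:
    instances: 4
    cores: 2
    memory: 4g
```

Because the manifest fully describes the Spark job, an Airflow DAG only needs a task that submits it, which is what keeps the DAG changes minimal.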

Teaching an old DAG new tricks

2020-07-01
session

Scribd is migrating its data pipeline from an in-house system to Airflow. It is one big data pipeline consisting of more than 1,500 tasks. In this talk, I would like to share a couple of best practices for setting up a cloud-native Airflow deployment in AWS. For those interested in migrating a non-trivial data pipeline to Airflow, I will also share how Scribd plans and executes the migration. Some of the topics that will be covered:

- How to set up a highly available Airflow cluster in AWS using both ECS and EKS with Terraform.
- How to manage Airflow DAGs across multiple git repositories.
- How we manage Airflow variables using a custom Airflow Terraform provider.
- Best practices for monitoring multiple Airflow clusters with Datadog and PagerDuty.
- How we extended Airflow to achieve feature parity with Scribd's in-house orchestration system.
- How to plan and execute non-trivial data pipeline migrations. We transcompiled our internal DSL to Airflow DAGs to simulate what a real run would look like, surfacing performance issues early in the process.
- How we fixed an Airflow performance bottleneck so our giant DAG can be properly rendered in the web UI.

For detailed deep dives on some of the topics mentioned above, please check out our blog post series at https://tech.scribd.com/tag/airflow-series/

[Slides](https://docs.google.com/presentation/d/e/2PACX-1vRb-iH5NX2d7m-rQ7WGc6XlRvRCADwXq2hdjRjRuJ5h7e9ybfoUA13ytxpHgx7JG815fIKEE-QKuRUV/pub?start=false&loop=false&delayms=3000)
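The transcompilation step described above amounts to source generation: translate each task in the internal DSL into an Airflow operator and wire up its dependencies. A hypothetical sketch; the DSL schema (`name`/`command`/`depends_on`) and task names are assumptions for illustration, not Scribd's actual format:

```python
# Hypothetical sketch of DSL-to-DAG transcompilation: render Airflow DAG
# Python source from a small internal-DSL-like task list, so a simulated
# run can surface performance issues before the real migration.

def transcompile(dag_id, tasks):
    """Render Airflow DAG source from a list of task dicts."""
    lines = [
        "from airflow import DAG",
        "from airflow.operators.bash_operator import BashOperator",
        "",
        f"dag = DAG('{dag_id}', schedule_interval='@daily')",
        "",
    ]
    # One operator per DSL task.
    for t in tasks:
        lines.append(
            f"{t['name']} = BashOperator(task_id='{t['name']}', "
            f"bash_command='{t['command']}', dag=dag)"
        )
    # Dependencies become Airflow's >> edges.
    for t in tasks:
        for dep in t.get("depends_on", []):
            lines.append(f"{dep} >> {t['name']}")
    return "\n".join(lines)

dsl_tasks = [
    {"name": "extract", "command": "run_extract.sh"},
    {"name": "load", "command": "run_load.sh", "depends_on": ["extract"]},
]
source = transcompile("example_pipeline", dsl_tasks)
print(source)
```

Generating plain DAG source like this keeps the simulated pipeline loadable by a stock Airflow scheduler, which is what makes an early dry run of all 1,500+ tasks possible.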