
Event

Airflow Summit 2020

2020-07-01 · Airflow Summit

Activities tracked: 5

Airflow Summit 2020 program

Filtering by: Python

Sessions & talks

Showing 1–5 of 5 · Newest first


AIP-31: Airflow functional DAG definition

2020-07-01
session
Gerard Casas Saez (Twitter)

Airflow does not currently have an explicit way to declare the messages passed between tasks in a DAG. XComs are available, but they are hidden inside the operators' execution functions. AIP-31 proposes a way to make this message passing explicit in the DAG file, making it easier to reason about your DAG's behaviour. In this talk, we will explore what other DSLs do for message passing and how that has influenced AIP-31. We will look at the motivations behind explicit message passing, as well as further proposals that can be built on top of it. In addition, we will explore a new way to define custom Python transformations using the proposed task decorator, and how this change may improve the extensibility of Airflow for more experimental ETL use cases.
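
For context, here is a minimal sketch of the explicit message-passing style AIP-31 proposes, written against the task decorator as it later shipped in Airflow 2.x; return values travel between tasks as XComs, visibly, in the DAG file itself:

# Sketch of AIP-31-style explicit message passing (Airflow 2.x TaskFlow API).
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=datetime(2020, 7, 1), catchup=False)
def etl():
    @task
    def extract() -> dict:
        # The return value becomes an XCom, passed explicitly below.
        return {"records": [1, 2, 3]}

    @task
    def transform(payload: dict) -> int:
        return sum(payload["records"])

    @task
    def load(total: int) -> None:
        print(f"total={total}")

    # Message passing is visible in the DAG definition itself.
    load(transform(extract()))

etl_dag = etl()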

Airflow: A beast character in the gaming world

2020-07-01
session
Naresh Yegireddi (PlayStation), Patricio Garza (PlayStation)

Having been a pioneer for the past 25 years, SONY PlayStation has played a vital role in the interactive gaming industry. With over 100 million monthly active users, more than 100 million PS4 consoles sold, and thousands of game development partners across the globe, big-data problems are inevitable. This presentation covers how we scaled Airflow horizontally, which helped us build a stable, scalable and optimal data processing infrastructure powered by Apache Spark, AWS ECS, EC2 and Docker. To process large volumes of data and meet the organization's growing data analytics demands, the data team at PlayStation took the initiative to build an open-source big data processing infrastructure with Apache Spark in Python as the core ETL engine and Apache Airflow as the core workflow management tool for the entire ecosystem. We started with an Airflow application running on a single AWS EC2 instance, supporting a parallelism of 16 with 1 scheduler and 1 worker, and eventually scaled it up to a bigger scheduler and 4 workers, supporting a parallelism of 96, a DAG concurrency of 96 and a worker task concurrency of 24. Containerizing all the services on AWS ECS gave us the ability to scale Airflow horizontally.
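
For reference, the concurrency figures above correspond to standard airflow.cfg settings along these lines (a sketch of the relevant knobs as named in the Airflow 1.10 era, not PlayStation's actual configuration):

[core]
# Maximum task instances running concurrently across the installation
parallelism = 96
# Maximum task instances allowed to run concurrently within a single DAG
dag_concurrency = 96

[celery]
# Maximum tasks a single Celery worker runs at once
worker_concurrency = 24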

Democratised data workflows at scale

2020-07-01
session
Mihail Petkov (Financial Times), Emil Todorov (Financial Times)

Financial Times is increasing its digital revenue by allowing business people to make data-driven decisions. Providing an Airflow-based platform where data engineers, data scientists, BI experts and others can run language-agnostic jobs was a big swing. One of the most successful steps in the platform's development was building our own execution environment, allowing stakeholders to self-deploy jobs without cross-team dependencies, on top of the unlimited scale of Kubernetes. In this talk we share how we have integrated and extended Airflow at Financial Times. The main topics we will cover include:

- Providing team-level security isolation
- Removing cross-team dependencies
- Creating an execution environment for independently creating and deploying R, Python, Java, Spark, etc. jobs
- Reducing latency when sharing data between task instances
- Integrating all these features on top of Kubernetes
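
As a rough sketch of the language-agnostic pattern (hypothetical images and namespaces, not Financial Times' actual setup), each team's job can run as its own container via the stock KubernetesPodOperator:

# Hypothetical sketch: language-agnostic jobs as per-team Kubernetes pods.
from datetime import datetime
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG("polyglot_jobs", start_date=datetime(2020, 7, 1),
         schedule_interval=None, catchup=False) as dag:
    r_job = KubernetesPodOperator(
        task_id="r_model",
        name="r-model",
        image="registry.example.com/team-a/r-model:latest",  # hypothetical image
        namespace="team-a",  # per-team namespace gives security isolation
    )
    java_job = KubernetesPodOperator(
        task_id="java_etl",
        name="java-etl",
        image="registry.example.com/team-b/java-etl:latest",  # hypothetical image
        namespace="team-b",
    )
    r_job >> java_job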

Demo: Reducing the lines, a visual DAG editor

2020-07-01
session
Traey Hatch (Onica)

In this talk I will introduce a DAG authoring and editing tool for Airflow that we have built. Installed as a plugin, this tool allows users to author DAGs by composing existing operators and hooks with virtually no Python experience. We walk through a demo of DAG authorship and deployment, and spend time reviewing the underlying open-source standards used and the general approach taken to develop the code. In addition to allowing DAGs to be created in a visual editor, the underlying tech enables Airflow DAGs to be described declaratively in YAML or JSON. DAGs described this way can be saved in backing databases instead of Python files.
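
To illustrate the idea (a hypothetical schema and loader, not the plugin's actual code), a declarative description like this can be turned into a regular DAG at parse time:

# Hypothetical sketch: building an Airflow DAG from a YAML description.
from datetime import datetime
import yaml
from airflow import DAG
from airflow.operators.bash import BashOperator

SPEC = """
dag_id: daily_report
tasks:
  - id: extract
    bash: "echo extract"
  - id: load
    bash: "echo load"
    upstream: [extract]
"""

spec = yaml.safe_load(SPEC)
with DAG(spec["dag_id"], start_date=datetime(2020, 7, 1),
         schedule_interval=None, catchup=False) as dag:
    tasks = {
        t["id"]: BashOperator(task_id=t["id"], bash_command=t["bash"])
        for t in spec["tasks"]
    }
    # Wire up dependencies declared in the spec.
    for t in spec["tasks"]:
        for upstream_id in t.get("upstream", []):
            tasks[upstream_id] >> tasks[t["id"]]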

From Zero to Airflow: bootstrapping a ML platform

2020-07-01
session
Noam Elfanbaum (Bluevine)

At Bluevine we use Airflow to drive our ML platform. In this talk, Noam presents the challenges and gains of transitioning from a single server running Python scripts with cron to a full-blown Airflow setup. Some of the points covered are:

- Supporting multiple Python versions
- Event-driven DAGs
- Airflow performance issues and how we circumvented them
- Building Airflow plugins to enhance observability
- Monitoring Airflow using Grafana
- CI for Airflow DAGs (super useful!)
- Patching the Airflow scheduler

Slides
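
As one way to handle the multiple-Python-versions point (a sketch using the stock PythonVirtualenvOperator, not necessarily Bluevine's actual approach), each task can run under its own interpreter:

# Sketch: tasks under different Python versions via PythonVirtualenvOperator.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def report_version():
    # Runs inside the virtualenv, so imports must live in the callable.
    import sys
    print(f"running under Python {sys.version}")

with DAG("multi_python", start_date=datetime(2020, 7, 1),
         schedule_interval=None, catchup=False) as dag:
    legacy = PythonVirtualenvOperator(
        task_id="legacy_model",
        python_callable=report_version,
        python_version="3.6",  # that interpreter must exist on the worker
    )
    current = PythonVirtualenvOperator(
        task_id="current_model",
        python_callable=report_version,
        python_version="3.8",
    )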