Airflow does not currently have an explicit way to declare messages passed between tasks in a DAG. XComs are available, but they are hidden inside the execution functions of operators. AIP-31 proposes a way to make this message passing explicit in the DAG file, making it easier to reason about your DAG's behaviour. In this talk, we will explore what other DSLs are doing for message passing and how that has influenced AIP-31. We will explore the motivations behind explicit message passing, as well as further proposals that can be built on top of it. In addition, we will explore a new way to define custom Python transformations using the proposed task decorator, and how this change may improve the extensibility of Airflow for more experimental ETL use cases.
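The flavour of explicit message passing AIP-31 describes (which evolved into the TaskFlow API in Airflow 2.0) can be sketched as follows. This is an illustrative example, not code from the talk; if Airflow is not installed, the sketch falls back to plain functions so the data flow still runs, and the example values are made up.

```python
# Sketch of AIP-31-style explicit message passing with a task decorator.
try:
    from airflow.decorators import task  # TaskFlow decorator, Airflow 2+
except ImportError:
    def task(fn):  # minimal stand-in so the sketch runs without Airflow
        return fn

@task
def extract() -> dict:
    # The return value would be pushed to XCom automatically.
    return {"order_id": 42, "amount": 100.0}

@task
def transform(order: dict) -> float:
    return order["amount"] * 2  # illustrative transformation

@task
def load(total: float) -> float:
    print(f"loading total={total:.2f}")
    return total

# The message passing between tasks is explicit in the DAG file:
result = load(transform(extract()))
```

Contrast this with classic operators, where the same data flow would be buried in `xcom_push`/`xcom_pull` calls inside each operator's `execute` method.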
A pioneer for the past 25 years, Sony PlayStation has played a vital role in the interactive gaming industry. With over 100 million monthly active users, more than 100 million PS4 console sales, and thousands of game development partners across the globe, big-data problems are inevitable. This presentation describes how we scaled Airflow horizontally, which has helped us build a stable, scalable, and optimized data processing infrastructure powered by Apache Spark, AWS ECS, EC2, and Docker. To process large volumes of data and meet the organization's growing analytics demands, the data team at PlayStation took the initiative to build an open source big data processing infrastructure, with Apache Spark in Python as the core ETL engine and Apache Airflow as the workflow management tool for the entire ecosystem. We started with an Airflow application running on a single AWS EC2 instance, supporting a parallelism of 16 with one scheduler and one worker, and eventually scaled to a larger scheduler with 4 workers, supporting a parallelism of 96, a DAG concurrency of 96, and a worker task concurrency of 24. Containerizing all services on AWS ECS gave us the ability to scale Airflow horizontally.
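The scaling figures quoted above correspond roughly to the following `airflow.cfg` settings (option names as in Airflow 1.x, which this deployment predates 2.0; the values are taken from the abstract, the section layout is a sketch):

```ini
[core]
# Maximum task instances running concurrently across the whole installation
parallelism = 96
# Maximum concurrent task instances per DAG
dag_concurrency = 96

[celery]
# Task slots per Celery worker; 4 workers x 24 slots = 96 total
worker_concurrency = 24
```

Note that `parallelism` caps the whole deployment, so adding workers beyond 4 x 24 slots would have no effect without also raising it.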
In search of a better, modern, simple method of managing ETL processes and merging them with various AI and ML tasks, we landed on Airflow. We envisioned a user-friendly interface that could leverage dynamic DAGs and reusable components to build an ETL tool requiring virtually no training. We built several template DAGs and Airflow connectors to typical data sources, such as SQL Server, then built a modern interface on top that provides ETL build, scheduling, and execution capabilities. Since Airflow is designed for task orchestration, we expanded our infrastructure to use Kubernetes and Docker for elastic computing. Key to our solution is the ability to create ETLs using only open source tools, while executing on par with or faster than commercial solutions, with an interface so simple that ETLs can be created in seconds.
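The template-DAG pattern behind this kind of tool can be sketched without Airflow itself: one parameterized pipeline definition, instantiated once per configured source. Everything here (`SOURCES`, `build_pipeline`, the connection IDs) is hypothetical, not the speakers' actual code; in a real Airflow DAG file each generated object would be a `DAG` registered via `globals()`.

```python
# Illustrative sketch of dynamic DAG generation from a config list.
SOURCES = [
    {"name": "orders", "conn_id": "sqlserver_default"},
    {"name": "customers", "conn_id": "sqlserver_default"},
]

def build_pipeline(source: dict) -> dict:
    """Template: the same extract/validate/load steps for every source."""
    return {
        "dag_id": f"etl_{source['name']}",
        "tasks": ["extract", "validate", "load"],
        "conn_id": source["conn_id"],
    }

# One pipeline per source; in Airflow this loop would end with
#   globals()[dag.dag_id] = dag
pipelines = {p["dag_id"]: p for p in (build_pipeline(s) for s in SOURCES)}
```

Because the pipelines come from configuration rather than hand-written DAG files, a UI can add a new source by appending one entry, which is what makes "ETLs created in seconds" plausible.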
To improve automation of data pipelines, I propose a universal approach to the ELT pipeline that optimizes for data integrity, extensibility, and speed to delivery. Templating ETLs is challenging: creating and maintaining data pipelines in production requires hard work to manage bugs in code and bad data. I would like to propose a data pipeline pattern that simplifies building pipelines while optimizing for data integrity and observability. The workflow is built using open source tools and standards such as Apache Airflow, Singer, Great Expectations, and dbt. Goals: make ELT simple and fast to implement; validate your assumptions about the data before making it available for use; allow analysts and data scientists to add pain-free contributions to ELT using SQL; generate data documentation and failure logs for quick recovery, and fix outages in your pipeline. Target audience: approachable for developers of any level; novice data professionals interested in starting an ELT workflow and learning about the different tools of the ecosystem; intermediate+ developers interested in supercharging their pipeline with the Write-Audit-Publish pattern and reducing pipeline debt.
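The Write-Audit-Publish pattern mentioned above can be reduced to a small sketch. This is not the speaker's implementation: in-memory dicts stand in for staging and production tables, and a Great-Expectations-style validation is reduced to a plain function.

```python
# Minimal sketch of the Write-Audit-Publish (WAP) pattern.
def write(staging: dict, rows: list) -> None:
    # Land raw data in a staging area, not in production.
    staging["rows"] = rows

def audit(staging: dict) -> bool:
    # Validate assumptions before exposing data to consumers,
    # e.g. "amounts are never negative".
    return all(r.get("amount", 0) >= 0 for r in staging["rows"])

def publish(staging: dict, prod: dict) -> None:
    # Promote audited data; consumers only ever see data that passed.
    prod["rows"] = staging["rows"]

staging, prod = {}, {}
write(staging, [{"amount": 10}, {"amount": 5}])
if audit(staging):
    publish(staging, prod)
```

If the audit fails, production is untouched and the failure log points at the staged batch, which is what makes recovery "pain-free" relative to fixing already-published bad data.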
In this talk, colleagues from Airbnb, Twitter, and Lyft share details about how they are using Apache Airflow to power their data pipelines. Slides: Tao Feng (Lyft), Dan Davydov (Twitter).