Airflow does not currently have an explicit way to declare messages passed between tasks in a DAG. XComs are available, but they are hidden inside the execution functions of operators. AIP-31 proposes a way to make this message passing explicit in the DAG file, making it easier to reason about your DAG's behaviour. In this talk, we will explore what other DSLs are doing for message passing and how that has influenced AIP-31. We will explore the motivations behind explicit message passing, as well as further proposals that can be built on top of it. In addition, we will explore a new way to define custom Python transformations using the proposed task decorator, and how this change may improve the extensibility of Airflow for more experimental ETL use cases.
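The flavour of explicit message passing AIP-31 describes (which evolved into the TaskFlow API in Airflow 2.0) can be sketched as follows. This is an illustrative example, not code from the talk; if Airflow is not installed, the sketch falls back to plain functions so the data flow still runs, and the example values are made up.

```python
# Sketch of AIP-31-style explicit message passing with a task decorator.
try:
    from airflow.decorators import task  # TaskFlow decorator, Airflow 2+
except ImportError:
    def task(fn):  # minimal stand-in so the sketch runs without Airflow
        return fn

@task
def extract() -> dict:
    # The return value would be pushed to XCom automatically.
    return {"order_id": 42, "amount": 100.0}

@task
def transform(order: dict) -> float:
    return order["amount"] * 2  # illustrative transformation

@task
def load(total: float) -> float:
    print(f"loading total={total:.2f}")
    return total

# The message passing between tasks is explicit in the DAG file:
result = load(transform(extract()))
```

Contrast this with classic operators, where the same data flow would be buried in `xcom_push`/`xcom_pull` calls inside each operator's `execute` method.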
A pioneer for the past 25 years, Sony PlayStation has played a vital role in the interactive gaming industry. With over 100 million monthly active users, more than 100 million PS4 console sales, and thousands of game development partners across the globe, big-data problems are inevitable. This presentation describes how we scaled Airflow horizontally, which has helped us build a stable, scalable, and optimized data processing infrastructure powered by Apache Spark, AWS ECS, EC2, and Docker. To process large volumes of data and meet the organization's growing analytics demands, the data team at PlayStation took the initiative to build an open source big data processing infrastructure, with Apache Spark in Python as the core ETL engine and Apache Airflow as the workflow management tool for the entire ecosystem. We started with an Airflow application running on a single AWS EC2 instance, supporting a parallelism of 16 with one scheduler and one worker, and eventually scaled to a larger scheduler with 4 workers, supporting a parallelism of 96, a DAG concurrency of 96, and a worker task concurrency of 24. Containerizing all services on AWS ECS gave us the ability to scale Airflow horizontally.
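The scaling figures quoted above correspond roughly to the following `airflow.cfg` settings (option names as in Airflow 1.x, which this deployment predates 2.0; the values are taken from the abstract, the section layout is a sketch):

```ini
[core]
# Maximum task instances running concurrently across the whole installation
parallelism = 96
# Maximum concurrent task instances per DAG
dag_concurrency = 96

[celery]
# Task slots per Celery worker; 4 workers x 24 slots = 96 total
worker_concurrency = 24
```

Note that `parallelism` caps the whole deployment, so adding workers beyond 4 x 24 slots would have no effect without also raising it.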
In search of a better, modern, simple method of managing ETL processes and merging them with various AI and ML tasks, we landed on Airflow. We envisioned a user-friendly interface that could leverage dynamic DAGs and reusable components to build an ETL tool requiring virtually no training. We built several template DAGs and Airflow connectors to typical data sources, such as SQL Server, then built a modern interface on top that provides ETL build, scheduling, and execution capabilities. Since Airflow is designed for task orchestration, we expanded our infrastructure to use Kubernetes and Docker for elastic computing. Key to our solution is the ability to create ETLs using only open source tools, while executing on par with or faster than commercial solutions, with an interface so simple that ETLs can be created in seconds.
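The template-DAG pattern behind this kind of tool can be sketched without Airflow itself: one parameterized pipeline definition, instantiated once per configured source. Everything here (`SOURCES`, `build_pipeline`, the connection IDs) is hypothetical, not the speakers' actual code; in a real Airflow DAG file each generated object would be a `DAG` registered via `globals()`.

```python
# Illustrative sketch of dynamic DAG generation from a config list.
SOURCES = [
    {"name": "orders", "conn_id": "sqlserver_default"},
    {"name": "customers", "conn_id": "sqlserver_default"},
]

def build_pipeline(source: dict) -> dict:
    """Template: the same extract/validate/load steps for every source."""
    return {
        "dag_id": f"etl_{source['name']}",
        "tasks": ["extract", "validate", "load"],
        "conn_id": source["conn_id"],
    }

# One pipeline per source; in Airflow this loop would end with
#   globals()[dag.dag_id] = dag
pipelines = {p["dag_id"]: p for p in (build_pipeline(s) for s in SOURCES)}
```

Because the pipelines come from configuration rather than hand-written DAG files, a UI can add a new source by appending one entry, which is what makes "ETLs created in seconds" plausible.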
To improve automation of data pipelines, I propose a universal approach to the ELT pipeline that optimizes for data integrity, extensibility, and speed to delivery. Templating ETLs is challenging: creating and maintaining data pipelines in production requires hard work to manage bugs in code and bad data. I would like to propose a data pipeline pattern that simplifies building pipelines while optimizing for data integrity and observability. The workflow is built using open source tools and standards such as Apache Airflow, Singer, Great Expectations, and dbt. Goals: make ELT simple and fast to implement; validate your assumptions about the data before making it available for use; allow analysts and data scientists to add pain-free contributions to ELT using SQL; generate data documentation and failure logs for quick recovery, and fix outages in your pipeline. Target audience: approachable for developers of any level; novice data professionals interested in starting an ELT workflow and learning about the different tools of the ecosystem; intermediate+ developers interested in supercharging their pipeline with the Write-Audit-Publish pattern and reducing pipeline debt.
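The Write-Audit-Publish pattern mentioned above can be reduced to a small sketch. This is not the speaker's implementation: in-memory dicts stand in for staging and production tables, and a Great-Expectations-style validation is reduced to a plain function.

```python
# Minimal sketch of the Write-Audit-Publish (WAP) pattern.
def write(staging: dict, rows: list) -> None:
    # Land raw data in a staging area, not in production.
    staging["rows"] = rows

def audit(staging: dict) -> bool:
    # Validate assumptions before exposing data to consumers,
    # e.g. "amounts are never negative".
    return all(r.get("amount", 0) >= 0 for r in staging["rows"])

def publish(staging: dict, prod: dict) -> None:
    # Promote audited data; consumers only ever see data that passed.
    prod["rows"] = staging["rows"]

staging, prod = {}, {}
write(staging, [{"amount": 10}, {"amount": 5}])
if audit(staging):
    publish(staging, prod)
```

If the audit fails, production is untouched and the failure log points at the staged batch, which is what makes recovery "pain-free" relative to fixing already-published bad data.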
In this talk, colleagues from Airbnb, Twitter, and Lyft share details about how they are using Apache Airflow to power their data pipelines. Slides: Tao Feng (Lyft), Dan Davydov (Twitter).