From not knowing Python (let alone Airflow) and submitting a first PR that fixed a typo, to becoming an Airflow Committer, PMC Member, Release Manager, and the #1 Committer this year, this talk walks through Kaxil’s journey in the Airflow world. The second part of the talk explains:
- How you can also start your OSS journey by contributing to Airflow
- Expanding familiarity with different parts of the Airflow codebase
- Committing regularly and steadily to become an Airflow Committer (including the current guidelines for becoming a Committer)
- The different mediums of communication (dev list, users list, Slack channel, GitHub Discussions, etc.)
This talk aims to share how Airflow’s secrets backend works, and how users can create custom secrets backends for their specific use cases and technology stacks.
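For readers unfamiliar with the mechanism this talk covers: a secrets backend is a class that Airflow loads via the [secrets] section of airflow.cfg and consults before environment variables and the metadatabase. Below is a minimal, hedged sketch of what such a backend could look like; the InternalVaultBackend name, the key scheme, and the lookup call are illustrative assumptions, and older Airflow releases use get_conn_uri where newer ones use get_conn_value.

```python
# Minimal sketch of a custom secrets backend, assuming a hypothetical internal
# secrets service; only the subclassing pattern and the [secrets] config hook
# are Airflow-specific, everything else is a placeholder.
from typing import Optional

from airflow.secrets import BaseSecretsBackend


class InternalVaultBackend(BaseSecretsBackend):
    """Resolve Airflow connections and variables from a company-internal secrets store."""

    def __init__(self, prefix: str = "airflow"):
        super().__init__()
        self.prefix = prefix

    def get_conn_value(self, conn_id: str) -> Optional[str]:
        # Should return a connection URI (or JSON) string, or None if not found.
        return self._lookup(f"{self.prefix}/connections/{conn_id}")

    def get_variable(self, key: str) -> Optional[str]:
        return self._lookup(f"{self.prefix}/variables/{key}")

    def _lookup(self, path: str) -> Optional[str]:
        # Placeholder for the call into whatever secrets service you actually use.
        raise NotImplementedError("Replace with a real client call")
```

Such a backend would then be enabled with something like backend = my_company.secrets.InternalVaultBackend and backend_kwargs = {"prefix": "airflow"} under the [secrets] section of airflow.cfg.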
This presentation will detail how Elyra creates Jupyter Notebook, Python, and R script-based pipelines without having to leave your web browser. The goal of using Elyra is to help construct data pipelines by surfacing concepts and patterns common in pipeline construction into a familiar, easy-to-navigate interface for data scientists and engineers so they can create pipelines on their own. In Elyra’s Pipeline Editor UI, portions of Apache Airflow’s domain language are surfaced to the user and either made transparent or made understandable through tooltips or helpful notes shown in the proper context during pipeline construction. With these features, Elyra lets you rapidly prototype data workflows without the need to know or write any pipeline code. Lastly, we will look at the features we have planned on our roadmap for Airflow, including more robust Kubernetes integration and support for runtime-specific components/operators. Project Home: https://github.com/elyra-ai/elyra
In Apache Airflow, XCom is the default mechanism for passing data between tasks in a DAG. In practice, this has been restricted to small data elements, since the XCom data is persisted in the Airflow metadatabase and is constrained by database and performance limitations. With the new TaskFlow API introduced in Airflow 2.0, it is seamless to pass data between tasks and the use of XCom is invisible. However, the ability to pass data is restricted to a relatively small set of data types that can be natively converted to JSON. This tutorial describes how to go beyond these limitations by developing and deploying a custom XCom backend within Airflow to enable the sharing of large and varied data elements, such as Pandas data frames, between tasks in a data pipeline, using cloud storage such as Google Cloud Storage or Amazon S3.
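To make that concrete, here is a rough sketch of the kind of custom XCom backend the tutorial describes, using Amazon S3 as the offload target; the bucket name, key scheme, and pickle-based serialization are illustrative assumptions, and serialize_value gained extra keyword arguments in later Airflow releases.

```python
# Sketch of a custom XCom backend that offloads large values (e.g. Pandas DataFrames)
# to S3 and stores only a small reference string in the Airflow metadatabase.
import pickle
import uuid

from airflow.models.xcom import BaseXCom
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET = "my-xcom-bucket"  # assumption: an existing bucket reachable by all workers
PREFIX = "s3-xcom://"


class S3XComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value, **kwargs):
        hook = S3Hook()
        key = f"xcom/{uuid.uuid4()}.pkl"
        hook.load_bytes(pickle.dumps(value), key=key, bucket_name=BUCKET, replace=True)
        # Only the reference goes into the metadatabase.
        return BaseXCom.serialize_value(PREFIX + key)

    @staticmethod
    def deserialize_value(result):
        value = BaseXCom.deserialize_value(result)
        if isinstance(value, str) and value.startswith(PREFIX):
            hook = S3Hook()
            obj = hook.get_key(key=value[len(PREFIX):], bucket_name=BUCKET)
            return pickle.loads(obj.get()["Body"].read())
        return value
```

Airflow is pointed at such a class through the xcom_backend option in the [core] section of airflow.cfg (for example xcom_backend = my_company.xcom.S3XComBackend).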
We will describe how we were able to build a system in Airflow for MySQL to Redshift ETL pipelines defined in pure Python using dataclasses. These dataclasses are then used to dynamically generate DAGs depending on pipeline type. This setup allows us to implement robust testing, validation, alerts, and documentation for our pipelines. We will also describe the performance improvements we achieved by upgrading to Airflow 2.0.
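The abstract does not include code, but the dataclass-driven pattern it describes might look roughly like the hypothetical sketch below; the Pipeline fields and the placeholder operators are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of dataclass-driven DAG generation; the real system presumably
# wires MySQL-to-Redshift operators, validation, alerting, and docs into each DAG.
from dataclasses import dataclass, field
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator


@dataclass
class Pipeline:
    name: str
    source_table: str
    target_table: str
    schedule: str = "@daily"
    tags: list = field(default_factory=lambda: ["mysql-to-redshift"])


PIPELINES = [
    Pipeline("orders", "shop.orders", "analytics.orders"),
    Pipeline("users", "shop.users", "analytics.users", schedule="@hourly"),
]

for p in PIPELINES:
    with DAG(
        dag_id=f"etl_{p.name}",
        start_date=datetime(2021, 1, 1),
        schedule_interval=p.schedule,
        catchup=False,
        tags=p.tags,
    ) as dag:
        extract = DummyOperator(task_id="extract")  # stand-in for the MySQL extract
        load = DummyOperator(task_id="load")        # stand-in for the Redshift load
        extract >> load

    # Expose each generated DAG at module level so the scheduler discovers it.
    globals()[dag.dag_id] = dag
```

Because the pipeline definitions are plain dataclasses, they can be unit-tested, validated, and documented independently of Airflow itself, which is the kind of robustness the talk highlights.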
If you manage a lot of data, and you’re attending this summit, you likely rely on Apache Airflow to do a lot of the heavy lifting. Like any powerful tool, Apache Airflow allows you to accomplish what you couldn’t before… but also creates new challenges. As DAGs pile up, complexity layers on top of complexity and it becomes hard to grasp how a failed or delayed DAG will affect everything downstream. In this session we will provide a crash course on OpenLineage, an open platform for metadata management and data lineage analysis. We’ll show how capturing metadata with OpenLineage can help you maintain inter-DAG dependencies, capture data on historical runs, and minimize data quality issues.
The scheduler is the core of Airflow, and it’s a complex beast. In this session we will go through the scheduler in some detail: how it works, what the communication paths are, and what processing is done where.
You might have heard some recent news about ransomware attacks on many companies. Quite recently the U.S. Department of Justice elevated the priority of ransomware investigations to the same level as terrorism. Security aspects of running software and so-called “supply-chain attacks” have certainly made the press recently. You might also have read about a security researcher who made USD 13,000 in bounties by finding and contacting companies that were running old, unpatched versions of Airflow - even though the ASF security process worked well and the Airflow PMC had fixed those issues long ago. If any of this rings a bell, then this session is for you. In this session Dolev (a security expert and researcher who recently submitted security issues to Airflow), Ash and Jarek (Airflow PMC members) will discuss the state of security, best practices for keeping your Airflow secure, and why it is important. The discussion will be moderated by Tomasz Urbaszek, Airflow PMC member. You can get a glimpse of what they will talk about through this blog post.
In recent years, the bioinformatics world has seen an explosion in genomic analysis as gene sequencing technologies have become exponentially cheaper. Tests that previously would have cost tens of thousands of dollars will soon run at pennies per sequence. This glut of data has exposed a notable bottleneck in the current suite of technologies available to bioinformaticians. At Drift Biotechnologies, we use Apache Airflow to transition traditionally on-premise large scale data and deep learning workflows for bioinformatics to the cloud, with an emphasis on workflows and data from next generation sequencing technologies.
Multi-tenant Airflow instances can help an organization save costs. This talk will walk through how we dynamically assigned roles to users based on their Active Directory groups, so that teams would have access in the UI to the DAGs they created on our multi-tenant Airflow instance. To achieve this, we created our own custom AirflowSecurityManager class that ultimately ties LDAP and RBAC together.
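For context on the wiring the abstract alludes to, below is a hedged sketch of a webserver_config.py that maps Active Directory groups to Airflow RBAC roles over LDAP and plugs in a custom security manager; the server address, group DNs, and role names are placeholders, and the dynamic per-team permission logic the talk describes would live inside the subclass.

```python
# Sketch of webserver_config.py: LDAP auth plus a group-to-role mapping and a
# custom security manager hook. All concrete values below are placeholders.
from flask_appbuilder.security.manager import AUTH_LDAP

from airflow.www.security import AirflowSecurityManager

AUTH_TYPE = AUTH_LDAP
AUTH_LDAP_SERVER = "ldaps://ad.example.com"     # placeholder AD endpoint
AUTH_LDAP_SEARCH = "dc=example,dc=com"          # placeholder search base
AUTH_USER_REGISTRATION = True                   # auto-create users on first login
AUTH_ROLES_SYNC_AT_LOGIN = True                 # re-sync roles from AD at every login
AUTH_ROLES_MAPPING = {
    "cn=data-eng,ou=groups,dc=example,dc=com": ["TeamA_DAGs"],    # placeholder roles
    "cn=analytics,ou=groups,dc=example,dc=com": ["TeamB_DAGs"],
}


class TeamScopedSecurityManager(AirflowSecurityManager):
    """Hypothetical hook point where per-team DAG-level permissions would be assigned."""


SECURITY_MANAGER_CLASS = TeamScopedSecurityManager
```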
The Airflow scheduler uses DAG definitions to monitor the state of tasks in the metadata database and triggers the task instances whose dependencies have been met; in other words, scheduling is driven by the state of dependencies. The idea of event-based scheduling is to let operators send events to the scheduler to trigger a scheduling action, such as starting, stopping, or restarting jobs. Event-based scheduling allows potential support for richer scheduling semantics, such as periodic execution and manual triggering at per-operator granularity.
We’ve all heard the phrase “data is the new oil.” But imagine a world where this analogy is literal, where problems in the flow of data - delays, low quality, high volatility - could bring down whole economies. When data is the new oil, with people and businesses similarly reliant on it, how do you avoid the fires, spills, and crises? As data products become central to companies’ bottom line, data engineering teams need to create higher standards for the availability, completeness, and fidelity of their data. In this session we’ll demonstrate how Databand helps organizations guarantee the health of their Airflow pipelines. Databand is a data pipeline observability system that monitors SLAs and data quality issues, and proactively alerts users on problems to avoid data downtime. The session will be led by Josh Benamram, CEO and co-founder of Databand.ai. Josh will be joined by Vinoo Ganesh, an experienced software engineer, system architect, and current CTO of Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective. Join to see how Databand.ai can help you create stable, reliable pipelines that your business can depend on!
In this talk, we present Viewflow, an open-source Airflow-based framework that allows data scientists to create materialized views in SQL, R, and Python without writing Airflow code. We will start by explaining the problem Viewflow solves: writing and maintaining complex Airflow code instead of focusing on data science. Then we will see how Viewflow solves that problem. We will continue by showing how to use Viewflow with several real-world examples. Finally, we will see what the upcoming features of Viewflow are! Resources: Announcement blog post: https://medium.com/datacamp-engineering/viewflow-fe07353fa068 GitHub repo: https://github.com/datacamp/viewflow
Digital transformation, application modernization, and data platform migration to the cloud are key initiatives in most enterprises today. These initiatives are stressing the scheduling and automation tools in these enterprises to the point that many users are looking for better solutions. A survey revealed that 88% of users believe that their business will benefit from an improved automation strategy across technology and business. Airflow has an excellent opportunity to capture mindshare and emerge as the leading solution here. At Unravel, we are seeing the trend where many of our enterprise customers are at various stages of migrating to Airflow from their enterprise schedulers or ETL/ELT orchestration tools like Autosys, Informatica, Oozie, Pentaho, and Tidal. In this talk, we will share lessons learned and best practices from the entire pipeline migration lifecycle, including:
- The evaluation process that led to picking Airflow, including certain aspects where Airflow can do better
- The challenges in discovering and understanding all components and dependencies that need to be considered in the migration
- The challenges arising during the pipeline code and data migration, especially in getting single-pane-of-glass and apples-to-apples views to track the progress of the migration
- The challenges in ensuring that the pipelines that have been migrated to Airflow perform and scale on par with or better than what existed previously
As a follow-up to https://airflowsummit.org/sessions/teaching-old-dag-new-tricks/, in this talk we would like to share a happy-ending story of how Scribd fully migrated its data platform to the cloud and to Airflow 2.0. We will talk about the data validation tools and task trigger customizations the team built to smooth out the transition. We will share how we completed the Airflow 2.0 migration, which started from an unsupported MySQL version, along with metrics that show why everyone should perform the upgrade. Lastly, we will discuss how large-scale backfills (10 years’ worth of runs) are managed and automated at Scribd.
An informal and fun chat about the journey that we took and the decisions that we made in building Amazon Managed Workflows for Apache Airflow. We will talk about:
- Our first tryst with understanding Airflow
- Talking to Amazon data engineers and how they ran workflows at scale
- Key design decisions and the reasons behind them
- The road ahead, and what we dream about for the future of Apache Airflow
- Open-source tenets and commitment from the team
We will leave time at the end for a short AMA/questions.
At Fivetran, we are seeing many organizations adopt the Modern Data Stack to suit the breadth of their data needs. However, as incoming data sources begin to scale, it can be hard to manage and maintain the environment, with more time spent repairing and reengineering old data pipelines than building new ones. This talk will introduce a number of new Airflow providers, including airflow-provider-fivetran, and discuss some of the benefits and considerations we are seeing data engineers, data analysts, and data scientists experience in adopting them.
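As a rough illustration of what using the Fivetran provider can look like in a DAG (the connector_id is a placeholder, and the import paths and parameter names should be checked against the provider version you install):

```python
# Hedged sketch: trigger a Fivetran connector sync and wait for it to finish
# before downstream transformations run. Values below are placeholders.
from datetime import datetime

from airflow import DAG
from fivetran_provider.operators.fivetran import FivetranOperator
from fivetran_provider.sensors.fivetran import FivetranSensor

with DAG(
    dag_id="fivetran_sync_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    trigger_sync = FivetranOperator(
        task_id="trigger_fivetran_sync",
        connector_id="replace_with_connector_id",  # placeholder
        fivetran_conn_id="fivetran",
    )
    wait_for_sync = FivetranSensor(
        task_id="wait_for_fivetran_sync",
        connector_id="replace_with_connector_id",  # placeholder
        fivetran_conn_id="fivetran",
        poke_interval=60,
    )
    trigger_sync >> wait_for_sync
```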
Last year, we shared why we selected Airflow as our next-generation workflow system. This year, we will dive into the journey of migrating 3000+ workflows and 45000+ tasks to Airflow. We will discuss the infrastructure additions to support such loads, the partitioning and prioritization of the different workflow tiers we defined in house, the migration tooling we built to onboard users, the translation layers between our old DSLs and the new one, our internal Kubernetes executor that leverages Pinterest’s Kubernetes fleet, and more. We want to share the technical and usability challenges of carrying out such a large migration over the course of a year, and how we overcame them to successfully migrate 100% of the workflows to Spinner, our in-house workflow platform.
Machine Learning models can add value and insight to many projects, but they can be challenging to put into production due to problems like lack of reproducibility, difficulty maintaining integrations, and sneaky data quality issues. Kedro, a framework for creating reproducible, maintainable, and modular data science code, and Great Expectations, a framework for data validation, are two great open-source Python tools that can address some of these problems. Both integrate seamlessly with Airflow for flexible and powerful ML pipeline orchestration. In this talk we’ll discuss how you can leverage existing Airflow provider packages to integrate these tools and create sustainable, production-ready ML models.
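As a hedged sketch of how such an integration could be laid out in a single DAG (the Kedro project path, pipeline name, and the Great Expectations checkpoint and operator arguments are assumptions to verify against the provider versions you use; kedro-airflow can also generate the Kedro tasks natively instead of the BashOperator shown here):

```python
# Sketch: run a Kedro pipeline, then validate its outputs with Great Expectations.
# Paths and names below are placeholders, not a definitive integration recipe.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

with DAG(
    dag_id="ml_pipeline_with_validation",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the feature-engineering pipeline defined in a Kedro project.
    run_kedro = BashOperator(
        task_id="run_kedro_pipeline",
        bash_command="cd /opt/my_kedro_project && kedro run --pipeline=feature_engineering",
    )
    # Validate the produced data against a Great Expectations checkpoint.
    validate = GreatExpectationsOperator(
        task_id="validate_features",
        data_context_root_dir="/opt/my_kedro_project/great_expectations",
        checkpoint_name="features_checkpoint",
    )
    run_kedro >> validate
```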