At Condé Nast, we have leveraged async/deferrable operators heavily to reduce our Airflow-associated costs. By adopting them across all of our pipelines, we have realized a 54% cost reduction compared with our previous use of non-deferrable operators.
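For readers unfamiliar with the feature, here is a minimal sketch of what switching a sensor to its deferrable (async) variant looks like in Airflow 2.x; the DAG id and target time are illustrative, not Condé Nast’s actual configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.date_time import DateTimeSensor, DateTimeSensorAsync

with DAG(dag_id="deferrable_example", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # Classic sensor: occupies a worker slot for the entire wait.
    wait_blocking = DateTimeSensor(
        task_id="wait_blocking",
        target_time="{{ data_interval_end + macros.timedelta(hours=1) }}",
    )

    # Deferrable sensor: suspends itself and hands the wait to the triggerer,
    # releasing the worker slot, which is where the cost savings come from.
    wait_deferred = DateTimeSensorAsync(
        task_id="wait_deferred",
        target_time="{{ data_interval_end + macros.timedelta(hours=1) }}",
    )
```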
Airflow Summit 2023 program
Sessions & talks
Showing 76–90 of 90 · Newest first
Reliable Airflow DAG Design When Building a Time-Series Data Lakehouse
As a team that has built a Time-Series Data Lakehouse at Bloomberg, we looked for a workflow orchestration tool that could address our growing scheduling requirements. We needed a tool that was reliable and scalable, but also could alert on failures and delays to enable users to recover quickly from them. From using triggers over simple sensors to implementing custom SLA monitoring operators, we explore our choices in designing Airflow DAGs to create a reliable data delivery pipeline that is optimized for failure detection and remediation.
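Airflow provides basic hooks that this kind of failure-and-delay alerting can build on; a minimal sketch, assuming an hourly delivery DAG and a placeholder notification function (the custom SLA monitoring operators mentioned above are Bloomberg-specific and not shown here):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Placeholder: wire in your paging or alerting system here.
    print(f"SLA missed for tasks: {task_list}")


with DAG(
    dag_id="timeseries_delivery",            # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
    sla_miss_callback=notify_sla_miss,
) as dag:
    EmptyOperator(
        task_id="deliver_partition",
        sla=timedelta(minutes=30),           # alert if not done 30 minutes into the interval
    )
```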
Simplifying the Creation of Data Science Pipelines with Airflow
The ability to create DAGs programmatically opens up new possibilities for collaboration between Data Science and Data Engineering. Engineering and DevOps are typically incentivized by stability, whereas Data Science is typically incentivized by fast iteration and experimentation. With Airflow, it becomes possible for engineers to create tools that allow Data Scientists and Analysts to create robust no-code/low-code data pipelines for feature stores. We will discuss Airflow as a means of bridging the gap between data infrastructure and modeling iteration, and examine how a Qbiz customer did just this by creating a tool that allows Data Scientists to build features, train models, and measure performance, using cloud services, in parallel.
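The no-code/low-code bridge described here often comes down to generating tasks from a declarative config that analysts edit instead of DAG code; a minimal sketch, with a hypothetical feature dictionary standing in for whatever the real tool reads from YAML or a UI:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical config a Data Scientist edits instead of writing DAG code.
FEATURES = {
    "user_activity_7d": {"source_table": "events", "window_days": 7},
    "order_value_30d": {"source_table": "orders", "window_days": 30},
}


def build_feature(source_table, window_days, **_):
    # Placeholder for the real feature computation and feature-store write.
    print(f"building feature from {source_table} over {window_days} days")


with DAG(dag_id="feature_store_pipeline", start_date=datetime(2023, 1, 1), schedule="@daily") as dag:
    for name, cfg in FEATURES.items():
        PythonOperator(
            task_id=f"build_{name}",
            python_callable=build_feature,
            op_kwargs=cfg,
        )
```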
Sketching Pipelines Using DAG Authoring UI
Cloudera Data Engineering (CDE) is a serverless service for Cloudera Data Platform that allows you to submit various Spark jobs and Airflow DAGs to an auto-scaling cluster. Authoring workloads as Python DAG files may be the usual approach, but it is not the most convenient one for every user, since it requires background knowledge of the syntax, the programming language, and the conventions of Airflow. The DAG Authoring UI is a tool built on top of the Airflow APIs that lets you use a graphical user interface to create, manage, and destroy complex DAGs. It gives you the ability to perform tasks on Airflow without needing to know DAG structure, the Python programming language, or the internals of Airflow. CDE has identified multiple operators for performing various tasks on Airflow by carefully categorising the use cases. The operators range from BashOperator and PythonOperator to CDEJobRunOperator and CDWJobRunOperator, and most use cases can be run as combinations of the operators provided.
Are you tired of spending countless hours testing your data pipelines, only to find that they don’t work as expected? Do you wish there was a better way to manage your data versions and streamline your testing processes? If so, this presentation is for you! Join us as we explore the problem domain of testing environments for data pipelines and take a deep dive into the tools currently in use. We’ll introduce you to the game-changing concepts of data versioning and lakeFS and show you how to integrate these tools with Airflow to revolutionize your testing workflows. But don’t just take our word for it: witness this power firsthand with a live demo of testing an Airflow DAG against a snapshot of production data, putting the tools and techniques covered in the presentation into practice. So don’t miss out on this opportunity to supercharge your testing processes and take your data pipelines to the next level.
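As a rough idea of the pattern, here is a minimal sketch that creates a lakeFS branch (a zero-copy snapshot of production data) from within a DAG using the lakectl CLI via BashOperator; the repository and branch names are illustrative, and lakeFS also publishes an Airflow provider with dedicated operators for these steps:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="dag_test_on_snapshot", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # Branch off production data: a zero-copy snapshot to run the DAG against.
    create_branch = BashOperator(
        task_id="create_test_branch",
        bash_command=(
            "lakectl branch create lakefs://example-repo/test-{{ ds_nodash }} "
            "--source lakefs://example-repo/main"
        ),
    )

    # The pipeline tasks under test would read from and write to the
    # test-{{ ds_nodash }} branch here; the branch can be discarded afterwards.
```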
Supporting the Vast Airflow Community: Lessons learned from over 100 Airflow webinars
Astronomer has hosted over 100 Airflow webinars designed to educate and inform the community on best practices, use cases, and new features. The goal of these events is to increase Airflow’s adoption and ensure everybody, from new users to experienced power users, can keep up with a project that is evolving faster than ever. When new releases come out every few months, it can be easy to get stuck on past versions of Airflow. Instead, we want existing users to know how new features can make their lives easier, new users to know that Airflow can support their use case, and everybody to know how to implement the features they need and get them to production. This talk will cover some of the key learnings we’ve gathered from 2.5 years of conducting webinars aimed at supporting the community in growing their Airflow use, including how to tailor DevRel efforts to the many different types of Airflow users and how to effectively push for the adoption of new Airflow features.
Talking With Management About Open Source
For those of us who already know how important open source is, it can be challenging to persuasively make the case to management, because we assume that everyone already knows the basics. This can work against us, confusing our audience and making us come across as condescending or concerned about irrelevant lofty philosophical points. In this talk, we take it back to the basics. What does management actually need to know about open source, why it matters, and how to make decisions about consuming open source, contributing to open source, and open sourcing company code?
Testing Airflow DAGs with Dagtest
For the DAG owner, testing Airflow DAGs can be complicated and tedious. Kubectl cp your DAG from local to a pod, exec into the pod, and run a command? Install Breeze? Why pull the Airflow image and start up the webserver / scheduler / triggerer if all we want is to test the addition of a new task? It doesn’t have to be this hard. At Etsy, we’ve simplified testing DAGs for the DAG owner with dagtest. Dagtest is a Python package that we host on our internal PyPI. It is a small client binary that makes HTTP requests to a test API. The test API is a simple Flask server that receives these requests and builds pods to run airflow dags backfill commands based on the options provided via dagtest. The simplest of these is a dry run. Typically, users run test runs where the DAG executes E2E for a single ds. Equally important is the environment setup: we use an ad hoc Airflow instance in a separate GCP environment with a service account that cannot write to production buckets. This talk will discuss both.
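To make that architecture concrete, here is a much-simplified sketch of the idea, not Etsy’s actual implementation: a tiny Flask endpoint that accepts a request from a client and shells out to airflow dags backfill for a single date, whereas the real service launches Kubernetes pods rather than a local subprocess:

```python
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/test", methods=["POST"])
def run_test():
    body = request.get_json()
    dag_id = body["dag_id"]
    ds = body["ds"]                         # single logical date to backfill
    dry_run = body.get("dry_run", False)    # the simplest mode described above

    cmd = ["airflow", "dags", "backfill", dag_id, "--start-date", ds, "--end-date", ds]
    if dry_run:
        cmd.append("--dry-run")

    # In the real service this would create a pod in a sandboxed environment
    # whose service account cannot write to production buckets.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return jsonify({"returncode": result.returncode, "stdout": result.stdout[-2000:]})
```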
The Why and How of Running a Self-Managed Airflow on Kubernetes
Today, all major cloud service providers and third-party providers include Apache Airflow as a managed service offering in their portfolios. While these cloud-based solutions help with the undifferentiated heavy lifting of environment management, some data teams are also looking to operate self-managed Airflow instances to satisfy specific differentiated capabilities. In this session, we will talk about: why you might need to run self-managed Airflow; the available deployment options (with emphasis on Airflow on Kubernetes); how to deploy Airflow on Kubernetes using automation (Helm charts & Terraform); developer experience (syncing DAGs using automation); operator experience (observability); and owned responsibilities and tradeoffs. This end-to-end perspective will help you operate a highly available and scalable self-managed Airflow environment that meets your ever-growing workflow needs.
Things to Consider When Building an Airflow Service
Data platform teams often find themselves in a situation where they have to provide Airflow as a service to downstream teams, as more users and use cases in their organization require an orchestrator. In these situations, giving each team its own Airflow environment can unlock velocity and actually be lower overhead to maintain than a monolithic environment. This talk will cover things to keep in mind when building an Airflow service that supports several environments, personas of users, and use cases. Namely, we’ll discuss principles for balancing centralized control over the data platform with decentralized teams using Airflow in the way that they need. This will include observability, developer productivity, security, and infrastructure. We’ll also talk about day-2 concerns around overhead, infrastructure maintenance, and other tradeoffs to consider.
As much as we love Airflow, local development has been a bit of a white whale through much of its history. Until recently, Airflow’s local development experience was hindered by the need to spin up a scheduler and webserver. In this talk, we will explore the latest innovation in Airflow local development: the dag.test() functionality introduced in Airflow 2.5. We will delve into practical applications of dag.test(), which empowers users to run and debug Airflow DAGs locally in a single Python process. This new functionality significantly improves the development experience, enabling faster iteration and deployment. In this presentation, we will discuss: how to leverage IDE support for code completion, linting, and debugging; techniques for inspecting and debugging DAG output; and best practices for unit testing DAGs and their underlying functions. Accessible to Airflow users of all levels, this session explores the future of Airflow local development. Join us!
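A minimal sketch of the pattern (Airflow 2.5+), using TaskFlow tasks with invented names; running the file directly executes the whole DAG in one local Python process, with no scheduler, webserver, or executor needed:

```python
from datetime import datetime

from airflow.decorators import dag, task


@task
def extract():
    return {"rows": 3}


@task
def load(payload: dict):
    print(f"loading {payload['rows']} rows")


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def my_pipeline():
    load(extract())


dag_object = my_pipeline()

if __name__ == "__main__":
    # Run and debug locally: python my_pipeline.py, or launch under your IDE's debugger.
    dag_object.test()
```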
Traps and Misconceptions of Running Reliable Workloads in Apache Airflow
Reliability is a complex and important topic. I will focus on both the definition of reliability and best practices. I will begin by reviewing the Apache Airflow components that impact reliability, and then examine those aspects, showing the single points of failure, mitigations, and tradeoffs. The journey starts with the scheduling process: I will focus on the aspects of Scheduler infrastructure and configuration that address reliability improvements. The Scheduler doesn’t run in a vacuum, so I will also share my observations on the reliability of its underlying infrastructure. We recommend that tasks be idempotent, but that is not always possible; I will share the challenges of running user code in the distributed architecture of Cloud Composer. I will also cover the volatility of some cloud resources and mitigation methods in various scenarios. Deferrability plays an important part in reliability, but there are other elements we shouldn’t ignore.
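As a small illustration of the idempotency-plus-retries theme, a sketch of a task that overwrites the partition for its logical date, so a retry after a transient failure produces the same result; the load.py script and its flags are hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                           # ride out transient infrastructure failures
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
}

with DAG(
    dag_id="reliable_load",                 # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # Idempotent by design: the run's partition is overwritten, not appended to.
    BashOperator(
        task_id="load_partition",
        bash_command="python load.py --date {{ ds }} --mode overwrite",  # hypothetical script
    )
```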
In this session, we’ll explore the inner workings of our warehouse allocation service and its many benefits. We’ll discuss how you can integrate these principles into your own workflow and provide real-world examples of how this technology has improved our operations. From reducing queue times to making smart decisions about warehouse costs, warehouse allocation has helped us streamline our operations and drive growth. With its seamless integration with Airflow, building an in-house warehouse allocation pipeline is simple and can easily fit into your existing workflow. Join us for this session to unlock the full potential of this in-house service and take your operations to the next level. Whether you’re a data engineer or a business owner, this technology can help you improve your bottom line and streamline your operations. Don’t miss out on this opportunity to learn more and optimize your workflow with the warehouse allocation service.
Using Dynamic Task Mapping to Orchestrate dbt
Airflow, traditionally used by Data Engineers, is now popular among Analytics Engineers who aim to provide analysts with high-quality tooling while adhering to software engineering best practices. dbt, an open-source project that uses SQL to create data transformation pipelines, is one such tool. One approach to orchestrating dbt using Airflow is using dynamic task mapping to automatically create a task for each sub-directory inside dbt’s staging, intermediate, and marts directories. This enables analysts to write SQL code that is automatically added as a dedicated task in Airflow at runtime. Combining this new Airflow feature with dbt best practices offers several benefits, such as analysts not needing to make Airflow changes and engineers being able to re-run subsets of dbt models should errors occur. In this talk, I would like to share some lessons I have learned while successfully implementing this approach for several clients.
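A minimal sketch of the approach under a few assumptions: a hypothetical dbt project path, dbt installed on the workers, and Airflow 2.4+ for dynamic task mapping. It creates one mapped dbt run task per sub-directory of the marts layer:

```python
import os
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/opt/dbt/project"        # hypothetical project location


@task
def list_model_dirs(layer: str) -> list:
    # One entry per sub-directory of the layer, e.g. "marts/finance".
    base = os.path.join(DBT_PROJECT_DIR, "models", layer)
    return [f"{layer}/{d}" for d in sorted(os.listdir(base))]


@dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
def dbt_marts():
    # A dedicated mapped task per sub-directory appears in the UI at runtime.
    BashOperator.partial(
        task_id="dbt_run",
        cwd=DBT_PROJECT_DIR,
    ).expand(
        bash_command=list_model_dirs("marts").map(
            lambda subdir: f"dbt run --select path:models/{subdir}"
        )
    )


dbt_marts()
```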
Airflow is a powerful tool for orchestrating complex data workflows, and it has undergone significant changes over the past two years. Since the Airflow release cycle has accelerated, you may struggle to keep up with the continuous flow of new features and improvements, which can lead to missed opportunities for addressing new use cases or solving your existing ones more efficiently. This presentation is intended to give you a solid update on the possibilities of Airflow and to address misconceptions you may have heard, or still believe, that used to be valid but no longer are. At the end of this session, you will be able to use the essential features of Airflow, such as the TaskFlow API, Datasets, and Dynamic Task Mapping, and you will know precisely what Airflow can and can’t do today. Fasten your seatbelt, take a deep breath, and let’s go 🚀
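As a taste of one of those features, a minimal sketch of data-aware scheduling with Datasets and the TaskFlow API (Airflow 2.4+); the dataset URI and DAG names are illustrative:

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders = Dataset("s3://example-bucket/orders.parquet")   # illustrative URI


@dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
def producer():
    @task(outlets=[orders])
    def publish_orders():
        print("writing orders.parquet")

    publish_orders()


@dag(start_date=datetime(2023, 1, 1), schedule=[orders], catchup=False)
def consumer():
    @task
    def build_report():
        print("orders updated, rebuilding the report")

    build_report()


producer()
consumer()
```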