We are continuing to modernize the Airflow UI to make it easier to manage all aspects of your DAGs. See a demo of the latest updates and improve your workflows with new tips and tricks. Then get a preview of what else is coming soon, followed by a Q&A where attendees can raise their own use cases and explore new ideas for improving the user experience.
Airflow Summit 2023 program
Sessions & talks
New to Airflow or haven’t followed any of the recent DAG authoring enhancements? This talk is for you! We will go through various DAG authoring features such as Setup/Teardown tasks (~2.7), Datasets (2.4), Dynamic Tasks (2.3) and Async tasks (2.2). You won’t be an expert after this short talk, but you’ll have a head start when you write your next DAG, no hacks required.
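For orientation, here is a minimal sketch (not from the talk) combining two of the features mentioned above - a Dataset-triggered DAG and dynamic task mapping - assuming Airflow 2.4+ and illustrative names:

```python
# A minimal sketch, assuming Airflow 2.4+; the Dataset URI and partition
# values are illustrative placeholders.
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders = Dataset("s3://example-bucket/orders.parquet")  # placeholder URI

@dag(schedule=[orders], start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def downstream_example():
    @task
    def list_partitions():
        return ["2023-09-01", "2023-09-02"]  # illustrative values

    @task
    def process(partition: str):
        print(f"processing {partition}")

    # Dynamic task mapping (Airflow 2.3+): one task instance per partition.
    process.expand(partition=list_partitions())

downstream_example()
```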
How to use Data Contracts for Data Quality in your Airflow Ecosystem
Data contracts have been much discussed in the community of late, with a lot of curiosity around how to approach this concept in practice. We believe data contracts need a harmonizing layer to manage data quality in a uniform manner across a fragmented stack. We are calling this harmonizing layer the Control Plane for Data - powered by the common thread across these systems: metadata. For teams already orchestrating pipelines with Airflow, data contracts can be an effective way to process data that meets preset quality standards. With a control plane as a connecting layer, producers can build data contracts that consumers can rely on, ensuring DAGs only run when a contract is valid. Producers can govern how workflows should behave, and consumers receive the tooling they need to opt into only high-quality data. Learn how to use data contracts and DataHub to make your Airflow pipelines more reliable - as well as other use cases that can help build a simpler, more flexible data stack.
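As a rough illustration of the “only run when a contract is valid” idea, here is a hedged sketch; the validate_contract helper is hypothetical and stands in for a real lookup against a control plane such as DataHub:

```python
# Illustrative sketch only: gate downstream tasks on a contract check.
# validate_contract() is a hypothetical helper, not a DataHub API.
import pendulum
from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import ShortCircuitOperator

def validate_contract():
    # Hypothetical check; replace with a real query against your control plane.
    return True  # returning False skips (short-circuits) downstream tasks

@dag(start_date=pendulum.datetime(2023, 1, 1), schedule="@daily", catchup=False)
def contract_gated_pipeline():
    gate = ShortCircuitOperator(task_id="contract_is_valid", python_callable=validate_contract)
    transform = EmptyOperator(task_id="transform")
    gate >> transform

contract_gated_pipeline()
```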
Introducing airflowctl: A CLI to streamline getting started with Airflow
New users starting with Airflow frequently encounter several challenges, ranging from the complexities of containers and virtual environments to Python dependency hell. Moreover, their familiarity with tools such as Docker, docker-compose, and Helm might be limited - and those tools may even be overkill. Seasoned Airflow users, in contrast, encounter their own problems: configuration conflicts with ongoing Airflow projects, intricacies stemming from Docker and docker-compose configurations, and a lack of visibility across all their projects. With airflowctl, users can install and set up Airflow with a single command. Existing users can use it to manage multiple Airflow projects with different Airflow versions on the same machine, making it seamless to create and debug DAGs in an IDE. Agenda: why airflowctl, its goals, current functionality and a demo, and the vision/roadmap.
Lightning talks and event wrap-up
We will close Airflow Summit with lightning talks (5 minutes each). You will be able to sign up during the event. We will only have space for 10 talks.
Manifest destiny: Orchestrating dbt using Airflow
Airflow is a popular choice for organizations looking to integrate open-source dbt within their existing data infrastructure. This talk will explore two primary methods of running dbt in Airflow: job-level and model-level. We’ll discuss the tradeoffs associated with each approach, highlighting the simplicity and efficiency of job-level orchestration, contrasted with the enhanced observability and control provided by model-level orchestration. We’ll also explain how the balance has shifted in recent years, with improvements to dbt core making model-level more efficient and innovative Airflow extensions like Cosmos making it easier to implement. Finally, we’ll provide benchmarks to help you determine which paradigm is the best fit for your needs.
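For context, a minimal job-level orchestration sketch (illustrative paths, not from the talk); model-level tools such as Cosmos instead expand the dbt manifest into one Airflow task per model:

```python
# A minimal job-level sketch: a single Airflow task runs the whole dbt project.
# The project and profiles directories are illustrative placeholders.
import pendulum
from airflow.decorators import dag
from airflow.operators.bash import BashOperator

@dag(start_date=pendulum.datetime(2023, 1, 1), schedule="@daily", catchup=False)
def dbt_job_level():
    BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir /opt/dbt/my_project --profiles-dir /opt/dbt",
    )

dbt_job_level()
```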
Apache Airflow has over 650 Python dependencies. In case you did not know already, dependencies in Python are a difficult subject, and Airflow has its own, custom ways of managing them. Airflow has a rather complex system to manage dependencies in its CI system, but this talk is not about that. This talk is directed at Airflow users who want to keep their dependencies updated, describing ways they can do it. This presentation will explain how to effectively manage and handle custom dependencies in Airflow. Jarek will guide you through practical solutions and best practices to make your Airflow experience with dependencies - yes, you guessed it - a breeze.
Micropipelines: A microservice approach for DAG authoring using datasets
Introduced in Airflow 2.4, Datasets are a foundational feature for authoring modular data pipelines. As DAGs grow to encompass more data sources and multiple data transformation steps, they typically become less predictable in the timeliness of their execution and less efficient. This talk focuses on leveraging Datasets to enable predictable and more efficient DAGs by applying patterns from microservice architectures. Just as large monolithic applications were decomposed into microservices to deliver more efficient scalability and faster development cycles, micropipelines have the same potential to radically transform data pipeline efficiency and development velocity. Using a simple financial analysis pipeline example, with data aggregation done in Snowflake and prediction analysis in Spark, this talk outlines how to retain the timeliness of data pipelines while expanding datasets.
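A hedged sketch of the producer/consumer split described above (the Dataset URI and task bodies are placeholders, not the talk’s actual pipeline):

```python
# Producer updates a Dataset; the consumer DAG is scheduled on it, so each
# "micropipeline" stays small and independent. Names are illustrative.
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

daily_aggregates = Dataset("snowflake://analytics/daily_aggregates")  # placeholder URI

@dag(start_date=pendulum.datetime(2023, 1, 1), schedule="@daily", catchup=False)
def aggregate_in_snowflake():
    @task(outlets=[daily_aggregates])
    def aggregate():
        ...  # run the Snowflake aggregation here

    aggregate()

@dag(start_date=pendulum.datetime(2023, 1, 1), schedule=[daily_aggregates], catchup=False)
def predict_in_spark():
    @task
    def predict():
        ...  # submit the Spark prediction job here

    predict()

aggregate_in_snowflake()
predict_in_spark()
```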
Migrate Apache Oozie Workflows to Airflow and Run with Amazon EMR
Learn how to convert Oozie Workflows into Airflow DAGs and run them on Amazon EMR. The utility supports Airflow 2.4.3. This utility is built on top of https://github.com/GoogleCloudPlatform/oozie-to-airflow
Migrating from Enterprise Scheduler like Autosys, TIDAL, Stonebranch to Airflow
How we migrated from Autosys - with thousands of jobs, 800+ dependencies, and SLAs to be met every hour - at a prominent Canadian bank. The case for migrating from an enterprise scheduler: money spent on every license and renewal; SLAs, monitoring, auditing, and DevOps integration; vendor lock-in; and integration with multiple providers.
This session is about the current state of implementation of Airflow’s multi-tenancy feature. This is a long-term feature that involves multiple changes and separate AIPs to implement, with the long-term vision of a single Airflow instance supporting multiple, independent teams - either from the same company or as part of an Airflow-as-a-Service offering.
My Journey to Committer Status: What I learned and how it can help you
Apache Airflow is one of the largest Apache projects by many metrics, but it ranks particularly high in the number of contributors involved in the project. This leads to hundreds of GitHub Issues, Pull Requests and Discussions being submitted to the project every month, so it is critical to have an ample number of Committers to support the community. In this talk I will summarize my personal experience working towards, and ultimately achieving, committer status in Apache Airflow. I’ll cover the lessons I learned along the way as well as provide some advice and best practices to help others achieve committer status themselves.
OpenLineage in Airflow: A Comprehensive Guide
With native support for OpenLineage in Airflow, users can now observe and manage their data pipelines with ease. This talk will cover the benefits of using OpenLineage, how it is implemented in Airflow, practical examples of how to take advantage of it, and what’s in our roadmap. Whether you’re an Airflow user or provider maintainer, this session will give you the knowledge to make the most of this tool.
Operators form the core of the language of Airflow. In this talk I will argue that while they have served their purpose, they are holding back the development of Airflow, and that if Airflow wants to stay relevant in the world of the ’new’ data stack (hint: it isn’t currently considered to be part of it) and the self-service data mesh, it needs to kill its darling.
Open source doc edits provide a low-stakes way for new users to make a first contribution. Ideally, new users find opportunities and feel welcome to fix docs as they learn, engaging with the community from the start. But I found that contributing docs to Airflow had some surprising obstacles. In this talk, I’ll share my first docs contribution journey, including problems and fixes. For example, you must understand how Airflow uses Sphinx and know when to edit in the GitHub UI versus locally. But it wasn’t documented that GitHub renders previews only for Markdown, and since Sphinx uses its own markup, you must build the docs locally to check formatting - an opportunity for me to add to the Contributor Guide for docs. In addition to examples of reducing obstacles, this talk covers the importance of docs for community and the resources available to start writing. If you already contribute and want to create opportunities for others, I’ll also share characteristics of good first issues and docs projects.
Platform for Genomic Processing Using Airflow and ECS
High-scale orchestration of genomic algorithms using Airflow workflows, AWS Elastic Container Service (ECS), and Docker. Genomic algorithms are highly demanding of CPU, RAM, and storage, and our data science team requires a platform to facilitate the development and validation of proprietary algorithms. The data engineering team develops a research data platform that enables data scientists to publish Docker images to AWS ECR and run them using Airflow DAGs that provision ECS compute on EC2 and Fargate. We will describe a research platform that allows our data science team to check their algorithms on ~1000 cases in parallel, using the Airflow UI and dynamic DAG generation to utilize EC2 machines, auto-scaling groups, and ECS clusters across multiple AWS regions.
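As a rough illustration of the fan-out pattern described above (the cluster, task definition, container name, and case IDs are placeholders, and the Amazon provider’s EcsRunTaskOperator is assumed), one option looks like this:

```python
# Hedged sketch: dynamic task mapping launches one ECS task per case.
# All names below are illustrative, not the actual platform's.
import pendulum
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

@dag(start_date=pendulum.datetime(2023, 1, 1), schedule=None, catchup=False)
def genomic_batch():
    @task
    def build_overrides():
        # One containerOverrides payload per genomic case (illustrative IDs).
        return [
            {"containerOverrides": [{"name": "algo", "command": ["--case", f"case-{i}"]}]}
            for i in range(1000)
        ]

    EcsRunTaskOperator.partial(
        task_id="run_algorithm",
        cluster="genomics-cluster",        # placeholder
        task_definition="genomic-algo:1",  # placeholder
        launch_type="EC2",
    ).expand(overrides=build_overrides())

genomic_batch()
```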
Reducing Costs by Maximizing Airflow and DAG Performance
Airflow DAGs are Python code (which can do pretty much anything you want) and Airflow has hundreds of configuration options (which can dramatically change Airflow’s behavior). Those two facts lead to endless combinations that can run the same workloads, but only a precious few are efficient. The rest will result in failed tasks and excessive compute usage, costing time and money. This talk will demonstrate how small changes can yield big dividends, and reveal some code improvements and Airflow configurations that can reduce costs and maximize performance.
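For a flavor of the kind of knobs involved, here is a small sketch (the values are illustrative, not recommendations from the talk) of DAG- and task-level settings that commonly affect cost:

```python
# Illustrative DAG- and task-level settings that bound concurrency and runtime.
# The specific values are placeholders, not tuned recommendations.
from datetime import timedelta
import pendulum
from airflow.decorators import dag, task

@dag(
    start_date=pendulum.datetime(2023, 1, 1),
    schedule="@hourly",
    catchup=False,          # avoid accidentally backfilling months of runs
    max_active_runs=1,      # one run at a time
    max_active_tasks=8,     # cap parallelism within a run
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "execution_timeout": timedelta(minutes=30),  # kill runaway tasks
    },
)
def tuned_dag():
    @task
    def work():
        ...

    work()

tuned_dag()
```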
At Condé Nast, we have heavily leveraged async/deferrable operators to reduce our Airflow-associated costs. By implementing async/deferrable operators in all of our pipelines, we have been able to realize a cost reduction of 54% compared with our previous usage of non-async/deferrable operators.
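For readers new to the pattern, a hedged illustration of what “deferrable” means in practice, using a core async sensor (many provider operators expose a deferrable=True flag for the same effect):

```python
# While waiting, an async/deferrable sensor hands off to the triggerer instead
# of occupying a worker slot; that idle worker time is where the savings come from.
from datetime import timedelta
import pendulum
from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator
from airflow.sensors.time_delta import TimeDeltaSensorAsync

@dag(start_date=pendulum.datetime(2023, 1, 1), schedule="@daily", catchup=False)
def deferrable_example():
    # Waits two hours past the data interval without holding a worker slot.
    wait = TimeDeltaSensorAsync(task_id="wait_for_upstream", delta=timedelta(hours=2))
    process = EmptyOperator(task_id="process")
    wait >> process

deferrable_example()
```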
Reliable Airflow DAG Design When Building a Time-Series Data Lakehouse
As a team that has built a Time-Series Data Lakehouse at Bloomberg, we looked for a workflow orchestration tool that could address our growing scheduling requirements. We needed a tool that was reliable and scalable, but also could alert on failures and delays to enable users to recover quickly from them. From using triggers over simple sensors to implementing custom SLA monitoring operators, we explore our choices in designing Airflow DAGs to create a reliable data delivery pipeline that is optimized for failure detection and remediation.
Simplifying the Creation of Data Science Pipelines with Airflow
The ability to create DAGs programmatically opens up new possibilities for collaboration between Data Science and Data Engineering. Engineering and DevOps are typically incentivized by stability, whereas Data Science is typically incentivized by fast iteration and experimentation. With Airflow, it becomes possible for engineers to create tools that allow Data Scientists and Analysts to create robust no-code/low-code data pipelines for feature stores. We will discuss Airflow as a means of bridging the gap between data infrastructure and modeling iteration, and examine how a Qbiz customer did just this by creating a tool that allows Data Scientists to build features, train models and measure performance, using cloud services, in parallel.
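As a rough sketch of the idea (the specs, table names, and schedules below are invented for illustration): engineers own a small factory, while data scientists contribute declarative specs that are turned into DAGs.

```python
# A minimal config-driven DAG factory sketch. FEATURE_SPECS is an illustrative
# stand-in for whatever declarative format data scientists would author.
import pendulum
from airflow.decorators import dag, task

FEATURE_SPECS = {
    "user_activity_features": {"schedule": "@daily", "source_table": "events.user_activity"},
    "purchase_features": {"schedule": "@hourly", "source_table": "events.purchases"},
}

def build_feature_dag(name: str, spec: dict):
    @dag(dag_id=name, start_date=pendulum.datetime(2023, 1, 1),
         schedule=spec["schedule"], catchup=False)
    def feature_pipeline():
        @task
        def compute_features():
            print(f"computing features from {spec['source_table']}")

        compute_features()

    return feature_pipeline()

# Register one DAG per spec so the scheduler can discover them.
for dag_name, dag_spec in FEATURE_SPECS.items():
    globals()[dag_name] = build_feature_dag(dag_name, dag_spec)
```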
Sketching Pipelines Using DAG Authoring UI
Cloudera Data Engineering (CDE) is a serverless service for Cloudera Data Platform that allows you to submit various Spark jobs and Airflow DAGs to an auto-scaling cluster. Running your workloads as Python DAG files may be the usual approach, but it is not the most convenient one for some users, as it requires a lot of background on syntax, the programming language, the conventions of Airflow, and so on. The DAG Authoring UI is a tool built on top of Airflow APIs that lets you use a graphical user interface to create, manage, and destroy complex DAGs. It gives you the ability to perform tasks on Airflow without having to know DAG structure, the Python programming language, or the internals of Airflow. CDE has identified multiple operators to perform various tasks on Airflow by carefully categorising the use cases. The operators range from BashOperator and PythonOperator to CDEJobRunOperator and CDWJobRunOperator, and most use cases can be run as combinations of the operators provided.
Are you tired of spending countless hours testing your data pipelines, only to find that they don’t work as expected? Do you wish there was a better way to manage your data versions and streamline your testing processes? If so, this presentation is for you! Join us as we explore the problem domain of testing environments for data pipelines and take a deep dive into the tools currently in use. We’ll introduce you to the game-changing concepts of data versioning and lakeFS and show you how to integrate these tools with Airflow to revolutionize your testing workflows. But don’t just take our word for it - witness this power firsthand with a live demo of testing an Airflow DAG on a snapshot of production data, putting the tools and techniques covered in the presentation into practice. So don’t miss out on this opportunity to supercharge your testing processes and take your data pipelines to the next level.
Supporting the Vast Airflow Community: Lessons learned from over 100 Airflow webinars
Astronomer has hosted over 100 Airflow webinars designed to educate and inform the community on best practices, use cases, and new features. The goal of these events is to increase Airflow’s adoption and ensure everybody, from new users to experienced power users, can keep up with a project that is evolving faster than ever. When new releases come out every few months, it can be easy to get stuck on past versions of Airflow. Instead, we want existing users to know how new features can make their lives easier, new users to know that Airflow can support their use case, and everybody to know how to implement the features they need and get them to production. This talk will cover some of the key learnings we’ve gathered from 2.5 years of conducting webinars aimed at supporting the community in growing their Airflow use, including how to best tailor DevRel efforts to the many different types of Airflow users and how to effectively push for the adoption of new Airflow features.
Testing Airflow DAGs with Dagtest
For the DAG owner, testing Airflow DAGs can be complicated and tedious. kubectl cp your DAG from local to a pod, exec into the pod, and run a command? Install Breeze? Why pull the Airflow image and start up the webserver / scheduler / triggerer if all we want is to test the addition of a new task? It doesn’t have to be this hard. At Etsy, we’ve simplified testing DAGs for the DAG owner with dagtest. Dagtest is a Python package that we house on our internal PyPI. It is a small client binary that makes HTTP requests to a test API. The test API is a simple Flask server that receives these requests and builds pods to run airflow dags backfill commands based on the options provided via dagtest. The simplest of these is a dry run. Typically, users run test runs where the DAG executes E2E for a single ds. Equally important is the environment setup: we use an ad hoc Airflow instance in a separate GCP environment with a service account that cannot write to production buckets. This talk will discuss both.
The Why and How of Running a Self-Managed Airflow on Kubernetes
Today, all major cloud service providers and third-party providers include Apache Airflow as a managed service offering in their portfolios. While these cloud-based solutions help with the undifferentiated heavy lifting of environment management, some data teams are also looking to operate self-managed Airflow instances to satisfy specific differentiated capabilities. In this session, we will talk about: why you might need to run self-managed Airflow; the available deployment options (with emphasis on Airflow on Kubernetes); how to deploy Airflow on Kubernetes using automation (Helm charts & Terraform); developer experience (syncing DAGs using automation); operator experience (observability); and owned responsibilities and tradeoffs. This will give you an end-to-end perspective on operating a highly available and scalable self-managed Airflow environment to meet your ever-growing workflow needs.