
Event

Airflow Summit 2024

2024-07-01 Airflow Summit

Activities tracked

98

Airflow Summit 2024 program

Sessions & talks

Showing 51–75 of 98 · Newest first


How the Airflow Community Productionizes Generative AI

2024-07-01
session
Pete DeJoy (Astronomer)

Every data team out there is being asked by their business stakeholders about Generative AI. Taking LLM-centric workloads to production is not a trivial task. At the foundational level, there is a set of challenges around data delivery, data quality, and data ingestion that mirrors traditional data engineering problems. Once you’re past those, there’s a set of challenges related to the underlying use case you’re trying to solve. Thankfully, because of how Airflow was already being used at these companies for data engineering and MLOps use cases, it has become the de facto orchestration layer behind many GenAI use cases for startups and Fortune 500s. This talk will be a tour of various methods, best practices, and considerations used in the Airflow community when taking GenAI use cases to production. We’ll focus on four primary use cases (RAG, fine-tuning, resource management, and batch inference) and walk through patterns different members of the community have used to productionize this new, exciting technology.
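
As a taste of the data-delivery side of RAG, here is a minimal sketch of an ingestion DAG written with the TaskFlow API; the extract/embed/load steps and their contents are hypothetical placeholders, not patterns taken from the talk itself.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def rag_ingestion():
    @task
    def extract() -> list[str]:
        # Pull raw documents from a source system (placeholder).
        return ["doc one", "doc two"]

    @task
    def embed(docs: list[str]) -> list[list[float]]:
        # An embedding-model call would go here; fixed vectors stand in for it.
        return [[0.0, 0.0, 0.0] for _ in docs]

    @task
    def load(vectors: list[list[float]]) -> None:
        # Upsert embeddings into a vector store (placeholder).
        print(f"loaded {len(vectors)} vectors")

    load(embed(extract()))


rag_ingestion()
```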

How we Run 100 Airflow Environments and Millions of Tasks as a Part Time Job Using Kubernetes

2024-07-01
session

Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. We have more than 100 teams who run a variety of workloads that benefit from orchestration and parallelization. Platform engineers working for companies with K8s ecosystems can use their Kubernetes knowledge and leverage their platform to run Airflow and troubleshoot problems successfully. BAM’s Kubernetes platform provides production-ready Airflow environments that automatically get logging, metrics, alerting, scalability, storage from a range of file systems, authentication, dashboards, secrets management, and specialized compute including GPU, CPU-optimized, memory-optimized, and even Windows. If you can run thousands of Pods on your Kubernetes cluster, then you can run thousands of tasks without needing to do anything! The intention of this talk is to cover:
- Why K8s and Airflow work so well together
- How a team of platform engineers can leverage their Kubernetes platform and knowledge to run millions of tasks without Airflow being their primary focus
- Examples of where this model can start to fall apart at extreme scale
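
One concrete way this pairing shows up in DAG code (a generic sketch, not BAM's actual setup) is the KubernetesExecutor's executor_config, which lets a single task request specialized compute such as a GPU node:

```python
from datetime import datetime

from kubernetes.client import models as k8s

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG("gpu_task_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False):
    train = PythonOperator(
        task_id="train_on_gpu",
        python_callable=lambda: print("training"),
        executor_config={
            # The KubernetesExecutor merges this override into the worker pod spec.
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # must match the worker container's name
                            resources=k8s.V1ResourceRequirements(
                                limits={"nvidia.com/gpu": "1"}
                            ),
                        )
                    ]
                )
            )
        },
    )
```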

How we Tuned our Airflow to Make 1.2 million DAG Runs - per day!

2024-07-01
session

As we deployed Airflow in our enterprise, connected to various event sources to implement our data-driven pipelines, we were faced with event storms a couple of times. Because such event storms often arrived unplanned and in waves of increasing load, we tuned the setup iteratively. At times we were in a panic and had to add quick workarounds. At an early peak of 1,000 triggers an hour, we were happy that the workload simply queued, but at a certain point we started tuning the setup in earnest. Over roughly 10-20 iterations, which we would like to share as best practices, we tuned standard parameters, increased resources, changed integration strategies, and developed patches to the core scheduler. This talk is a retrospective of those steps, sharing tuning options and scaling strategies. It was a long way from fearing a queue that degraded performance at 10,000 runs to handling a peak event reception of 400k runs in an hour. You might also hear about some anti-patterns we learned from.

How we use Airflow at Booking to Orchestrate Big Data Workflows

2024-07-01
session

The talk will cover how we use Airflow at the heart of our Workflow Management Platform (WFM) at Booking.com, enabling our internal users to orchestrate big data workflows on Booking Data Exchange (BDX). A high-level overview of the talk:
- Adapting the open-source Airflow Helm chart to spin up Airflow installations in Booking Kubernetes Service (BKS)
- Coming up with a workflow definition format (YAML)
- Conversion of workflow.yaml files to workflow.py DAGs (sketched below)
- Usage of deferrable operators to provide standard step templates to users
- Workspaces (collections of workflows), used to enforce role-based access to DAG permissions for users
- Using Okta for authentication
- Alerting, monitoring, logging
- Plans to shift to Astronomer
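
As a rough illustration of the YAML-to-DAG conversion (the schema and paths here are invented for the sketch, not Booking.com's actual format), a DAG factory can loop over workflow files and emit one DAG per spec:

```python
from datetime import datetime
from pathlib import Path

import yaml

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Each workflow.yaml might look like:
#   name: daily_exports
#   schedule: "0 3 * * *"
#   steps: [{id: extract}, {id: load}]
for wf_file in Path("/opt/airflow/workflows").glob("*.yaml"):
    spec = yaml.safe_load(wf_file.read_text())
    with DAG(
        dag_id=spec["name"],
        schedule=spec.get("schedule"),
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        prev = None
        for step in spec["steps"]:
            op = EmptyOperator(task_id=step["id"])  # real steps map to richer operators
            if prev is not None:
                prev >> op
            prev = op
    globals()[spec["name"]] = dag  # expose each generated DAG to the parser
```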

Hybrid Executors: Have Your Cake and Eat it Too

2024-07-01
session
Niko Oliveira (Amazon | Apache Airflow Committer)

Executors are a core concept in Apache Airflow and an essential piece of DAG execution. They continue to see investment and innovation, including a new feature launching this year: hybrid execution. This talk will give a brief overview of executors, how they work, and what they are responsible for, followed by a description of Hybrid Executors (AIP-61), a new feature that allows multiple executors to be used natively and seamlessly side by side within a single Airflow environment. We’ll deep dive into how this feature works, how users can make use of it, compare it to what was available before, and finally see a demo of it in action. Don’t miss this chance to learn about the cutting-edge capabilities of executors in Apache Airflow!
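
Based on the AIP-61 design as shipped in Airflow 2.10, a hedged sketch of what side-by-side executors look like: the environment declares several executors, and individual tasks opt into one.

```python
# Assumes airflow.cfg declares both executors, e.g.:
#   [core]
#   executor = LocalExecutor,CeleryExecutor
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("hybrid_executor_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False):
    quick = BashOperator(
        task_id="quick_check",
        bash_command="echo fast",
        executor="LocalExecutor",   # small task runs locally
    )
    heavy = BashOperator(
        task_id="big_job",
        bash_command="echo heavy",
        executor="CeleryExecutor",  # big task fans out to the worker fleet
    )
    quick >> heavy
```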

Integrating dbt with Airflow: Overcoming performance hurdles

2024-07-01
session

The integration between dbt and Airflow is a popular topic in the community, in previous editions of the Airflow Summit, at Coalesce, and in the #airflow-dbt Slack channel. Astronomer Cosmos ( https://github.com/astronomer/astronomer-cosmos/ ) stands out as one of the libraries that strives to enhance this integration, with over 300k downloads per month. During its development, we’ve encountered various performance challenges in terms of scheduling and task execution. While we’ve managed to address some, others remain to be resolved. This talk describes how Cosmos works, the improvements made over the last 1.5 years, and the roadmap. It also aims to collect feedback from the community on how we can further improve the experience of running dbt in Airflow.
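
For context, this is roughly what the Cosmos entry point looks like: a DbtDag renders a dbt project into Airflow tasks. Paths and profile names below are placeholders; check the astronomer-cosmos docs for the exact API of your version.

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

jaffle_shop = DbtDag(
    dag_id="jaffle_shop",
    project_config=ProjectConfig("/opt/airflow/dbt/jaffle_shop"),  # placeholder path
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="dev",
        profiles_yml_filepath="/opt/airflow/dbt/profiles.yml",     # placeholder path
    ),
    # Each dbt model becomes an Airflow task (plus optional tests).
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```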

Investigating the Many Loops of the Airflow Scheduler

2024-07-01
session

The scheduler is unarguably the most important component of an Airflow cluster. It is also the most complex and misunderstood by practitioners and administrators alike. In this talk, we will follow the path that a task instance takes to progress from creation to execution, and discuss the various configuration settings allowing users to tune the scheduler and executor to suit their workload patterns. Finally, we will dive deep into critical sections of the Airflow codebase and explore opportunities for optimization.

Lessons from the Ecosystem: What can Airflow Learn from Other Open-source Communities?

2024-07-01
session

The Apache Airflow community is so large and active that it’s tempting to take the view that “if it ain’t broke don’t fix it.” In a community as in a codebase, however, improvement and attention are essential to sustaining growth. And bugs are just as inevitable in community management as they are in software development. If only the fixes were, too! Airflow is large and growing because users love Airflow and our community. But what steps could be taken to enhance the typical user’s and developer’s experience of the community? This talk will provide an overview of potential learnings for Airflow community management efforts, such as project governance and analytics, derived from the speaker’s experience managing the OpenLineage and Marquez open-source communities. The talk will answer questions such as: What can we learn from other open-source communities when it comes to supporting users and developers and learning from them? For example, what options exist for getting historical data out of Slack despite the limitations of the free tier? What tools can be used to make adoption metrics more reliable? What are some effective supplements to asynchronous governance?

LinkedIn's Continuous Deployment

2024-07-01
session
Keshav Tyagi (LinkedIn), Rahul Gade

LinkedIn Continuous Deployment (LCD) started with the goal of improving the deployment experience and expanding its outreach to all LinkedIn systems. LCD delivers a modern deployment UX and easy-to-customize pipelines, enabling all LinkedIn applications to declare their deployment pipelines. LCD’s vision is to automate cluster provisioning and deployments and enable touchless (continuous) deployments while reducing the manual toil involved. LCD is powered by Airflow to orchestrate its deployment pipelines and automate the validation steps. For our customers, Airflow is an implementation detail that we have abstracted away with our no-code/low-code pipelines. Users describe their pipeline intent (via CLI/UI) and LCD translates that intent into Airflow DAGs. LCD pipelines are built of steps. In order to democratize the adoption of LCD, we have leveraged K8sPodOperator to run steps inside the pipeline. LCD partner teams expose validation actions as containers, which the LCD pipeline runs as steps. At full scale, LCD will have 10K+ DAGs running in parallel.
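
A hedged sketch of that step pattern, assuming the KubernetesPodOperator from the cncf.kubernetes provider (the image name and arguments are invented for illustration, not LCD's actual containers):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG("lcd_style_pipeline", start_date=datetime(2024, 1, 1), schedule=None, catchup=False):
    # Each pipeline step runs a partner-owned validation container.
    validate = KubernetesPodOperator(
        task_id="canary_validation",
        name="canary-validation",
        image="registry.example.com/deploy-checks:latest",  # hypothetical image
        arguments=["--service", "example-api"],             # hypothetical flags
        get_logs=True,
    )
```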

LLMs for Software Development & Apache Airflow

2024-07-01
session
Danny Tarlow (Google DeepMind)

Artificial Intelligence is reshaping the landscape of software development. In this talk, we’ll explore the latest AI breakthroughs improving LLM capabilities for software development use cases. We’ll discuss work and ideas in the field related to Airflow, particularly around model capabilities related to Python, DSLs, and low-resource languages.

Managing version upgrades without feelings of terror

2024-07-01
session

Airflow version upgrades can be challenging. Maybe you upgrade and your DAGs fail to parse (that’s an easy fix). Or maybe you upgrade and everything looks fine, but when your DAG runs, you can no longer connect to MySQL because the TLS version changed. In this talk I will provide concrete strategies that users can put into practice to make version upgrades safer and less painful. Topics may include:
- What semver means and what it implies for the upgrade process
- Using integration test DAGs, unit tests, and a test cluster to smoke out problems (a minimal example follows below)
- Strategies around constraints files / pinning, and managing provider vs. core versions
- Using db clean prior to upgrade to reduce table size
- Rollback strategies
- What to do about warnings (e.g. deprecation warnings)
I’ll also focus on keeping it simple. Sometimes things like “integration tests” and “CI” can be scary for people. Even without having set up anything automated, there are still things you can do to make management of upgrades a little less painful and risky.
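
One low-effort smoke test of the kind alluded to above (a generic sketch, not the speaker's code): run the candidate Airflow version against your DAG folder and fail on any import errors.

```python
# Run under the candidate Airflow version, e.g. via pytest.
from airflow.models import DagBag


def test_dags_import_cleanly():
    dagbag = DagBag(dag_folder="dags/", include_examples=False)
    # Any DAG that no longer parses on the new version shows up here.
    assert not dagbag.import_errors, f"Broken DAGs: {dagbag.import_errors}"
```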

Mastering Advanced Dataset Scheduling in Apache Airflow

2024-07-01
session

Are you looking to harness the full potential of data-driven pipelines with Apache Airflow? This session will dive into the newly introduced conditional expressions for advanced dataset scheduling in Airflow, a feature highly requested by the Airflow community. Attendees will learn how to effectively use logical operators to create complex dependencies that trigger DAGs based on dataset updates in real-world scenarios. We’ll also explore the innovative DatasetOrTimeSchedule, which combines time-based and dataset-triggered scheduling for unparalleled flexibility. Furthermore, attendees will discover the latest API endpoints that facilitate external updates and resets of dataset events, streamlining workflow management across different deployments. This talk also aims to explain:
- The basics of using conditional expressions for dataset scheduling
- How to integrate time-based schedules with dataset triggers
- Practical applications of the new API endpoints for enhanced dataset management
- Real-world examples of how these features can optimize your data workflows
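
A short sketch of both features as they appear in Airflow 2.9 (the dataset URIs are placeholders): logical operators combine datasets, and DatasetOrTimeSchedule adds a time-based fallback.

```python
from datetime import datetime

from airflow import DAG, Dataset
from airflow.operators.empty import EmptyOperator
from airflow.timetables.datasets import DatasetOrTimeSchedule
from airflow.timetables.trigger import CronTriggerTimetable

orders = Dataset("s3://lake/orders")  # placeholder URI
users = Dataset("s3://lake/users")    # placeholder URI

with DAG(
    dag_id="dataset_logic_demo",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    # Run when BOTH datasets are updated, OR on the nightly cron.
    schedule=DatasetOrTimeSchedule(
        timetable=CronTriggerTimetable("0 2 * * *", timezone="UTC"),
        datasets=(orders & users),
    ),
):
    EmptyOperator(task_id="refresh_report")
```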

OpenLineage: From Operators to Hooks

2024-07-01
session
Maciej Obuchowski (Datadog)

“More data lineage” has been the second most popular feature request in the Airflow Survey 2023. However, despite the integration of OpenLineage in Airflow 2.7 through AIP-53, the most popular operator in Airflow, the PythonOperator, isn’t covered by lineage support. With the addition of the TaskFlow API, Airflow Datasets, Airflow ObjectStore, and many other small changes, writing DAGs without using other operators is easier than ever. That’s why lineage collection in Airflow is moving beyond covering specific operators, to covering hooks and object storage. In this session, you’ll learn how the newly added AIP-62 will allow you to author DAGs the way you love, while keeping the benefits of a data pipeline well covered by lineage.
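
This is the style of DAG that hook-level lineage targets: plain Python plus Airflow's ObjectStoragePath, with no specialized transfer operator in sight. A hedged sketch; the URIs and connection ID are placeholders.

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.io.path import ObjectStoragePath

SRC = ObjectStoragePath("s3://raw/events.csv", conn_id="aws_default")
DST = ObjectStoragePath("s3://clean/events.csv", conn_id="aws_default")


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def copy_events():
    @task
    def copy():
        # I/O goes through Airflow's storage abstraction, which is the
        # layer that hook-level lineage collection can instrument.
        DST.write_bytes(SRC.read_bytes())

    copy()


copy_events()
```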

Optimize Your DAGs: Embrace Dag Params for Efficiency and Simplicity

2024-07-01
session

In the realm of data engineering, there is a prevalent tendency for professionals to develop similar Directed Acyclic Graphs (DAGs) to manage analogous tasks. Leveraging Dag Params presents an effective strategy for mitigating redundancy within these DAGs. Moreover, Dag Params make it straightforward to enforce and validate user inputs, streamlining the process of incorporating validations into the DAG codebase.
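
A minimal sketch of that idea (the table and batch-size params are invented for illustration): one parameterized DAG stands in for several near-duplicates, and the Param schema rejects bad inputs up front.

```python
from datetime import datetime

from airflow import DAG
from airflow.models.param import Param
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="load_table",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    params={
        # JSON-schema-style validation is enforced when a run is triggered.
        "table": Param("orders", type="string", minLength=1),
        "batch_size": Param(1000, type="integer", minimum=1, maximum=100000),
    },
):
    BashOperator(
        task_id="load",
        bash_command="echo loading {{ params.table }} in batches of {{ params.batch_size }}",
    )
```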

Optimizing Airflow Performance: Strategies, Techniques, and Best Practices

2024-07-01
session

Airflow, an open-source platform for orchestrating complex data workflows, is widely adopted for its flexibility and scalability. However, as workflows grow in complexity and scale, optimizing Airflow performance becomes crucial for efficient execution and resource utilization. This session delves into the importance of optimizing Airflow performance and provides strategies, techniques, and best practices to enhance workflow execution speed, reduce resource consumption, and improve system efficiency. Attendees will gain insights into identifying performance bottlenecks, fine-tuning workflow configurations, leveraging advanced features, and implementing optimization strategies to maximize pipeline throughput. Whether you’re a seasoned Airflow user or just getting started, this session equips you with the knowledge and tools needed to optimize your Airflow deployments for optimal performance and scalability. We’ll also explore topics such as DAG writing best practices, monitoring and updating Airflow configurations, and database performance optimization, covering unused indexes, missing indexes, and minimizing table and index bloat.

Optimizing Critical Operations: Enhancing Robinhood's Workflow Journey with Airflow

2024-07-01
session

Airflow is widely used within Robinhood. In addition to traditional offline analytics use cases (to schedule ingestion and analytics workloads that populate our data lake), we also use Airflow in our backend services to orchestrate various workflows that are highly critical for the business, e.g., compliance and regulatory reporting, user-facing reports, and more. As part of this, we have evolved what we believe is a unique deployment architecture for Airflow. We have central schedulers that are responsible for workloads from multiple different teams, but the workflow tasks themselves run on workers owned by the respective teams, tightly coupled with their backend services and codebases. Furthermore, Robinhood augmented Airflow with a number of customizations: an Airflow worker template for Kubernetes, enhanced observability, enhanced SLA detection, and a collection of operators, sensors, and plugins to tailor Airflow to our exact needs. This session will walk through how we grew our architecture and adapted Airflow to fit Robinhood’s variety of needs and use cases.

Orchestrating & Optimizing a Batch Ingestion Data Platform for America's #1 Sportsbook

2024-07-01
session

FanDuel Group, an industry leader in sports-tech entertainment, is proud to be recognized as the #1 sports betting company in the US as of 2023, with 53.4% market share. With a workforce exceeding 4,000 employees, including over 100 data engineers, FanDuel Group is at the forefront of innovation in batch processing orchestration platforms. Currently, our platform handles over 250,000 DAG runs & executes ~3 million tasks monthly across 17 deployments. It provides a standardized framework for pipeline development, structured observability, monitoring, & alerting. It also offers automated data processing managed by an in-house team, enabling stakeholders to concentrate on core business objectives. Our batch ingestion platform is the backbone of endless use cases, facilitating the landing of data into storage at scheduled intervals, real-time ingestion of micro batches triggered by events, standardization processes, & ensuring data availability for downstream applications. Our proposed session also delves into our forward-looking tech strategy, as well as our plans to expand orchestration diversity by integrating scheduled jobs from various domains into our robust data platform.

Orchestration of ML workloads via Airflow & GKE Batch

2024-07-01
session

During this talk we are going to give an overview of different orchestration approaches (Kubeflow, Ray, Airflow, etc.) for running ML workloads on Kubernetes, and specifically we will focus on how to use the Kubernetes Batch API and Kubernetes operators to run complex ML workloads.

Overcoming Custom Python Package Hurdles in Airflow

2024-07-01
session

DAG authors, while constructing DAGs, generally use native libraries provided by Airflow in conjunction with Python libraries available in public PyPI repositories. But sometimes, DAG authors need to construct DAGs using libraries that are either in-house or not available in public PyPI repositories. This poses a serious challenge for users who want to run their custom code with Airflow DAGs, particularly when Airflow is deployed in a cloud-native fashion. Traditionally, these packages are baked into Airflow Docker images. This won’t work post-deployment and is super impractical if your library is under development. We propose a solution that creates a dedicated Airflow global Python environment, dynamically generates the requirements, establishes a version-compatible pyenv adhering to Airflow’s policies, and manages custom pip repository authentication seamlessly. Importantly, the service executes these steps in a fail-safe manner, without compromising core components. Join us as we discuss the solution to this common problem, touching upon the design and seeing the solution in action. We also candidly discuss some challenges and the shortcomings of the proposed solution.
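
One common workaround in this space (not necessarily the speakers' solution) is to install the in-house package into a per-task virtualenv from a private index; a sketch with @task.virtualenv, where the package name and index URL are invented and the index_urls parameter requires a recent Airflow:

```python
from airflow.decorators import task


@task.virtualenv(
    requirements=["my-inhouse-lib==1.2.3"],               # hypothetical package
    index_urls=["https://pypi.internal.example/simple"],  # hypothetical private index
    system_site_packages=False,
)
def use_inhouse_lib():
    # Imports resolve inside the freshly built virtualenv, not the worker env.
    import my_inhouse_lib

    my_inhouse_lib.run()
```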

Product Management perspective on Data Observability with Databand

2024-07-01
session

In this session, Steve Sawyer will discuss a case study of how IBM Data Observability with Databand collects metadata to build historical baselines, detect anomalies, and triage alerts to remediate data quality issues for your data pipelines and warehouses. Additionally, he will provide a product perspective on the technologies IBM is building to meet data observability needs across the enterprise, and how they relate to our investments in AI and Data Fabric.

Profiling Airflow tasks with Memray

2024-07-01
session

Profiling Airflow tasks can be difficult, especially in remote environments. In this talk I will demonstrate how we can leverage the capabilities of Airflow’s plugin mechanism to selectively run Airflow tasks within the context of a profiler and, with the help of operator links and custom views, make the results available to the user. The content of this talk can provide inspiration for how Airflow may in the future allow the gathering of custom task metrics and make those metrics easily accessible.
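
Stripped of the plugin and operator-link wiring from the talk, the core idea can be sketched in a few lines: run the task's body under a memray Tracker and write the capture to a file you can inspect later (the output path is arbitrary).

```python
import memray

from airflow.decorators import task


@task
def heavy_transform():
    # Everything inside the Tracker context is recorded to the capture
    # file, which `memray flamegraph` can turn into an HTML report.
    with memray.Tracker("/tmp/heavy_transform.memray.bin"):
        data = [b"x" * 1024 for _ in range(10_000)]  # stand-in workload
        return len(data)
```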

Refactoring DAGs: From Duplication to Delightful Efficiency with a Centralized Library

2024-07-01
session

Feeling trapped in a maze of duplicate Airflow DAG code? We were too! That’s why we embarked on a journey to build a centralized library, eliminating redundancy and unlocking delightful efficiency. Join us as we share:
- The struggles of managing repetitive code across DAGs
- Our approach to a centralized library, revealing design and implementation strategies
- The amazing results: reduced development time, clean code, effortless maintenance, and a framework that creates efficient and self-documenting DAGs
Let’s break free from complexity and duplication, and build a brighter Airflow future together!
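
The shape of such a library can be as simple as a shared DAG factory; here is a generic sketch (the names and tasks are illustrative, not the speakers' design):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def build_ingestion_dag(source: str, schedule: str) -> DAG:
    """Shared factory: every team gets the same vetted structure and defaults."""
    with DAG(
        dag_id=f"ingest_{source}",
        schedule=schedule,
        start_date=datetime(2024, 1, 1),
        catchup=False,
        tags=["ingestion", source],
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command=f"echo pull {source}")
        load = BashOperator(task_id="load", bash_command=f"echo load {source}")
        extract >> load
    return dag


# A team's DAG file shrinks to a single call:
dag = build_ingestion_dag("orders", "@hourly")
```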

Running Airflow Tasks Anywhere, in any Language

2024-07-01
session

Imagine a world where writing Airflow tasks in languages like Go, R, Julia, or maybe even Rust is not just a dream but a native capability. Say goodbye to BashOperators; welcome to the future of Airflow task execution. Here’s what you can expect to learn from this session:
- Multilingual tasks: explore how we empower DAG authors to write tasks in any language while retaining seamless access to Airflow Variables and Connections
- Simplified development and testing: discover how a standardized interface for task execution promises to streamline development efforts and elevate code maintainability
- Enhanced scalability and remote workers: learn how enabling tasks to run on remote workers opens up possibilities for seamless deployment on diverse platforms, including Windows and remote Spark or Ray clusters; experience the convenience of effortless deployments as we unlock new avenues for Airflow usage
Join us as we embark on an exploratory journey to shape the future of Airflow task execution. Your insights and contributions are invaluable as we refine this vision together. Let’s chart a course towards a more versatile, efficient, and accessible Airflow ecosystem.

Scalable Development of Event Driven Airflow DAGs

2024-07-01
session

This use case shows how we deal with data of different varieties from different sources. Each source sends data in different layouts, timings, structures, location patterns, and sizes. The goal is to process the files within SLA and send them out. This is a complex multi-step processing pipeline that involves multiple Spark jobs, API-based integrations with microservices, resolving unique IDs, deduplication, and filtering. Note that this is an event-driven system, but not a streaming data system. The files are of gigabyte scale, and each day the data being processed is of terabyte scale. We will be talking about how to make DAG creation and business logic building a “low-code no-code” process, so that non-technical analysts can write business logic and light developers can deploy DAGs without much manual effort. Every aspect is either source-specific or source-agnostic configuration driven. Airflow was chosen to enable easy DAG building, scaling, monitoring, troubleshooting, and rerunning.

Scale and Security: How Autodesk Securely Develops and Tests PII Pipelines with Airflow

2024-07-01
session

In today’s data-driven era, ensuring data reliability and enhancing our testing and development capabilities are paramount. Local unit testing has its merits but falls short when dealing with the volume of big data. One major challenge is running Spark jobs pre-deployment to ensure they produce expected results and handle production-level data volumes. In this talk, we will discuss how Autodesk leveraged Astronomer to improve pipeline development. We’ll explore how it addresses challenges with sensitive and large data sets that cannot be transferred to local machines or non-production environments. Additionally, we’ll cover how this approach supports over 10 engineers working simultaneously on different feature branches within the same repo. We will highlight the benefits, such as conflict-free development and testing, and eliminating concerns about data corruption when running DAGs on production Airflow servers. Join me to discover how solutions like Astronomer empower developers to work with increased efficiency and reliability. This talk is perfect for those interested in big data, cloud solutions, and innovative development practices.