
Event

Airflow Summit 2025

2025-07-01 · Airflow Summit

Activities tracked

140

Airflow Summit 2025 program

Sessions & talks

Showing 126–140 of 140 · Newest first


Simplifying Data Management with DAG Factory

2025-07-01
session

At OLX, we connect millions of people daily through our online marketplace while relying on robust data pipelines. In this talk, we explore how the DAG Factory concept elevates data governance, lineage, and discovery by centralizing operator logic and restricting direct DAG creation. This approach enforces code quality, optimizes resources, maintains infrastructure hygiene, and enables smooth version upgrades. We then leverage consistent naming conventions in Airflow to build targeted namespaces, aligning teams with global policies while preserving autonomy. Integrating external tools like AWS Lake Formation and OpenMetadata further unifies governance, making it straightforward to manage and secure data. This is critical when handling hundreds or even thousands of active DAGs. If the idea of storing 1,600 pipelines in one folder seems overwhelming, join us to learn how the DAG Factory concept simplifies pipeline management. We’ll also share insights from OLX, highlighting how thoughtful design fosters oversight, efficiency, and discoverability across diverse use cases.
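
To make the pattern concrete, here is a minimal, hypothetical sketch of a config-driven DAG Factory (not OLX’s actual implementation): pipelines are declared as data, a single factory builds every DAG, and naming conventions are enforced centrally. The PIPELINES spec and its fields are illustrative, and the BashOperator import path varies between Airflow 2 and 3.

```python
# Illustrative sketch of a config-driven DAG factory; not OLX's code.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # moved to providers.standard in Airflow 3

# Hypothetical pipeline declarations; in practice these might live in YAML/JSON.
PIPELINES = [
    {"team": "ads", "name": "clicks_daily", "schedule": "@daily",
     "tasks": [{"id": "extract", "cmd": "echo extract"},
               {"id": "load", "cmd": "echo load"}]},
]


def build_dag(spec: dict) -> DAG:
    """Build one DAG from a declarative spec, enforcing naming conventions."""
    dag_id = f"{spec['team']}__{spec['name']}"  # team namespace baked into the dag_id
    with DAG(dag_id=dag_id, schedule=spec["schedule"],
             start_date=datetime(2025, 1, 1), catchup=False,
             tags=[spec["team"]]) as dag:
        prev = None
        for t in spec["tasks"]:
            op = BashOperator(task_id=t["id"], bash_command=t["cmd"])
            if prev is not None:
                prev >> op
            prev = op
    return dag


# Register every generated DAG in module globals so the Airflow parser finds it.
for spec in PIPELINES:
    _dag = build_dag(spec)
    globals()[_dag.dag_id] = _dag
```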

Task failures troubleshooting based on Airflow & Kubernetes signals

2025-07-01
session

Per the Airflow community survey, Kubernetes is the most popular compute platform for running Airflow, and when run on Kubernetes, Airflow gains lots of benefits out of the box: monitoring, reliability, ease of deployment, scalability, and autoscaling. On the other hand, running Airflow on Kubernetes means running a sophisticated distributed system on top of another distributed system, which makes troubleshooting Airflow task and DAG failures harder. This session tackles that bottleneck head-on, introducing a practical approach to building an automated diagnostic pipeline for Airflow on Kubernetes. Imagine offloading tedious investigations to a system that, on task failure, automatically collects and correlates key signals from Kubernetes components (linking Airflow tasks to specific Pods and their events), Kubernetes/GKE monitoring, and relevant logs—pinpointing root causes and suggesting actionable fixes. Attendees will leave with a clear understanding of common Airflow-on-Kubernetes failure patterns—and more importantly, a blueprint and practical strategies to reduce MTTR and boost team efficiency.
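
As a concrete starting point for that kind of automation, here is a hedged sketch (not the speakers’ pipeline) of an Airflow on_failure_callback that correlates the failed task with its Kubernetes Pod and prints the Pod’s recent events, using the official kubernetes Python client. The "airflow" namespace and the dag_id/task_id Pod labels are assumptions that depend on your executor version and deployment.

```python
# Sketch: on task failure, fetch Kubernetes events for the task's Pod.
from kubernetes import client, config


def collect_k8s_signals(context):
    """Airflow on_failure_callback: dump events for the failed task's Pod."""
    ti = context["task_instance"]
    config.load_incluster_config()  # assumes Airflow itself runs in-cluster
    v1 = client.CoreV1Api()
    # KubernetesExecutor/KPO Pods are commonly labeled with dag_id/task_id,
    # but the exact labels depend on executor version and settings.
    selector = f"dag_id={ti.dag_id},task_id={ti.task_id}"
    pods = v1.list_namespaced_pod("airflow", label_selector=selector)
    for pod in pods.items:
        events = v1.list_namespaced_event(
            "airflow", field_selector=f"involvedObject.name={pod.metadata.name}"
        )
        for ev in events.items:
            print(f"{pod.metadata.name}: {ev.reason} - {ev.message}")
```

Wiring it up is then a one-liner, e.g. default_args={"on_failure_callback": collect_k8s_signals} on the DAGs you want covered.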

The Secret to Airflow's Evergreen Build: CI/CD magic

2025-07-01
session

Have you ever wondered why Apache Airflow builds are asymptotically(*) green? That strive for a “perennial green build” is not magic, it’s the result of continuous, often unseen engineering effort within our CI/CD pipelines & dev environments. This dedication ensures that maintainers can work efficiently & contributors can onboard smoothly. To keep up with the ever-growing contributor base, we have a CI/CD team run by volunteers putting significant work into the foundational tooling. In this talk, we reveal some of the innovative solutions we have implemented, like:

- Handling GitHub Actions pull_request_target challenges
- Restructuring the repo for better clarity
- A Slack bot for CI failure alerts
- A cherry-picker workflow for releases
- Pre-commit hooks
- Faster website and image builds
- Tackling the new GitHub API rate limits
- Solving chicken-and-egg build issues during releases

Join us to understand the “why” & “how” behind these infra components. You’ll gain insights into the continuous effort required to support a thriving open-source project like Airflow and, hopefully, be inspired to contribute to these areas.

(*) asymptotically = we fix failures as quickly as we can when they happen

Transforming Data Engineering: Achieving Efficiency and Ease with an Intuitive Orchestration Solution

2025-07-01
session

In the rapidly evolving field of data engineering and data science, efficiency and ease of use are crucial. Our innovative solution offers a user-friendly interface to manage and schedule custom PySpark, PySQL, Python, and SQL code, streamlining the process from development to production. Using Airflow as the backend, this tool eliminates the complexities of infrastructure management, version control, CI/CD processes, and workflow orchestration. The intuitive UI allows users to upload code, configure job parameters, and set schedules effortlessly, without the need for additional scripting or coding. Additionally, users have the flexibility to bring their own custom artifactory solution and run their code. In summary, our solution significantly enhances the orchestration and scheduling of custom code, breaking down traditional barriers and empowering organizations to maximize their data’s potential and drive innovation efficiently. Whether you are an individual data scientist or part of a large data engineering team, this tool provides the resources needed to streamline your workflow and achieve your goals faster than ever before.

UI Office Hours

2025-07-01
session

Join this live demo of the Airflow UI while we answer your questions and discuss your ideas on how to improve the experience.

Unleash Airflow's Potential with hands-on Performance Optimization

2025-07-01
session

This interactive workshop session empowers you to unlock the full potential of Apache Airflow through performance optimization techniques. Gain hands-on experience identifying performance bottlenecks and implementing best practices to overcome them.
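
For a flavor of the knobs such a session typically covers, here is a generic illustration (not the workshop’s material) of two DAG-level concurrency settings that often clear bottlenecks; scheduler-wide limits such as AIRFLOW__CORE__PARALLELISM are tuned separately in the environment.

```python
# Generic illustration of DAG-level performance knobs.
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="tuned_example",          # hypothetical DAG
    start_date=datetime(2025, 1, 1),
    schedule="@hourly",
    catchup=False,
    max_active_runs=1,   # prevent overlapping runs from competing for slots
    max_active_tasks=8,  # cap this DAG's concurrent task instances
):
    ...
```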

Unlocking Event-Driven Scheduling in Airflow 3: A New Era of Reactive Data Pipelines

2025-07-01
session

Airflow 3 introduces a major evolution in orchestration: native support for external event-driven scheduling. In this talk, I’ll share the journey behind AIP-82—why we needed it, how we built it, and what it unlocks. I’ll dive into how the new AssetWatcher enables pipelines to respond immediately to events like file arrivals, API calls, or pub/sub messages. You’ll see how this drastically reduces latency and infrastructure overhead while improving reactivity and resource efficiency. We’ll explore how it works under the hood, real-world use cases, best practices, and migration tips for teams ready to shift from time-based to event-driven workflows. If you’re looking to make your Airflow DAGs more dynamic, this is the talk that shows you how. Whether you’re an operator or contributor, you’ll walk away with a deep understanding of one of Airflow 3’s most impactful features.
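
A hedged sketch of what this looks like in code, assuming the Asset/AssetWatcher API introduced with AIP-82 and the common messaging provider’s MessageQueueTrigger (exact import paths may shift across 3.x and provider releases); the queue URL is a placeholder.

```python
# Sketch of AIP-82 event-driven scheduling in Airflow 3.
from airflow.providers.common.messaging.triggers.msg_queue import MessageQueueTrigger
from airflow.sdk import DAG, Asset, AssetWatcher, task

# A watcher ties an external event source (here, a message queue) to an asset.
trigger = MessageQueueTrigger(queue="https://sqs.us-east-1.amazonaws.com/0123456789/my-queue")
file_events = Asset(
    "incoming_files",
    watchers=[AssetWatcher(name="queue_watcher", trigger=trigger)],
)

# Scheduling on the asset runs the DAG as soon as an event arrives,
# instead of polling on a time-based schedule.
with DAG(dag_id="react_to_files", schedule=[file_events]):

    @task
    def process_new_file():
        print("event received; start processing")

    process_new_file()
```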

Using Airflow for Real-Time Data Processing at Scale: Architecture, Challenges & Wins

2025-07-01
session

Airflow is a powerhouse for batch data pipelines—but can it be tuned for real-time workloads? In this session, we’ll share how we adapted Apache Airflow to orchestrate near-real-time data processing at scale. From leveraging event-driven triggers and external APIs to minimizing latency with smart DAG design, we’ll dive into real-world architectural patterns, challenges, and optimizations that helped us handle time-sensitive data workflows with confidence. This talk is ideal for teams seeking to expand beyond batch and explore hybrid or real-time orchestration using Airflow.
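
One common building block for this kind of setup (a generic illustration, not the speakers’ architecture) is a deferrable sensor on a tight schedule: while waiting, the sensor runs on the triggerer rather than occupying a worker slot, which keeps latency low without burning capacity. The bucket and key pattern are placeholders, and deferrable=True requires a recent Amazon provider plus a running triggerer.

```python
# Sketch: near-real-time ingestion with a deferrable sensor.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="near_real_time_ingest",
    start_date=datetime(2025, 1, 1),
    schedule="*/5 * * * *",  # tight schedule; pair with max_active_runs=1
    max_active_runs=1,
    catchup=False,
):
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_key="s3://my-bucket/incoming/*.json",  # placeholder path
        wildcard_match=True,
        deferrable=True,  # waits on the triggerer, not a worker slot
    )
```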

Using Apache Airflow with Trino for (almost) all your data problems

2025-07-01
session

Trino is incredibly effective at enabling users to extract insights quickly and effectively from large amounts of data located in dispersed and heterogeneous federated data systems. However, some business data problems are more complex than interactive analytics use cases, and are best broken down into a sequence of interdependent steps, a.k.a. a workflow. For these use cases, dedicated software is often required to schedule and manage these processes with a principled approach. In this session, we will look at how we can leverage Apache Airflow to orchestrate Trino queries into complex workflows that solve practical batch processing problems, all while avoiding repetitive, redundant data movement.
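
A minimal sketch of the pattern, assuming a Trino connection registered in Airflow as trino_default and illustrative catalog/schema/table names: dependent Trino queries are chained as Airflow tasks, so the computation stays in Trino and no data moves through the orchestrator.

```python
# Sketch: chaining Trino queries into a batch workflow with Airflow.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="trino_batch_workflow",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
):
    # Federated read: Trino pulls from a Postgres catalog into the lake.
    stage = SQLExecuteQueryOperator(
        task_id="stage_orders",
        conn_id="trino_default",  # assumed connection to a Trino coordinator
        sql="""
            CREATE TABLE IF NOT EXISTS lake.staging.orders AS
            SELECT * FROM postgres.public.orders
            WHERE order_date = DATE '{{ ds }}'
        """,
    )
    aggregate = SQLExecuteQueryOperator(
        task_id="aggregate_orders",
        conn_id="trino_default",
        sql="""
            INSERT INTO lake.marts.daily_orders
            SELECT order_date, count(*) FROM lake.staging.orders GROUP BY 1
        """,
    )
    stage >> aggregate
```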

Vayu: The Airflow Copilot

2025-07-01
session

Vayu is a conversational copilot for Apache Airflow, developed at Prevalent AI to help data engineers manage, troubleshoot, and fix pipelines using natural language. Deployments often fail silently due to misconfigurations, missing connections, or runtime issues that are impossible to identify in unit tests. Vayu tackles these via a troubleshooting agent that inspects logs, metrics, configs, and runtime state to find root causes and suggest fixes, saving engineers significant troubleshooting time. It can also apply approved fixes to DAG code and commit them to your version control system.

Key capabilities:
- Troubleshooting Agent: inspects logs, configs, variables, and connections to find root causes and suggest fixes.
- Pipeline Mechanic Agent: suggests code-level fixes (e.g., missing connections or bad imports) and, once approved, commits them to version control.
- DAG Manager Agent: understands DAG logic, suggests improvements, and can trigger DAGs conversationally.

Architecture: built with open-source tools, including Google ADK as the orchestration layer and a custom Airflow MCP server based on the FastMCP framework. LLMs never access Airflow directly. The full codebase will be open-sourced.
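
To illustrate the mediation pattern (a toy sketch, not Vayu’s code): a FastMCP server exposes Airflow’s stable REST API as tools, so the LLM only ever talks to the MCP server, never to Airflow. The webserver URL, credentials, and tool shape below are placeholder assumptions.

```python
# Toy MCP server mediating between an LLM and the Airflow REST API.
import requests
from fastmcp import FastMCP

AIRFLOW_API = "http://localhost:8080/api/v1"  # assumed webserver URL
mcp = FastMCP("airflow-copilot")


@mcp.tool()
def recent_dag_runs(dag_id: str, limit: int = 5) -> list[dict]:
    """Return the latest runs for a DAG so an agent can inspect failures."""
    resp = requests.get(
        f"{AIRFLOW_API}/dags/{dag_id}/dagRuns",
        params={"limit": limit, "order_by": "-execution_date"},
        auth=("admin", "admin"),  # placeholder credentials
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["dag_runs"]


if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```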

Why AWS chose Apache Airflow to power workflows for the next generation of Amazon SageMaker

2025-07-01
session

On March 13th, 2025, Amazon Web Services announced General Availability of Amazon SageMaker Unified Studio, bringing together AWS machine learning and analytics capabilities. At the heart of this next generation of Amazon SageMaker sits Apache Airflow. All SageMaker Unified Studio users have a personal, open-source Airflow deployment running alongside their Jupyter notebook, enabling those users to easily develop Airflow DAGs that have unified access to all of their data. In this talk, I will go into detail about the motivations for choosing Airflow for this capability, the challenges of incorporating Airflow into such a large and diverse experience, the key role that open source plays, how we’re leveraging GenAI to make that open-source development experience better, and the goals for the future of Airflow in SageMaker Unified Studio. Attendees will leave with a better understanding of the considerations they need to make when choosing Airflow as a component of their enterprise project, and a greater appreciation of how Airflow can power advanced capabilities.

Why Datadog Chose Airflow 3: Multi-Tenancy, Observability, and the Future of Event-Driven Workflows

2025-07-01
session

Datadog is a world-class data platform ingesting more than 100 trillion events a day, providing real-time insights. Before Airflow’s prominence, we built batch processing on Luigi, Spotify’s open-source orchestrator. As Airflow gained wide adoption, we evaluated adopting the major improvements of release 2.0, but opted to build our own orchestrator instead to realize our dataset-centric, event-driven vision. Meanwhile, the 3.0 release aligned Airflow, as a modern asset-driven orchestrator, with the very vision we had pursued internally, and showed how futile building our own was compared to the momentum of the community. We evaluated several orchestrators and decided to join forces with the Airflow project. This talk follows our journey from building a custom orchestrator to adopting and contributing to Airflow 3. We’ll share our thought process, our asset partitions use case, and how we’re working with the community to materialize the Data Awareness (AIP-73) vision. Partition-based incremental scheduling is core to our orchestration model, enabling scalable, observable pipelines, with Datadog’s Data Observability product providing visibility into pipeline health.

Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation La

2025-07-01
session

Data teams have a bad habit: reinventing the wheel. Despite the explosion of open-source tooling, best practices, and managed services, teams still find themselves building bespoke data platforms from scratch—often hitting the same roadblocks as those before them. Why does this keep happening, and more importantly, how can we break the cycle? In this talk, we’ll unpack the key reasons data teams default to building rather than adopting, from technical nuances to cultural and organizational dynamics. We’ll discuss why fragmentation in the modern data stack, the pressure to “own” infrastructure, and the allure of in-house solutions make this problem so persistent. Using real-world examples, we’ll explore strategies to help data teams focus on delivering business value rather than endlessly rebuilding foundational infrastructure. Whether you’re an engineer, a data leader, or an open-source contributor, this session will provide insights into navigating the build-vs-buy tradeoff more effectively.

Your first Apache Airflow Contribution

2025-07-01
session

Ready to contribute to Apache Airflow? In this hands-on workshop, you’ll be expected to come prepared with your development environment already configured (Breeze installed is strongly recommended, but Codespaces works if you can’t install Docker). We’ll dive straight into finding issues that match your skills and walk you through the entire contribution process—from creating your first pull request to receiving community feedback. Whether you’re writing code, enhancing documentation, or offering feedback, there’s a place for you. Let’s get started and see your name among Airflow contributors!

Your privacy or our progress: rethinking telemetry in Airflow

2025-07-01
session

We face a paradox: we could use usage data to build better software, but collecting that data seems to contradict the very principles of user freedom that open source represents. Apache Airflow’s telemetry system (already purged) has become a battleground for this conflict, with some users voicing privacy concerns while maintainers struggle to make informed decisions without data. What can we do to strike the right balance?