Airflow 3 introduced a game-changing feature: Dag versioning. Gone are the days of “latest only” Dags and confusing, inconsistent UI views when pipelines change mid-flight. This talk covers: • Visualizing Dag changes over time in the UI • How Dag code is versioned and can be fetched from external sources • Executing a whole Dag run against the same code version • Dynamic Dags? Where do they fit in?! You’ll see real-world scenarios, UI demos, and learn how these advancements will help avoid “Airflow amnesia”.
KP Division of Research uses Airflow as a central technology for integrating diverse technologies in an agile setting. We wish to present a set of use cases for AI/ML workloads, including imaging analysis (tissue segmentation, mammography), NLP (early identification of psychosis), LLM processing (identification of vessel diameter from radiological impressions), and other large data processing tasks. We create these “short-lived” project workflows to accomplish specific aims and may never run the job again, so leveraging generalized patterns is crucial to implementing these jobs quickly. Our Advanced Computational Infrastructure comprises multiple Kubernetes clusters, and we use Airflow to democratize the use of our batch-level resources in those clusters. We use Airflow form-based parameters to deploy pods running R and Python scripts, where generalized parameters are injected into scripts that follow internal programming patterns. Finally, we also leverage Airflow to create headless services inside Kubernetes for large computational workloads (Spark & H2O) that subsequent pods consume ephemerally.
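A minimal sketch of that pattern, assuming a recent cncf-kubernetes provider and placeholder image, script, and parameter names: the values entered in the trigger form are templated into the pod's arguments.

```python
from datetime import datetime
from airflow import DAG
from airflow.models.param import Param
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Hypothetical "short-lived" project workflow; image and script paths are placeholders.
with DAG(
    dag_id="research_r_job",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually via the parameter form
    params={
        "cohort_id": Param("default-cohort", type="string"),
        "output_path": Param("/data/results", type="string"),
    },
) as dag:
    run_r_script = KubernetesPodOperator(
        task_id="run_r_script",
        name="run-r-script",
        image="registry.example.com/research/r-runner:latest",  # placeholder image
        cmds=["Rscript", "/opt/scripts/analysis.R"],
        # Jinja templating injects the form values into the script arguments.
        arguments=["--cohort", "{{ params.cohort_id }}", "--out", "{{ params.output_path }}"],
        get_logs=True,
    )
```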
One of the exciting new features in Airflow 3 is internationalization (i18n), bringing multilingual support to the UI and making Airflow more accessible to users worldwide. This talk will highlight the UI changes made to support different languages, including locale-aware adjustments. We’ll discuss how translations are contributed and managed — including the use of LLMs to accelerate the process — and why human review remains an essential part of it. We’ll present the i18n policy designed to ensure long-term maintainability, along with the tooling developed to support it. Finally, you’ll learn how to get involved and contribute to Airflow’s global reach by translating or reviewing content in your language.
As Apache Airflow adoption accelerates for data pipeline orchestration, integrating it effectively into your enterprise’s Automation Center of Excellence (CoE) is crucial for maximizing ROI, ensuring governance, and standardizing best practices. This session explores common challenges faced when bringing specialized tools like Airflow into a broader CoE framework. We’ll demonstrate how leveraging enterprise automation platforms like Automic Automation can simplify this integration by providing centralized orchestration, standardized lifecycle management, and unified auditing for Airflow DAGs alongside other enterprise workloads. Furthermore, discover how Automation Analytics & Intelligence (AAI) can offer the CoE a single pane of glass for monitoring performance, tracking SLAs, and proving the business value of Airflow initiatives within the complete automation landscape. Learn practical strategies to ensure Airflow becomes a well-governed, high-performing component of your overall automation strategy.
This session will detail the Apache Airflow journey of Allegro, a leading e-commerce company in Poland. It will chart our evolution from a custom, on-premises Airflow-as-a-Service solution through a significant expansion to over 300 Cloud Composer instances in Google Cloud, culminating in Airflow becoming the core of our data processing. We orchestrate over 64,000 regular tasks spanning over 6,000 active DAGs on more than 200 Airflow instances, from feeding business-supporting dashboards to managing main data marts, handling ML pipelines, and more. We will share our practical experiences, lessons learned, and the strategies employed to manage and scale this critical infrastructure. Furthermore, we will introduce our innovative economy-of-share approach for providing ready-to-use Airflow environments, significantly enhancing both user productivity and cost efficiency.
The general-purpose nature of Airflow has always left us asking, “Is this the right way?” While existing resources and the community cover the basics, each new Airflow release leaves us wondering if there is more. This talk reveals how 3.0’s innovations redefine best practices for building production-ready data platforms. • Dag Development: future-proof your Dags without compromising on fundamentals • Modern Pipelines: how to best incorporate new Airflow features • Infrastructure: leveraging 3.0’s service-oriented architecture and the Edge Executor • Teams & Responsibilities: streamlined operations with the new split CLI and improved UI • Monitoring & Observability: building fail-proof pipelines
Red Hat’s unified data and AI platform relies on Apache Airflow for orchestration, alongside Snowflake, Fivetran, and Atlan. The platform prioritizes building a dependable data foundation, recognizing that effective AI depends on quality data. Airflow was selected for its predictability, extensive connectivity, reliability, and scalability. The platform now supports business analytics, transitioning from ETL to ELT processes. This has resulted in a remarkable improvement in how we make data available for business decisions. The platform’s capabilities are being extended to power Digital Workers (AI agents) using large language models, encompassing model training, fine-tuning, and inference. Two Digital Workers are currently deployed, with more in development. This presentation will detail the rationale and background of this evolution, followed by an explanation of the architectural decisions made and the challenges encountered and resolved throughout the process of transforming into an AI-enabled data platform to power Red Hat’s business.
Airflow’s Asset concept originated in data lineage and has evolved into a scheduling construct (data-aware, event-based scheduling), and it has even more potential. This talk discusses how other parts of Airflow, namely Connection and Object Storage, contain concepts related to Asset, and how we can tie them all together to make task authoring flow even more naturally. Planned topics: • A brief history of Asset and related constructs • The current state of Asset concepts • Inlets, anyone? • Finding inspiration from Pydantic et al. • My next step for Asset
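For readers new to the concept, a minimal sketch of Asset-based (data-aware) scheduling, assuming Airflow 3's task SDK imports; in Airflow 2.x the analogous construct is airflow.datasets.Dataset, and the URI below is a placeholder.

```python
from datetime import datetime
from airflow.sdk import Asset, dag, task

raw_orders = Asset("s3://example-bucket/raw/orders")  # placeholder URI

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def produce_orders():
    @task(outlets=[raw_orders])  # marks the Asset as updated when the task succeeds
    def extract():
        ...

    extract()

@dag(schedule=[raw_orders], start_date=datetime(2024, 1, 1), catchup=False)
def consume_orders():
    # This DAG runs whenever raw_orders is updated, instead of on a time-based schedule.
    @task
    def transform():
        ...

    transform()

produce_orders()
consume_orders()
```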
In today’s fast-paced business world, timely and reliable insights are crucial — but manual BI workflows can’t keep up. This session offers a practical guide to automating business intelligence processes using Apache Airflow. We’ll walk through real-world examples of automating data extraction, transformation, dashboard refreshes, and report distribution. Learn how to design DAGs that align with business SLAs, trigger workflows based on events, integrate with popular BI tools like Tableau and Power BI, and implement alerting and failure recovery mechanisms. Whether you’re new to Airflow or looking to scale your BI operations, this session will equip you with actionable strategies to save time, reduce errors, and supercharge your organization’s decision-making capabilities.
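As a rough illustration of the pattern (not tied to any particular BI tool), here is a sketch of a weekday-morning refresh DAG: a run timeout stands in for the business SLA, retries and a failure callback provide basic recovery and alerting, and the dashboard-refresh call is left as a stub because it depends on your BI platform's API.

```python
from datetime import datetime, timedelta
from airflow.decorators import dag, task

def notify_on_failure(context):
    # Wire this into Slack, email, or PagerDuty as appropriate; kept as a stub here.
    print(f"Task {context['task_instance'].task_id} failed")

@dag(
    schedule="0 6 * * 1-5",              # refresh before business hours on weekdays
    start_date=datetime(2024, 1, 1),
    catchup=False,
    dagrun_timeout=timedelta(hours=1),   # rough stand-in for the business SLA
    default_args={"retries": 2, "on_failure_callback": notify_on_failure},
)
def daily_bi_refresh():
    @task
    def extract_sales():
        ...  # pull data from the source system

    @task
    def refresh_dashboard():
        # In practice this would call the Tableau or Power BI REST API;
        # the call is omitted because it depends on your BI tool.
        ...

    extract_sales() >> refresh_dashboard()

daily_bi_refresh()
```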
I will talk about how Apache Airflow is used in the healthcare sector with the integration of LLMs to enhance efficiency. Healthcare generates vast volumes of unstructured data daily, from clinical notes and patient intake forms to chatbot conversations and telehealth reports. Medical teams struggle to keep up, leading to delays in triage and missed critical symptoms. This session explores how Apache Airflow can be the backbone of an automated healthcare triage system powered by Large Language Models (LLMs). I’ll demonstrate how I designed and implemented an Airflow DAG orchestration pipeline that automates the ingestion, processing, and analysis of patient data from diverse sources in real time. Airflow schedules and coordinates data extraction, preprocessing, LLM-based symptom extraction, and urgency classification, and finally routes actionable insights to healthcare professionals. The session will focus on the following: • Managing complex workflows in healthcare data pipelines • Safely integrating LLM inference calls into Airflow tasks • Designing human-in-the-loop checkpoints for ethical AI usage • Monitoring workflow health and data quality with Airflow
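A highly simplified sketch of such a pipeline, using synthetic data: the LLM call is a hypothetical stand-in for a real model client, and any real deployment would need PHI safeguards before sending patient text to a model.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def patient_triage():
    @task
    def ingest_notes() -> list[str]:
        # Placeholder: pull new clinical notes / intake forms from source systems.
        return ["Patient reports chest pain and shortness of breath."]

    @task
    def extract_symptoms(notes: list[str]) -> list[dict]:
        def call_llm(prompt: str) -> str:
            # Hypothetical model client; swap in your provider's SDK.
            return "chest pain; dyspnea"
        return [{"note": n, "symptoms": call_llm(n)} for n in notes]

    @task
    def classify_urgency(records: list[dict]) -> list[dict]:
        urgent_terms = {"chest pain", "dyspnea"}
        for r in records:
            r["urgent"] = any(t in r["symptoms"] for t in urgent_terms)
        return records

    @task
    def route_to_clinicians(records: list[dict]) -> None:
        # Human-in-the-loop checkpoint: urgent cases are escalated for clinician review.
        for r in records:
            if r["urgent"]:
                print("Escalating for review:", r["note"])

    route_to_clinicians(classify_urgency(extract_symptoms(ingest_notes())))

patient_triage()
```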
Security teams often face alert fatigue from massive volumes of raw log data. This session demonstrates how to combine Apache Airflow, Wazuh, and LLMs to build automated pipelines for smarter threat triage—grounded in the MITRE ATT&CK framework. We’ll explore how Airflow can orchestrate a full workflow: ingesting Wazuh alerts, using LLMs to summarize log events, matching behavior to ATT&CK tactics and techniques, and generating enriched incident summaries. With AI-powered interpretation layered on top of structured threat intelligence, teams can reduce manual effort while increasing context and clarity. You’ll learn how to build modular DAGs that automate: • Parsing and routing Wazuh alerts, • Querying LLMs for human-readable summaries, • Mapping IOCs to ATT&CK using vector similarity or prompt templates, • Outputting structured threat reports for analysts. The session includes a real-world example integrating open-source tools and public ATT&CK data, and will provide reusable components for rapid adoption. If you’re a SecOps engineer or ML practitioner in cybersecurity, this talk gives you a practical blueprint to deploy intelligent, scalable threat automation.
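To make the enrichment step concrete, here is a sketch of just that piece: a prompt template asking the model to map an alert to an ATT&CK technique and return structured JSON. The Wazuh ingestion, vector-similarity variant, and report routing are omitted, and call_llm is a hypothetical stand-in for a real model client.

```python
import json
from airflow.decorators import task

ATTACK_PROMPT = """Summarize the alert below and identify the most likely
MITRE ATT&CK technique. Respond as JSON with keys: summary, technique_id, technique_name.

Alert: {alert}"""

def call_llm(prompt: str) -> str:
    # Replace with your provider's SDK; a canned response keeps the sketch runnable.
    return json.dumps({
        "summary": "Repeated failed SSH logins from a single source IP.",
        "technique_id": "T1110",
        "technique_name": "Brute Force",
    })

@task
def enrich_alert(alert: dict) -> dict:
    # Produces a structured, analyst-readable record from a raw Wazuh alert.
    response = call_llm(ATTACK_PROMPT.format(alert=alert["rule"]))
    return {**alert, **json.loads(response)}
```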
Apache Airflow’s executor landscape has traditionally presented users with a clear trade-off: choose either the speed of local execution or the scalability, isolation, and configurability of remote execution. The AWS Lambda Executor introduces a new paradigm that bridges this gap, offering near-local execution speeds with the benefits of remote containerization. This talk will begin with a brief overview of Airflow’s executors, how they work and what they are responsible for, highlighting the compromises between different executors. We will explore the emerging niche for fast yet remote execution and demonstrate how the AWS Lambda Executor fills this space. We will also address practical considerations when using such an executor, such as working within Lambda’s 15-minute execution limit, and how to mitigate this using a multi-executor configuration. Whether you’re new to Airflow or an experienced user, this session will provide valuable insights into task execution and how you can combine the best of both local and remote execution paradigms.
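A minimal sketch of what a multi-executor setup can look like (hybrid executors were introduced in Airflow 2.10). The Lambda executor ships with the Amazon provider package; its exact class path is an assumption here, so check the provider docs for your installed version.

```python
# Register more than one executor; the first entry is the default, and
# individual tasks can opt into the others. The class path below is an
# assumption to be verified against the Amazon provider docs:
#
#   AIRFLOW__CORE__EXECUTOR=airflow.providers.amazon.aws.executors.aws_lambda.AwsLambdaExecutor,CeleryExecutor
#
# With Lambda as the default, short tasks start almost instantly, while anything
# that might exceed Lambda's 15-minute cap is pinned to the Celery executor.
from datetime import datetime, timedelta
from airflow.decorators import dag, task

@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def mixed_executor_pipeline():
    @task
    def quick_transform():
        ...  # finishes in seconds, runs on the default (Lambda) executor

    @task(executor="CeleryExecutor", execution_timeout=timedelta(hours=2))
    def long_running_load():
        ...  # may run for hours, so it is routed to a long-lived worker

    quick_transform() >> long_running_load()

mixed_executor_pipeline()
```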
How a Complete Beginner in Data Engineering / Junior Computer Science Student Became an Apache Airflow Committer in Just 5 Months—With 70+ PRs and 300 Hours of Contributions. This talk is aimed at those who are still hesitant about contributing to Apache Airflow. I hope to inspire and encourage anyone to take the first step and start their journey in open source—let’s build together!
Ensuring the stability of a major release like Airflow 3 required extensive testing across multiple dimensions. In this session, we will dive into the testing strategies and validation techniques used to guarantee a smooth rollout. From unit and integration tests to real-world DAG validations, this talk will cover the challenges faced, key learnings, and best practices for testing Airflow. Whether you’re a contributor, QA engineer, or Airflow user preparing for migration, this session will offer valuable takeaways to improve your own testing approach.
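One concrete example of the kind of DAG validation involved: a DagBag-based pytest suite that catches import errors and obviously misconfigured DAGs before a release or migration. This is a common community pattern, sketched here assuming Airflow 2-style imports and a dags/ folder.

```python
import pytest
from airflow.models import DagBag

@pytest.fixture(scope="session")
def dag_bag():
    # Parse every DAG file once so the tests below can inspect the results.
    return DagBag(dag_folder="dags/", include_examples=False)

def test_no_import_errors(dag_bag):
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"

def test_every_dag_has_tasks(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tasks, f"{dag_id} defines no tasks"
```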
As teams scale their Airflow workflows, a common question is: “My DAG has 5,000 tasks—how long will it take to run in Airflow?” Beyond execution time, users often face challenges with dynamically generated DAGs, such as: • Delayed visualization in the Airflow UI after deployment • High resource consumption, leading to Kubernetes pod evictions and out-of-memory errors. While estimating resource utilization in a distributed data platform is complex, benchmarking can provide crucial insights. In this talk, we’ll share our approach to benchmarking dynamically generated DAGs with Astronomer Cosmos (https://github.com/astronomer/astronomer-cosmos), covering: • Designing representative and extensible baseline tests • Setting up an isolated, distributed infrastructure for benchmarking • Running reproducible performance tests • Measuring DAG run times and task throughput • Evaluating CPU & memory consumption to optimize deployments. By the end of this session, you will have practical benchmarks and strategies for making informed decisions when evaluating the performance of DAGs in Airflow.
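For context, a sketch of what a Cosmos-rendered baseline DAG might look like as a benchmark subject: a dbt project whose model count can be scaled between runs while everything else stays fixed. The project path and profile details are placeholders, and the constructor arguments should be checked against the Cosmos version you install.

```python
from datetime import datetime
from cosmos import DbtDag, ProjectConfig, ProfileConfig

benchmark_dag = DbtDag(
    dag_id="cosmos_benchmark_baseline",
    # Placeholder dbt project; swap in projects of increasing size between runs.
    project_config=ProjectConfig("/opt/airflow/dbt/benchmark_project"),
    profile_config=ProfileConfig(
        profile_name="benchmark",
        target_name="dev",
        profiles_yml_filepath="/opt/airflow/dbt/profiles.yml",
    ),
    schedule=None,                      # triggered by the benchmark harness
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```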
In legacy Airflow 2.x, each DAG run was tied to a unique “execution_date.” By removing this requirement, Airflow can now directly support a variety of new use cases, such as model training and generative AI inference, without the hacks and workarounds typically used by machine learning and AI engineers. In this talk, we will delve into the significant advancements in Airflow 3 that enable GenAI and MLOps use cases, particularly through the changes outlined in AIP-83. We’ll cover key changes like the renaming of “execution_date” to “logical_date,” allowing it to be null, and the introduction of the new “run_after” field, which provides a more meaningful mechanism for scheduling and sorting. Furthermore, we’ll discuss how Airflow 3 enables multiple parallel runs, empowering diverse triggering mechanisms and easing backfill logic, with a real-world demo.
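A rough sketch of what this unlocks: kicking off several parallel runs of the same DAG with no logical_date via the stable REST API. The base URL, API version prefix, endpoint shape, and auth mechanism are assumptions to adjust for your deployment.

```python
import requests

BASE_URL = "http://localhost:8080/api/v2"  # assumed Airflow 3 API prefix
TOKEN = "..."  # obtain via your deployment's auth manager

for batch in ["batch-a", "batch-b", "batch-c"]:
    resp = requests.post(
        f"{BASE_URL}/dags/inference_pipeline/dagRuns",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "logical_date": None,          # allowed in Airflow 3 (AIP-83)
            "conf": {"batch_id": batch},   # each parallel run processes its own batch
        },
    )
    resp.raise_for_status()
```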
Using OpenTelemetry tracing, users can gain full visibility into tasks and calls to outside services. This is increasingly important, especially as tasks in an Airflow DAG involve multiple complex computations that take hours or days to complete. Airflow makes it easy to monitor how long entire DAG runs or individual tasks take, but the internal steps of those tasks remain opaque. OpenTelemetry gives users much more operational awareness and metrics they can use to improve operations. This presentation will explain the basics: what OpenTelemetry is and how it works – perfect for someone with no prior familiarity with tracing or with the use of OpenTelemetry. It will demonstrate how Airflow users can leverage the new tracing support to achieve deeper observability into DAG runs.
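A minimal standalone sketch of the core OpenTelemetry idea (nested spans forming a trace), independent of Airflow itself; the console exporter keeps it runnable without a collector, whereas a real deployment would export to a tracing backend.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Export spans to stdout so the example runs without any infrastructure.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")

with tracer.start_as_current_span("dag_run"):       # outer unit of work
    with tracer.start_as_current_span("extract"):   # child span for one step
        ...  # a call to an external service would be recorded here
    with tracer.start_as_current_span("load"):
        ...
```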
Airflow 3 made great strides with AIP-66, introducing the concept of a DAG bundle. This addressed one of the fundamental architectural limitations of the original Airflow design, namely how DAGs are deployed, bringing structure to what often had to be operated as a loose pile of files. However, we believe this should by no means be the end of the road when it comes to making DAG management easier, authoring more accessible to a broader audience, and integration with Data Agents smoother. We believe the next step in Airflow’s evolution is a native option to break away from the necessity of having a real file present on the file systems of multiple components just to get a DAG up and running. This is what we are hoping to achieve as part of AIP-85, extendable DAG parsing control. In this talk I’d like to give a detailed overview of how we want to make it happen and show examples of the valuable integrations we hope to unblock with it.
Efficiently handling long-running workflows is crucial for scaling modern data pipelines. Apache Airflow’s deferrable operators help offload tasks during idle periods — freeing worker slots while tracking progress. This session explores how Cosmos 1.9 (https://github.com/astronomer/astronomer-cosmos) integrates Airflow’s deferrable capabilities to enhance orchestrating dbt (https://github.com/dbt-labs/dbt-core) in production, with insights from recent contributions that introduced this functionality. Key takeaways: • Deferrable Operators: how they work and why they’re ideal for long-running dbt tasks • Integrating with Cosmos: refactoring and enhancements to enable deferrable behaviour across platforms • Performance Gains: resource savings and task throughput improvements from deferrable execution • Challenges & Future Enhancements: lessons learned, compatibility, and ideas for broader support. Whether orchestrating dbt models on a cloud warehouse or managing large-scale transformations, this session offers practical strategies to reduce resource contention and boost pipeline performance.
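A compact sketch of the deferrable pattern Cosmos builds on: the operator submits remote work, then defers to a trigger so its worker slot is freed while waiting. The operator and trigger here are illustrative, not Cosmos internals, and the import paths assume Airflow 2-style module locations.

```python
import asyncio
from airflow.models.baseoperator import BaseOperator
from airflow.triggers.base import BaseTrigger, TriggerEvent

class JobDoneTrigger(BaseTrigger):
    """Polls a remote job asynchronously from the triggerer process."""

    def __init__(self, job_id: str, poll_interval: float = 30.0):
        super().__init__()
        self.job_id = job_id
        self.poll_interval = poll_interval

    def serialize(self):
        # Hypothetical import path for this example trigger.
        return ("my_plugin.triggers.JobDoneTrigger",
                {"job_id": self.job_id, "poll_interval": self.poll_interval})

    async def run(self):
        while not await self._job_finished():
            await asyncio.sleep(self.poll_interval)  # no worker slot is held here
        yield TriggerEvent({"job_id": self.job_id, "status": "success"})

    async def _job_finished(self) -> bool:
        ...  # poll the warehouse / dbt job API asynchronously
        return True

class SubmitJobOperator(BaseOperator):
    def execute(self, context):
        job_id = "job-123"  # submit the remote job, then hand off to the trigger
        self.defer(trigger=JobDoneTrigger(job_id), method_name="execute_complete")

    def execute_complete(self, context, event=None):
        # Runs on a worker again once the trigger fires.
        self.log.info("Job %s finished with status %s", event["job_id"], event["status"])
```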