Airflow's Asset concept originated in data lineage and has evolved into a scheduling primitive (data-aware, event-based scheduling), and it has even more potential. This talk discusses how other parts of Airflow, namely Connection and Object Storage, contain concepts related to Asset, and how we can tie them all together to make task authoring flow even more naturally. Planned topics: a brief history of Asset and related constructs; the current state of Asset concepts; Inlets, anyone?; finding inspiration from Pydantic et al.; my next step for Asset.
In today’s fast-paced business world, timely and reliable insights are crucial — but manual BI workflows can’t keep up. This session offers a practical guide to automating business intelligence processes using Apache Airflow. We’ll walk through real-world examples of automating data extraction, transformation, dashboard refreshes, and report distribution. Learn how to design DAGs that align with business SLAs, trigger workflows based on events, integrate with popular BI tools like Tableau and Power BI, and implement alerting and failure recovery mechanisms. Whether you’re new to Airflow or looking to scale your BI operations, this session will equip you with actionable strategies to save time, reduce errors, and supercharge your organization’s decision-making capabilities.
I will talk about how Apache Airflow is used in the healthcare sector, integrated with LLMs to enhance efficiency. Healthcare generates vast volumes of unstructured data daily, from clinical notes and patient intake forms to chatbot conversations and telehealth reports. Medical teams struggle to keep up, leading to delays in triage and missed critical symptoms. This session explores how Apache Airflow can be the backbone of an automated healthcare triage system powered by Large Language Models (LLMs). I’ll demonstrate how I designed and implemented an Airflow DAG orchestration pipeline that automates the ingestion, processing, and analysis of patient data from diverse sources in real time. Airflow schedules and coordinates data extraction, preprocessing, LLM-based symptom extraction, and urgency classification, and finally routes actionable insights to healthcare professionals. The session will focus on the following: • Managing complex workflows in healthcare data pipelines, • Safely integrating LLM inference calls into Airflow tasks, • Designing human-in-the-loop checkpoints for ethical AI usage, • Monitoring workflow health and data quality with Airflow.
Security teams often face alert fatigue from massive volumes of raw log data. This session demonstrates how to combine Apache Airflow, Wazuh, and LLMs to build automated pipelines for smarter threat triage—grounded in the MITRE ATT&CK framework. We’ll explore how Airflow can orchestrate a full workflow: ingesting Wazuh alerts, using LLMs to summarize log events, matching behavior to ATT&CK tactics and techniques, and generating enriched incident summaries. With AI-powered interpretation layered on top of structured threat intelligence, teams can reduce manual effort while increasing context and clarity. You’ll learn how to build modular DAGs that automate: • Parsing and routing Wazuh alerts, • Querying LLMs for human-readable summaries, • Mapping IOCs to ATT&CK using vector similarity or prompt templates, • Outputting structured threat reports for analysts. The session includes a real-world example integrating open-source tools and public ATT&CK data, and will provide reusable components for rapid adoption. If you’re a SecOps engineer or ML practitioner in cybersecurity, this talk gives you a practical blueprint to deploy intelligent, scalable threat automation.
Apache Airflow’s executor landscape has traditionally presented users with a clear trade-off: choose either the speed of local execution or the scalability, isolation, and configurability of remote execution. The AWS Lambda Executor introduces a new paradigm that bridges this gap, offering near-local execution speeds with the benefits of remote containerization. This talk will begin with a brief overview of Airflow’s executors, how they work and what they are responsible for, highlighting the compromises between different executors. We will explore the emerging niche for fast yet remote execution and demonstrate how the AWS Lambda Executor fills this space. We will also address practical considerations when using such an executor, such as working within Lambda’s 15-minute execution limit, and how to mitigate it using a multi-executor configuration. Whether you’re new to Airflow or an experienced user, this session will provide valuable insights into task execution and how you can combine the best of both local and remote execution paradigms.
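The multi-executor configuration mentioned above (available since Airflow 2.10, AIP-61) is a plain `airflow.cfg` change. A minimal sketch, with the caveat that the Lambda executor's module path below is an assumption; verify the exact class path against the apache-airflow-providers-amazon documentation for your version:

```ini
[core]
# Comma-separated list of executors: the first entry is the default,
# and individual tasks can opt into any of the others via the
# operator-level `executor` argument. The Lambda executor path here
# is an assumption; check the Amazon provider docs.
executor = airflow.providers.amazon.aws.executors.aws_lambda.AwsLambdaExecutor,KubernetesExecutor
```

With this in place, a task expected to exceed Lambda's 15-minute cap can be routed to the secondary executor by passing, for example, `executor="KubernetesExecutor"` to its operator, while everything else keeps the fast Lambda default.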
How a Complete Beginner in Data Engineering / Junior Computer Science Student Became an Apache Airflow Committer in Just 5 Months—With 70+ PRs and 300 Hours of Contributions

This talk is aimed at those who are still hesitant about contributing to Apache Airflow. I hope to inspire and encourage anyone to take the first step and start their journey in open source—let’s build together!
Ensuring the stability of a major release like Airflow 3 required extensive testing across multiple dimensions. In this session, we will dive into the testing strategies and validation techniques used to guarantee a smooth rollout. From unit and integration tests to real-world DAG validations, this talk will cover the challenges faced, key learnings, and best practices for testing Airflow. Whether you’re a contributor, QA engineer, or Airflow user preparing for migration, this session will offer valuable takeaways to improve your own testing approach.
As teams scale their Airflow workflows, a common question is: “My DAG has 5,000 tasks—how long will it take to run in Airflow?” Beyond execution time, users often face challenges with dynamically generated DAGs, such as delayed visualization in the Airflow UI after deployment, and high resource consumption leading to Kubernetes pod evictions and out-of-memory errors. While estimating resource utilization in a distributed data platform is complex, benchmarking can provide crucial insights. In this talk, we’ll share our approach to benchmarking dynamically generated DAGs with Astronomer Cosmos (https://github.com/astronomer/astronomer-cosmos), covering: designing representative and extensible baseline tests; setting up an isolated, distributed infrastructure for benchmarking; running reproducible performance tests; measuring DAG run times and task throughput; and evaluating CPU & memory consumption to optimize deployments. By the end of this session, you will have practical benchmarks and strategies for making informed decisions about evaluating the performance of DAGs in Airflow.
In legacy Airflow 2.x, each DAG run was tied to a unique “execution_date.” By removing this requirement, Airflow can now directly support a variety of new use cases, such as model training and generative AI inference, without the hacks and workarounds typically used by machine learning and AI engineers. In this talk, we will delve into the significant advancements in Airflow 3 that enable GenAI and MLOps use cases, particularly through the changes outlined in AIP-83. We’ll cover key changes such as the renaming of “execution_date” to “logical_date,” which is now allowed to be null, and the introduction of the new “run_after” field, which provides a more meaningful mechanism for scheduling and sorting. Furthermore, we’ll discuss how Airflow 3 enables multiple parallel runs, empowering diverse triggering mechanisms and easing backfill logic, with a real-world demo.
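Why does “run_after” make a better sort key than “logical_date”? A small illustrative sketch in plain Python (stub classes, not Airflow internals): once logical_date may be null, it can no longer totally order runs, while run_after is always present.

```python
# Illustrative only: DagRunStub mimics the Airflow 3 shape where
# logical_date may be None (e.g. for manual/API-triggered runs),
# while run_after always exists.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DagRunStub:
    run_id: str
    logical_date: Optional[datetime]  # may be None in Airflow 3
    run_after: datetime               # earliest time the run may start

runs = [
    DagRunStub("manual__b", None, datetime(2025, 1, 1, 12, 5)),
    DagRunStub("manual__a", None, datetime(2025, 1, 1, 12, 0)),
    DagRunStub("scheduled", datetime(2025, 1, 1), datetime(2025, 1, 1, 12, 2)),
]

# Sorting by logical_date would blow up on the None values;
# run_after gives a total order for every kind of run.
ordered = sorted(runs, key=lambda r: r.run_after)
print([r.run_id for r in ordered])  # ['manual__a', 'scheduled', 'manual__b']
```

The same property is what allows multiple parallel runs of one DAG: none of them needs a unique logical_date to be distinguishable or orderable.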
Using OpenTelemetry tracing, users can gain full visibility into tasks and calls to outside services. This is an increasingly important skill, especially as tasks in an Airflow DAG involve multiple complex computations which take hours or days to complete. Airflow makes it easy to monitor how long entire DAG runs or individual tasks take, but it treats the work inside each task as a black box. OpenTelemetry gives users much more operational awareness and metrics they can use to improve operations. This presentation will explain the basics: what OpenTelemetry is and how it works – perfect for someone with no prior familiarity with tracing or with the use of OpenTelemetry. It will demonstrate how Airflow users can leverage the new tracing support to achieve deeper observability into DAG runs.
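Turning on Airflow's OpenTelemetry trace support is a configuration change. A minimal sketch, assuming a local OTLP-compatible collector; the option names below reflect recent Airflow releases and should be verified against the configuration reference for your version:

```ini
[traces]
# Emit OpenTelemetry traces for DAG runs and task instances
# (option names assumed; check your version's config reference).
otel_on = True
# Endpoint of the OpenTelemetry collector receiving the spans.
otel_host = localhost
otel_port = 4318
otel_ssl_active = False
```

With a collector (e.g. the OpenTelemetry Collector forwarding to Jaeger or a vendor backend) listening on that endpoint, DAG-run and task spans become visible without any DAG code changes.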
Airflow 3 made great strides with AIP-66, introducing the concept of a DAG bundle. This challenged one of the fundamental architectural limitations of the original Airflow design, how DAGs are deployed, and brought structure to something that in the past often had to be operated as a loose pile of files. However, we believe this should by no means be the end of the road when it comes to making DAG management easier, authoring more accessible to a broader audience, and integration with Data Agents smoother. We believe the next step in Airflow’s evolution is a native option to break away from the necessity of having a real file in the file systems of multiple components to get your DAG up and running. This is what we hope to achieve as part of AIP-85, extendable DAG parsing control. In this talk I’d like to give a detailed overview of how we want to make this happen and show examples of the valuable integrations we hope to unblock with it.
Efficiently handling long-running workflows is crucial for scaling modern data pipelines. Apache Airflow’s deferrable operators help offload tasks during idle periods — freeing worker slots while tracking progress. This session explores how Cosmos 1.9 (https://github.com/astronomer/astronomer-cosmos) integrates Airflow’s deferrable capabilities to enhance orchestrating dbt (https://github.com/dbt-labs/dbt-core) in production, with insights from recent contributions that introduced this functionality. Key takeaways: Deferrable Operators: How they work and why they’re ideal for long-running dbt tasks. Integrating with Cosmos: Refactoring and enhancements to enable deferrable behaviour across platforms. Performance Gains: Resource savings and task throughput improvements from deferrable execution. Challenges & Future Enhancements: Lessons learned, compatibility, and ideas for broader support. Whether orchestrating dbt models on a cloud warehouse or managing large-scale transformations, this session offers practical strategies to reduce resource contention and boost pipeline performance.
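The core idea behind deferrable operators can be sketched with plain asyncio (an analogy, not the Airflow API): while an async "trigger" awaits an external condition, the worker slot is handed back, and the task resumes only when the event fires. All names here are illustrative.

```python
# Analogy for Airflow's deferral pattern using only the standard library.
# In real Airflow, the operator calls self.defer(trigger=..., method_name=...),
# the trigger runs on the triggerer process, and the worker slot is released.
import asyncio

async def trigger_wait(seconds: float) -> str:
    # Stands in for a trigger polling an external system (e.g. a dbt
    # Cloud job) without occupying a worker slot while it waits.
    await asyncio.sleep(seconds)
    return "dbt job finished"

async def deferrable_task() -> str:
    # Phase 1: submit the long-running work, then defer instead of
    # blocking a worker for the whole duration.
    event = await trigger_wait(0.01)  # slot is free during this await
    # Phase 2: resume with the trigger's event payload and finish up.
    return f"resumed after: {event}"

result = asyncio.run(deferrable_task())
print(result)  # resumed after: dbt job finished
```

The resource savings Cosmos targets come from exactly this shape: a dbt run that takes an hour on the warehouse no longer pins an Airflow worker for that hour.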
Investigative journalism often relies on uncovering hidden patterns in vast amounts of unstructured and semi-structured data. At the FT, we leverage Airflow to orchestrate AI-powered pipelines that transform complex, fragmented datasets into structured insights. Our Storyfinding team works closely with journalists to automate tedious data processing, enabling them to tell stories that might otherwise go untold. This talk will explore how we use Airflow to process and analyze text, documents, and other difficult-to-structure data sources combining AI, machine learning, and advanced computational techniques to extract meaningful entities, relationships, and patterns. We’ll also showcase our connection analysis workflows, which link various datasets to reveal previously hidden chains of people and companies, a crucial capability for investigative reporting. Attendees will learn: How Airflow can orchestrate AI-driven pipelines for handling unstructured and semi-structured data. Techniques for automating connection analysis to support investigative journalism. Lessons from our experience working with journalists to develop data-driven storytelling and storyfinding capabilities.
AI and ML pipelines built in Airflow often power critical business outcomes, but they rarely operate in isolation. In this hands-on workshop, learn how Control-M integrates with Airflow to orchestrate end-to-end workflows that include upstream and downstream enterprise systems like Supply Chain and Billing. Gain visibility, reliability, and seamless coordination across your data pipelines and the business operations they support.
Want to be resilient to zonal or regional outages when building Airflow in a cloud environment? Unforeseen disruptions in cloud infrastructure, whether isolated to specific zones or impacting entire regions, pose a tangible threat to the continuous operation of critical data workflows managed by Airflow. These outages translate directly into real-world consequences: interruptions in essential services, delays in crucial information delivery, and reduced reliability and efficiency of the operational processes that businesses and individuals depend on daily. The inability to process data reliably due to infrastructure instability can cascade into tangible setbacks across diverse sectors, highlighting the need for resilient and robust Airflow deployments. Let’s dive deep into strategies for building truly resilient Airflow setups that can withstand zonal and even regional outages. We’ll explore architectural patterns like multi-availability-zone deployments, cross-region failover mechanisms, and robust data replication techniques to minimise downtime and ensure business continuity. Discover practical tips and best practices for a resilient Airflow infrastructure. By attending this presentation, you’ll gain the knowledge and tools to significantly improve the reliability and stability of your critical data pipelines, saving time and resources and preventing costly disruptions.
As organizations scale their data infrastructure, Apache Airflow becomes a mission-critical component for orchestrating workflows efficiently. But scaling Airflow successfully isn’t just about running pipelines—it’s about building a Center of Excellence (CoE) that empowers teams with the right strategy, best practices, and long-term enablement. Join Jon Leek and Michelle Winters as they share their experiences helping customers design and implement Airflow Centers of Excellence. They’ll walk through real-world challenges, best practices, and the structured approach Astronomer takes to ensure teams have the right plan, resources, and support to succeed. Whether you’re just starting with Airflow or looking to optimize and scale your workflows, this session will give you a proven framework to build a sustainable Airflow Center of Excellence within your organization. 🚀
As your organization scales to 20+ data science teams and 300+ DS/ML/DE engineers, you face a critical challenge: how to build a secure, reliable, and scalable orchestration layer that supports both fast experimentation and stable production workflows. We chose Airflow — and didn’t regret it! But to make it truly work at our scale, we had to rethink its architecture from the ground up. In this talk, we’ll share how we turned Airflow into a powerful MLOps platform through its core capability: running pipelines across multiple K8s GPU clusters from a single UI (!) using per-cluster worker pools. To support ease of use, we developed MLTool — our own library for fast and standardized DAG development, integrated Vault for secure secret management across teams, enabled real-time logging with S3 persistence and built a custom SparkSubmitOperator for Kerberos-authenticated Spark/Hadoop jobs in Kubernetes. We also streamlined the developer experience — users can generate a GitLab repo and deploy a versioned pipeline to prod in under 10 minutes! We’re proud of what we’ve built — and our users are too. Now we want to share it with the world!
As modern data ecosystems grow in complexity, ensuring transparency, discoverability, and governance in data workflows becomes critical. Apache Airflow, a powerful workflow orchestration tool, enables data engineers to build scalable pipelines, but without proper visibility into data lineage, ownership, and quality, teams risk operating in a black box. In this talk, we will explore how integrating Airflow with a data catalog can bring clarity and transparency to data workflows. We’ll discuss how metadata-driven orchestration enhances data governance, enables lineage tracking, and improves collaboration across teams. Through real-world use cases, we will demonstrate how Airflow can automate metadata collection, update data catalogs dynamically, and ensure data quality at every stage of the pipeline. Attendees will walk away with practical strategies for implementing a transparent data workflow that fosters trust, efficiency, and compliance in their data infrastructure.
During this workshop you will learn about the latest features released in Cloud Composer, a managed service for Apache Airflow on Google Cloud Platform.
During this session the audience will learn about the newest features of the managed Airflow offering provided by Google Cloud. If you would like to operate Airflow at scale or in regulated environments, then this session is for you.