
Event

Airflow Summit 2025

2025-07-01 · Airflow Summit

Activities tracked

140

Airflow Summit 2025 program

Sessions & talks

Showing 51–75 of 140 · Newest first


Driving Analytics with Open Source: Airbyte, dbt, Airflow & Metabase

2025-07-01
session

In this talk, I’ll walk through how we built an end-to-end analytics pipeline using open-source tools (Airbyte, dbt, Airflow, and Metabase). At WirePick, we extract data from multiple sources using Airbyte OSS into PostgreSQL, transform it into business-specific data marts with dbt, and automate the entire workflow using Airflow. Our Metabase dashboards provide real-time insights, and we integrate Slack notifications to alert stakeholders when key business metrics change. This session will cover: data extraction, using Airbyte OSS to pull data from multiple sources; transformation and modeling, showing how dbt helps create reusable data marts; automation and orchestration, managing the workflow with Airflow; and data-driven decision-making, delivering insights through Metabase and Slack alerts.
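
A minimal sketch of the orchestration layer this abstract describes, assuming the Airbyte and Slack providers are installed; the connection IDs, Airbyte connection UUID, dbt project path, and Slack message are placeholders, and operator parameter names vary slightly across provider versions:

```python
# Hypothetical DAG sketch: Airbyte sync -> dbt build -> Slack alert.
# Connection IDs, the Airbyte connection UUID, and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

with DAG(
    dag_id="open_source_analytics",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Pull raw data from sources into PostgreSQL via Airbyte OSS.
    extract = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",
        connection_id="<airbyte-connection-uuid>",
    )

    # Build business-specific data marts with dbt.
    transform = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/dbt/wirepick && dbt build --target prod",
    )

    # Notify stakeholders that fresh marts are available in Metabase.
    notify = SlackWebhookOperator(
        task_id="notify_slack",
        slack_webhook_conn_id="slack_webhook_default",
        message="Daily marts refreshed: Metabase dashboards are up to date.",
    )

    extract >> transform >> notify
```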

Dynamic DAGs and Data Quality using DAGFactory

2025-07-01
session

We have a similar pattern of DAGs running for different data quality dimensions like accuracy, timeliness, and completeness. Rebuilding this pattern by hand means duplicating code and potentially introducing human error through copy-paste, or asking people to write the same code again. To solve this, we are doing a few things: we run DAGs via DagFactory to dynamically generate DAGs from a small amount of YAML describing the steps we want in our DQ checks, and we hide this behind a UI hooked into the GitHub PR-open step, so a user simply provides some inputs or selects from dropdowns and a YAML DAG is generated for them. This highlights DagFactory’s potential to hide Airflow Python code from users and make it accessible to data analysts and business intelligence teams as well as software engineers, while reducing human error. YAML is the perfect format for generating code and opening a PR, and DagFactory is the perfect fit for that. All of this runs in GCP Cloud Composer.
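
As a rough illustration of the pattern (not the team’s actual configuration), a YAML file describes the DQ check and a small Python file asks the dag-factory package to materialize DAGs from it; the YAML keys shown in the comment are simplified, and dag-factory entry points differ slightly between package versions:

```python
# dq_dags.py: placed in the DAGs folder of Cloud Composer.
# A YAML file such as the one sketched in this comment is what the UI
# generates and commits through a GitHub PR; field names are illustrative.
#
#   dq_accuracy_orders:
#     default_args:
#       owner: data-platform
#       start_date: 2025-01-01
#     schedule_interval: "0 6 * * *"
#     tasks:
#       run_accuracy_check:
#         operator: airflow.operators.bash.BashOperator
#         bash_command: "python run_dq_check.py --dimension accuracy --table orders"
#
import dagfactory

# Generate one Airflow DAG per top-level entry in the YAML config.
config_file = "/home/airflow/gcs/dags/dq_dags.yml"
dag_factory = dagfactory.DagFactory(config_file)
dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())
```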

Dynamic Data Pipelines with DBT and Airflow

2025-07-01
session

This session showcases Okta’s innovative approach to data pipeline orchestration with dbt and Airflow: how we’ve implemented dynamically generated Airflow DAGs based on dbt’s dependency graph. This allows us to enforce strict data quality standards by automatically executing downstream model tests before upstream model deployments, effectively preventing error cascades. The entire CI/CD pipeline, from dbt model changes to production DAG deployment, is fully automated. The result? Accelerated development cycles, reduced operational overhead, and bulletproof data reliability.
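
One common way to implement this kind of generation, not necessarily Okta’s exact mechanism, is to parse dbt’s manifest.json and emit one Airflow task per model, wiring edges from the manifest’s depends_on graph. A hedged sketch, with paths and commands as placeholders:

```python
# Hypothetical generator: build an Airflow DAG from dbt's manifest.json.
# Paths and selection logic are placeholders; real setups usually add
# test tasks and filtering to enforce quality gates.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

MANIFEST = "/opt/dbt/target/manifest.json"

with open(MANIFEST) as f:
    manifest = json.load(f)

with DAG("dbt_graph", start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False) as dag:
    models = {
        node_id: node
        for node_id, node in manifest["nodes"].items()
        if node["resource_type"] == "model"
    }

    # One "dbt run" task per model.
    tasks = {
        node_id: BashOperator(
            task_id=node["name"],
            bash_command=f"dbt run --select {node['name']}",
        )
        for node_id, node in models.items()
    }

    # Wire edges from the manifest's dependency graph.
    for node_id, node in models.items():
        for parent in node["depends_on"]["nodes"]:
            if parent in tasks:
                tasks[parent] >> tasks[node_id]
```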

EdgeExecutor / Edge Worker - The new option to run anywhere

2025-07-01
session

Airflow 3 extends the deployment options to run your workload anywhere. You don’t need to bring your data to Airflow; you can bring the execution to where it needs to be. You can connect any cloud and on-prem location together and build a hybrid workflow from one central Airflow instance. Only an HTTP connection is needed. We will present the use cases and concepts of the Edge deployment and how it works in a hybrid setup alongside Celery or other executors.
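
The routing mechanics underneath this are plain Airflow: a task pinned to a queue is only picked up by a worker, edge or Celery, that serves that queue. A hedged sketch, with the queue name and remote command as placeholders:

```python
# Sketch: route one task to a worker running at a remote/on-prem site.
# The remote worker only needs to reach the central deployment over HTTP(S);
# "onprem-site-a" is a placeholder queue that the remote worker subscribes to.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("hybrid_pipeline", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    # Runs on whatever default workers the central deployment has.
    prepare = BashOperator(task_id="prepare", bash_command="echo preparing inputs")

    # Runs next to the data: only workers listening on this queue execute it.
    process_on_prem = BashOperator(
        task_id="process_on_prem",
        bash_command="python /opt/jobs/process_local_data.py",
        queue="onprem-site-a",
    )

    prepare >> process_on_prem
```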

ELT and Elections: Cloud-agnostic patterns for real-time analysis

2025-07-01
session

Discover how Apache Airflow powers scalable ELT pipelines, enabling seamless data ingestion, transformation, and machine learning-driven insights. This session will walk through: Automating Data Ingestion: Using Airflow to orchestrate raw data ingestion from third-party sources into your data lake (S3, GCP), ensuring a steady pipeline of high-quality training and prediction data. Optimizing Transformations with Serverless Computing: Offloading intensive transformations to serverless functions (GCP Cloud Run, AWS Lambda) and machine learning models (BigQuery ML, SageMaker), integrating their outputs seamlessly into Airflow workflows. Real-World Impact: A case study on how INTRVL leveraged Airflow, BigQuery ML, and Cloud Run to analyze early voting data in near real-time, generating actionable insights on voter behavior across swing states. This talk not only provides a deep dive into the Political Tech space but also serves as a reference architecture for building robust, repeatable ELT pipelines. Attendees will gain insights into modern serverless technologies from AWS and GCP that enhance Airflow’s capabilities, helping data engineers design scalable, cloud-agnostic workflows.
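
As one hedged example of the offloading pattern, a TaskFlow task can hand a heavy transformation to AWS Lambda via boto3 and feed the result back into the DAG; the function name, payload, and S3 key below are placeholders, and the Cloud Run path would look similar with its own client library:

```python
# Sketch: offload a transformation to a serverless function from inside a DAG.
# "transform-voter-file" is a hypothetical Lambda function name.
import json
from datetime import datetime

import boto3
from airflow.decorators import dag, task


@dag(start_date=datetime(2025, 1, 1), schedule="@hourly", catchup=False)
def serverless_elt():
    @task
    def invoke_transform(batch_key: str) -> dict:
        client = boto3.client("lambda")
        response = client.invoke(
            FunctionName="transform-voter-file",
            Payload=json.dumps({"s3_key": batch_key}).encode(),
        )
        # The function returns a small JSON summary that downstream tasks consume.
        return json.loads(response["Payload"].read())

    @task
    def load_summary(summary: dict) -> None:
        print(f"Rows processed: {summary.get('rows', 'unknown')}")

    load_summary(invoke_transform("raw/2025-07-01/batch-001.json"))


serverless_elt()
```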

Empowering Precision Healthcare with Apache Airflow: iKang Healthcare Group’s DataHub Journey

2025-07-01
session

iKang Healthcare Group, serving nearly 10 million patients annually, built a centralized healthcare data hub powered by Apache Airflow to support its large-scale, real-time clinical operations. The platform integrates batch and streaming data in a lakehouse architecture, orchestrating complex workflows from data ingestion (HL7/FHIR) to clinical decision support. Healthcare data’s inherent complexity—spanning structured lab results to unstructured clinical notes—requires dynamic, reliable orchestration. iKang uses Airflow’s DAGs, extensibility, and workflow-as-code capabilities to address challenges like multi-system coordination, semantic data linking, and fault-tolerant automation. iKang extended Airflow with cross-DAG event triggers, task priority weights, LLM-driven clinical text processing, and a visual drag-and-drop DAG builder for medical teams. These innovations improved diagnostic turnaround, patient safety, and cross-system workflow visibility. iKang’s work demonstrates Airflow’s power in transforming healthcare data infrastructure and advancing intelligent, scalable patient care.
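
Two of the extensions mentioned, cross-DAG triggering and task priority weights, build on standard Airflow primitives; a hedged sketch of those primitives, with DAG and task IDs invented for illustration:

```python
# Sketch: an ingestion DAG that prioritizes urgent lab results and then
# kicks off a downstream clinical-decision-support DAG. IDs are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG("hl7_ingestion", start_date=datetime(2025, 1, 1), schedule="@hourly", catchup=False) as dag:
    parse_messages = BashOperator(
        task_id="parse_hl7_messages",
        bash_command="python parse_hl7.py",
        priority_weight=10,  # urgent clinical data gets scheduled ahead of backfill work
    )

    # Cross-DAG handoff: start the decision-support workflow once ingestion is done.
    trigger_cds = TriggerDagRunOperator(
        task_id="trigger_clinical_decision_support",
        trigger_dag_id="clinical_decision_support",
        wait_for_completion=False,
    )

    parse_messages >> trigger_cds
```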

Enabling SQL testing in Airflow workflows using Pydantic types

2025-07-01
session

This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
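
A hedged sketch of the core idea: describe a table as a Pydantic model, then use that one type both to build validated mock rows for SQL unit tests and to check production rows; the table and columns here are invented for illustration:

```python
# Sketch: a Pydantic "table type" that powers both mock data for SQL tests
# and row-level validation. The orders table and its columns are hypothetical.
from datetime import date
from decimal import Decimal

from pydantic import BaseModel, Field


class OrdersRow(BaseModel):
    """Schema for the (hypothetical) Glue table analytics.orders."""

    order_id: str
    customer_id: str
    order_date: date
    amount_usd: Decimal = Field(ge=0)


def mock_rows() -> list[dict]:
    """Build validated mock rows to inject into the base table under test."""
    rows = [
        OrdersRow(order_id="o-1", customer_id="c-1", order_date=date(2025, 7, 1), amount_usd=Decimal("19.99")),
        OrdersRow(order_id="o-2", customer_id="c-2", order_date=date(2025, 7, 1), amount_usd=Decimal("0")),
    ]
    return [r.model_dump() for r in rows]


def validate_rows(rows: list[dict]) -> None:
    """Reuse the same type as a production data quality check."""
    for row in rows:
        OrdersRow.model_validate(row)  # raises if a row violates the schema
```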

Enhancing Airflow REST API: From Basic Integration to Enterprise Scale

2025-07-01
session

Apache Airflow’s REST API has evolved to support diverse orchestration needs, with managed services like MWAA introducing custom enhancements. One such feature, InvokeRestApi, enables dynamic interactions with external services while maintaining Airflow’s core orchestration capabilities. In this talk, we will explore the architectural design behind InvokeRestApi, detailing how it enhances API-driven workflows. Beyond the architecture, we’ll share key challenges and learnings from implementing and scaling Airflow’s REST API in production environments. Topics include authentication, performance considerations, error handling, and best practices for integrating external APIs efficiently. Attendees will gain a deeper understanding of Airflow’s API extensibility, its implications for workflow automation, and actionable insights for building robust, API-driven orchestration solutions. Whether you’re an Airflow user or an architect, this session will provide valuable takeaways for simplifying API interactions across Airflow environments.
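
For orientation, the stock Airflow 2 stable REST API already supports triggering and inspecting runs over HTTP; a hedged example with host, credentials, and DAG ID as placeholders (MWAA’s InvokeRestApi adds its own calling conventions on top of this):

```python
# Sketch: trigger a DAG run through Airflow's stable REST API.
# The endpoint shape is /api/v1/dags/{dag_id}/dagRuns in Airflow 2.x;
# host, auth, and dag_id below are placeholders.
import requests

AIRFLOW_HOST = "https://airflow.example.com"
DAG_ID = "daily_reporting"

response = requests.post(
    f"{AIRFLOW_HOST}/api/v1/dags/{DAG_ID}/dagRuns",
    json={"conf": {"run_reason": "api-triggered"}},
    auth=("api_user", "api_password"),  # or a token, depending on the auth backend
    timeout=30,
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```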

Enhancing DAG Management with DMS: A Scalable Solution for Airflow

2025-07-01
session

In this talk, we will introduce the DAG Management Service (DMS), developed to address critical challenges in managing Airflow clusters. With over 10,000 active DAGs, a single Airflow cluster faces scaling limits and noisy neighbor issues, impacting task scheduling SLAs. DMS enhances reliability by distributing DAGs across multiple clusters and enforcing proper configurations. We will also discuss how DMS streamlines Airflow version upgrades. Upgrading from an old Airflow version to the latest requires sequential updates and code modifications for over 10,000 DAGs. DMS proposes an efficient upgrade method, reducing dependency on users. Key functions of DMS include: DAG Deployment: Selectively deploys DAG files from GitHub to Airflow clusters via an event-driven pipeline. DAG Migration: Facilitates seamless DAG migration between clusters, supporting both cluster upgrades and team-specific deployments. Connections and Variables Management: Centralizes management of connection IDs and variables, ensuring consistency and smooth migrations. Join us to explore how DMS can revolutionize your Airflow DAG management, enhancing scalability, reliability, and efficiency.

Enhancing Small Retailer Visibility: Machine Learning Pipelines with Apache Airflow

2025-07-01
session

Small retailers often lack the data visibility that larger companies rely on for decision-making. In this session, we’ll dive into how Apache Airflow powers end-to-end machine learning pipelines that process inventory and sales data, enabling retailers and suppliers to gain valuable industry insights. We’ll cover feature engineering, model training, and automated inference workflows, along with strategies for handling messy, incomplete retail data. We will discuss how Airflow enables scalable ML-driven insights that improve demand forecasting, product categorization, and supply chain optimization.

Enterprise Auditing: "The Verifiable Data Pipeline"

2025-07-01
session

This session will dive deep into leveraging the robust logging and audit capabilities of Google Cloud Platform, Cloud Composer and Apache Airflow to establish a fully transparent and verifiable data orchestration layer. We’ll demonstrate how to track and attribute every change—from environment configuration to individual task execution—essential for meeting stringent enterprise governance, compliance, and auditing requirements.

Event-Driven Airflow 3.0: Real-Time Orchestration with Pub/Sub

2025-07-01
session

Traditional time-based scheduling in Airflow can lead to inefficiencies and delays. With Airflow 3.0, we can now leverage native event-driven DAG execution, enabling workflows to trigger instantly when data arrives—eliminating polling-based sensors and rigid schedules. This talk explores real-time orchestration using Airflow 3.0 and Google Cloud Pub/Sub. We’ll showcase how to build an event-driven pipeline where DAGs automatically trigger as new data lands, ensuring faster and more efficient processing. Through a live demo, we’ll demonstrate how Airflow listens to Pub/Sub messages and dynamically triggers dbt transformations only when fresh data is available. This approach improves scalability, reduces costs, and enhances orchestration efficiency. Key takeaways: how event-driven DAGs work vs. traditional scheduling; best practices for integrating Airflow with Pub/Sub; eliminating polling-based sensors for efficiency; and a live demo of an event-driven pipeline with Airflow 3.0, Pub/Sub & dbt. This session will showcase how Airflow 3.0 enables truly real-time orchestration.
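
A hedged sketch of the asset-driven half of this pattern: a producer updates an asset when new data lands and the dbt DAG is scheduled on that asset rather than a cron expression. The import path follows Airflow 3’s authoring interface (Airflow 2.x uses Dataset from airflow.datasets); names are placeholders, and the Pub/Sub listener itself is outside this snippet:

```python
# Sketch: asset-driven scheduling instead of cron. Names are placeholders.
from datetime import datetime

from airflow.sdk import Asset, dag, task

raw_events = Asset("gs://analytics-landing/raw_events")


@dag(start_date=datetime(2025, 1, 1), schedule="@hourly", catchup=False)
def ingest_events():
    @task(outlets=[raw_events])
    def land_new_batch():
        # In the talk's setup this update would come from a Pub/Sub-driven
        # event rather than an hourly producer DAG.
        print("new batch landed; asset updated")

    land_new_batch()


@dag(schedule=[raw_events], start_date=datetime(2025, 1, 1), catchup=False)
def run_dbt_models():
    @task
    def dbt_build():
        print("run dbt only because fresh data arrived")

    dbt_build()


ingest_events()
run_dbt_models()
```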

Fine-Tuning Airflow: Parameters You May Not Know About

2025-07-01
session

The Bloomberg Data Platform Engineering team is responsible for managing, storing, and providing access to business and financial data used by financial professionals across the global capital markets. Our team utilizes Apache Airflow to orchestrate data workflows across various applications and Bloomberg Terminal functions. Over the years, we have fine-tuned our Airflow cluster to handle more than 1,000 ingestion DAGs, which has presented unique scalability challenges. In this session, we will share insights into several key Airflow parameters — some of which you may not be all that familiar with — that our team uses to optimize and scale the platform effectively.
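
For context, many of these knobs live in airflow.cfg (or the matching AIRFLOW__SECTION__KEY environment variables) and can be read back at runtime; the parameters below are common examples, not necessarily the ones the talk will cover:

```python
# Sketch: inspect a handful of frequently tuned Airflow settings.
# Each maps to airflow.cfg ([section] key) or AIRFLOW__SECTION__KEY env vars.
from airflow.configuration import conf

settings = {
    # Max task instances running across the whole deployment.
    "core.parallelism": conf.getint("core", "parallelism"),
    # Per-DAG concurrency ceilings.
    "core.max_active_tasks_per_dag": conf.getint("core", "max_active_tasks_per_dag"),
    "core.max_active_runs_per_dag": conf.getint("core", "max_active_runs_per_dag"),
    # How many processes parse DAG files, and how often files are re-parsed.
    "scheduler.parsing_processes": conf.getint("scheduler", "parsing_processes"),
    "scheduler.min_file_process_interval": conf.getint("scheduler", "min_file_process_interval"),
}

for name, value in settings.items():
    print(f"{name} = {value}")
```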

From Centralization to Autonomy: Managing Airflow Pipelines through Multi-Tenancy

2025-07-01
session

At the enterprise level, managing Airflow deployments across multiple teams can become complex, leading to bottlenecks and slowed development cycles. We will share our journey of decentralizing Airflow repositories to empower data engineering teams with multi-tenancy, clean folder structures, and streamlined DevOps processes. We dive into how restructuring our Airflow architecture and utilizing repository templates allowed teams to generate new data pipelines effortlessly. This approach enables engineers to focus on business logic without worrying about underlying Airflow configurations. By automating deployments and reducing manual errors through CI/CD pipelines, we minimized operational overhead. However, this transformation wasn’t without challenges. We’ll discuss obstacles we faced, such as maintaining code consistency, variables, and utility functions across decentralized repositories; ensuring compliance in a multi-tenant environment; and managing the learning curve associated with new workflows. Join us to discover practical insights on how decentralizing Airflow repositories can boost team productivity and adapt to evolving business needs with minimal effort.

From Complexity to Simplicity with TaskHarbor: Trendyol's Path to a Unified Orchestration Platform

2025-07-01
session

At Trendyol, Turkey’s leading e-commerce company, Apache Airflow powers our task orchestration, handling DAGs with 500+ tasks, complex interdependencies, and diverse environments. Managing on-prem Airflow instances posed challenges in scalability, maintenance, and deployment. To address these, we built TaskHarbor, a fully managed orchestration platform with a hybrid architecture—combining Airflow on GKE with on-prem resources for optimal performance and efficiency. This talk covers how we: Enabled seamless DAG synchronization across environments using GCS Fuse. Optimized workload distribution via GCP’s HTTPS & TCP Load Balancers. Automated infrastructure provisioning (GKE, CloudSQL, Kubernetes) using Terraform. Simplified Airflow deployments by replacing Helm YAML files with a custom templating tool, reducing configurations to 10-15 lines. Built a fully automated deployment pipeline, ensuring zero developer intervention. We enhanced efficiency, reliability, and automation in hybrid orchestration by embracing a scalable, maintainable, and cloud-native strategy. Attendees will obtain practical insights into architecting Airflow at scale and optimizing deployments.

From Cron to Data-Aware: Evolving Airflow Scheduling at Scale

2025-07-01
session
Yunhao Qing (Notion Data Platform)

As data platforms grow in complexity, so do the orchestration needs behind them. Time-based (cron) scheduling has long been the default in Airflow, but dataset-based scheduling promises a more data-aware, efficient alternative. In this session, I’ll share lessons learned from operating Airflow at scale—supporting thousands of DAGs across teams with varied use cases, from simple ETL to complex ML workflows. We’ll explore when dataset scheduling makes sense, the challenges it introduces, and how to evolve your DAG design and platform architecture to make the most of it. Whether you’re migrating legacy workflows or designing new ones, this talk will help you evaluate the right scheduling model for your needs.
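
A hedged before-and-after of the two scheduling models discussed, using the Airflow 2.x Dataset API (renamed Asset in Airflow 3); the dataset URI, DAG IDs, and commands are placeholders:

```python
# Sketch: the same downstream job scheduled by time vs. by data readiness.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.bash import BashOperator

orders = Dataset("s3://warehouse/raw/orders")

# Time-based: runs at 02:00 whether or not upstream data actually arrived.
with DAG("report_cron", start_date=datetime(2025, 1, 1), schedule="0 2 * * *", catchup=False):
    BashOperator(task_id="build_report", bash_command="python build_report.py")

# Data-aware: runs whenever a producer task with outlets=[orders] completes.
with DAG("report_data_aware", start_date=datetime(2025, 1, 1), schedule=[orders], catchup=False):
    BashOperator(task_id="build_report", bash_command="python build_report.py")
```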

From DAGs to Insights: Business-Driven Airflow Use Cases

2025-07-01
session

Airflow is integral to GitHub’s data and insight generation. This session dives into use cases from GitHub where key business decisions are driven, at the root, with the help of Airflow. The session will also highlight how both GitHub and Airflow celebrate, promote, and nurture OSS innovations in their own ways.

From Legacy to Leading Edge: How Airflow Migration Unlocked Cross-Team Business Value

2025-07-01
session

At TrueCar, migrating hundreds of legacy workflows from in-house orchestration tools to Apache Airflow required key technical decisions that transformed our data platform architecture and organizational capabilities. We consolidated individual chained tasks into optimized DAGs leveraging native Airflow functionality to trigger compute across cloud environments. A crucial breakthrough was developing DAG generators to scale migration—essential for efficiently migrating hundreds of workflows while maintaining consistency. By decoupling orchestration from compute, we gained flexibility to select optimal tools for specific outcomes—programmatic processing, analytics, batch jobs, or AI/ML pipelines. This resulted in cost reductions, performance improvements, and team agility. We also gained unprecedented visibility into DAG performance and dependency patterns previously invisible across fragmented systems. Attendees will learn how we redesigned complex workflows into efficient DAGs using dynamic task generation, architectural decisions that enabled platform innovation and the decision framework that made our migration transformational.
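
A hedged sketch of the DAG-generator idea: one template function plus a list of declarative workflow configs yields many consistent DAGs registered via globals(); the config fields and commands here are invented for illustration:

```python
# Sketch: generate many similar DAGs from declarative configs.
# In a real migration the configs would come from files or a service.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

WORKFLOWS = [
    {"name": "pricing_daily", "schedule": "0 4 * * *", "command": "python run_pricing.py"},
    {"name": "inventory_hourly", "schedule": "@hourly", "command": "python run_inventory.py"},
]


def build_dag(cfg: dict) -> DAG:
    with DAG(
        dag_id=f"legacy_{cfg['name']}",
        start_date=datetime(2025, 1, 1),
        schedule=cfg["schedule"],
        catchup=False,
    ) as dag:
        BashOperator(task_id="run_job", bash_command=cfg["command"])
    return dag


# Register each generated DAG so the Airflow parser discovers it.
for cfg in WORKFLOWS:
    globals()[f"legacy_{cfg['name']}"] = build_dag(cfg)
```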

From Oops to Secure Ops: Self-Hosted AI for Airflow Failure Diagnosis

2025-07-01
session

Last year, ‘From Oops to Ops’ showed how AI-powered failure analysis could help diagnose why Airflow tasks fail. But do we really need large, expensive cloud-based AI models to answer simple diagnostic questions? Relying on external AI APIs introduces privacy risks, unpredictable costs, and latency, often without clear benefits for this use case. With the rise of distilled, open-source models, self-hosted failure analysis is now a practical alternative. This talk will explore how to deploy an AI service on infrastructure you control, compare cost, speed, and accuracy between OpenAI’s API and self-hosted models, and showcase a live demo of AI-powered task failure diagnosis using DeepSeek and Llama—running without external dependencies to keep data private and costs predictable.
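
One hedged way to wire this up is an on_failure_callback that sends the exception and task identity to a locally hosted, OpenAI-compatible endpoint; the URL and model name below are placeholders for whatever self-hosted runtime serves DeepSeek or Llama:

```python
# Sketch: ask a self-hosted model to diagnose a failed task.
# http://llm.internal:8000/v1/chat/completions is a hypothetical local endpoint.
import requests


def diagnose_failure(context: dict) -> None:
    ti = context["task_instance"]
    prompt = (
        f"Airflow task {ti.dag_id}.{ti.task_id} failed on run {ti.run_id}.\n"
        f"Exception: {context.get('exception')}\n"
        "Suggest the most likely cause and a first debugging step."
    )
    response = requests.post(
        "http://llm.internal:8000/v1/chat/completions",
        json={
            "model": "llama-3-8b-instruct",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    diagnosis = response.json()["choices"][0]["message"]["content"]
    # Surface the diagnosis in the task log (or forward it to Slack, a ticket, etc.).
    print(f"AI diagnosis: {diagnosis}")


# Attach to an operator, e.g.:
# BashOperator(task_id="flaky_step", bash_command="...", on_failure_callback=diagnose_failure)
```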

From Repetition to Refactor: Smarter DAG Design in Airflow 3

2025-07-01
session

We will explore how Apache Airflow 3 unlocks new possibilities for smarter, more flexible DAG design. We’ll start by breaking down common anti-patterns in early DAG implementations, such as hardcoded operators, duplicated task logic, and rigid sequencing, that lead to brittle, unscalable workflows. From there, we’ll show how refactoring with the D.R.Y. (Don’t Repeat Yourself) principle, using techniques like task factories, parameterization, dynamic task mapping, and modular DAG construction, transforms these workflows into clean, reusable patterns. With Airflow 3, these strategies go further: enabling DAGs that are reusable across both batch pipelines and streaming/event-driven workloads, while also supporting ad-hoc runs for testing, one-off jobs, or backfills. The result is not just more concise code, but workflows that can flexibly serve different data processing modes without duplication. Attendees will leave with concrete patterns and best practices for building maintainable, production-grade DAGs that are scalable, observable, and aligned with modern data engineering standards.
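
To make the refactor concrete, a hedged sketch combining dynamic task mapping with a small task factory; the source names, load targets, and commands are placeholders:

```python
# Sketch: replace copy-pasted per-source tasks with dynamic task mapping,
# plus a small task factory for parameterized variants. Names are placeholders.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator

SOURCES = ["orders", "customers", "shipments"]


def make_load_task(target: str) -> BashOperator:
    """Task factory: one definition reused for every load target."""
    return BashOperator(
        task_id=f"load_{target}",
        bash_command=f"python load.py --target {target}",
    )


@dag(start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False)
def dry_pipeline():
    @task
    def extract(source: str) -> str:
        print(f"extracting {source}")
        return source

    @task
    def transform(source: str) -> str:
        print(f"transforming {source}")
        return source

    # Dynamic task mapping: one mapped task instance per source at runtime.
    transformed = transform.expand(source=extract.expand(source=SOURCES))

    # Factory-built load tasks run after all mapped transforms complete.
    for target in ["warehouse", "lake"]:
        transformed >> make_load_task(target)


dry_pipeline()
```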

Get Certified: DAG Authoring for Apache Airflow 3

2025-07-01
session

We’re excited to offer Airflow Summit 2025 attendees an exclusive opportunity to earn their DAG Authoring certification in person, now updated to include all the latest Airflow 3.0 features. This certification workshop comes at no additional cost to summit attendees. The DAG Authoring for Apache Airflow certification validates your expertise in advanced Airflow concepts and demonstrates your ability to build production-grade data pipelines. It covers TaskFlow API, Dynamic task mapping, Templating, Asset-driven scheduling, Best practices for production DAGs, and new Airflow 3.0 features and optimizations. The certification session includes: a 20-minute preparation period with expert guidance, a live Q&A session with Marc Lamberti from Astronomer, a 60-minute examination period, and real-time results with immediate feedback. To prepare for the Airflow Certification, visit the Astronomer Academy (https://academy.astronomer.io/page/astronomer-certification).

Get started with Airflow 3.0

2025-07-01
session

Airflow 3.0 is the most significant release in the project’s history and brings a better user experience, stronger security, and the ability to run tasks anywhere, at any time. In this workshop, you’ll get hands-on experience with the new release and learn how to leverage new features like DAG versioning, backfills, data assets, and a new React-based UI. Whether you’re writing traditional ELT/ETL pipelines or complex ML and GenAI workflows, you’ll learn how Airflow 3 will make your day-to-day work smoother and your pipelines even more flexible. This workshop is suitable for intermediate to advanced Airflow users. Beginning users should consider taking the Airflow fundamentals course on the Astronomer Academy before attending this workshop.

GitHub's Airflow Journey: Lessons, Mistakes, and Insights

2025-07-01
session

This session explores how GitHub uses Apache Airflow for efficient data engineering. We will share nearly 9 years of experiences, including lessons learnt, mistakes made, and the ways we reduced our on-call and engineering burden. We’ll demonstrate how we keep data flowing smoothly while continuously evolving Airflow and other components of our data platform, ensuring safety and reliability. The session will touch on how we migrate Airflow between clouds without user impact. We’ll also cover how we cut down the time from idea to running a DAG in production, despite our Airflow repo being among the top 15 by number of PRs within GitHub. We’ll dive into specific techniques such as testing connections and operators, relying on dag-sync, providing short-lived development environments to let developers test their DAG runs, and creating reusable patterns for DAGs. By the end of this session, you will gain practical insights and actionable strategies to improve your own data engineering processes.
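
One of the techniques mentioned, testing DAGs before merge, often starts as a small pytest module over the DagBag; a hedged sketch, with the DAGs folder path as a placeholder:

```python
# Sketch: CI test that every DAG in the repo imports cleanly and has an owner.
from airflow.models import DagBag


def test_dags_import_without_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"Import failures: {dag_bag.import_errors}"


def test_dags_have_owners():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    missing = [
        dag_id
        for dag_id, dag in dag_bag.dags.items()
        if not dag.default_args.get("owner")
    ]
    assert not missing, f"DAGs without an owner: {missing}"
```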

How Airflow can help with Data Management and Governance

2025-07-01
session

Metadata management is a cornerstone of effective data governance, yet it presents unique challenges distinct from traditional data engineering. At scale, efficiently extracting metadata from relational and NoSQL databases demands specialized solutions. To address this, our team has developed custom Airflow operators that scan and extract metadata across various database technologies, orchestrating 100+ production jobs to ensure continuous and reliable metadata collection. Now, we’re expanding beyond databases to tackle non-traditional data sources such as file repositories and message queues. This shift introduces new complexities, including processing structured and unstructured files, managing schema evolution in streaming data, and maintaining metadata consistency across heterogeneous sources. In this session, we’ll share our approach to building scalable metadata scanners, optimizing performance, and ensuring adaptability across diverse data environments. Attendees will gain insights into designing efficient metadata pipelines, overcoming common pitfalls, and leveraging Airflow to drive metadata governance at scale.
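
A hedged skeleton of the kind of custom operator described here, scanning a PostgreSQL catalog through the Postgres provider’s hook; the connection ID and schema are placeholders, and a real scanner would write to a metadata store rather than return results via XCom:

```python
# Sketch: a custom operator that extracts column-level metadata from Postgres.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class PostgresMetadataScanOperator(BaseOperator):
    def __init__(self, *, conn_id: str, schema_name: str, **kwargs):
        super().__init__(**kwargs)
        self.conn_id = conn_id
        self.schema_name = schema_name

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.conn_id)
        rows = hook.get_records(
            """
            SELECT table_name, column_name, data_type, is_nullable
            FROM information_schema.columns
            WHERE table_schema = %s
            ORDER BY table_name, ordinal_position
            """,
            parameters=(self.schema_name,),
        )
        self.log.info("Scanned %d columns in schema %s", len(rows), self.schema_name)
        return [
            {"table": t, "column": c, "type": d, "nullable": n == "YES"}
            for t, c, d, n in rows
        ]
```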

How Airflow Runs The Weather

2025-07-01
session

Forecasting the weather and air quality is a logistical challenge. Numerical simulations are complex, resource-hungry, and sometimes fail without warning. Yet, our clients depend on accurate forecasts delivered daily and on time. At the heart of this operation is Airflow: the orchestration engine that keeps everything running. In this session, we’ll dive into the world behind weather and air quality forecasts. In particular, we’ll explore: the atmospheric modeling pipeline, to understand the unique demands it places on infrastructure; how we use Airflow to orchestrate complex simulations reliably and at scale, to inspire new ways of managing time-critical, compute-heavy workflows; and our integration of Airflow with a high-performance computing (HPC) environment using Slurm, to run resource-intensive workloads efficiently on bare-metal machines. At Meteosim we are experts in weather and air quality intelligence. With projects in over 80 countries, we support decision-making in industries where weather and air quality matter most: from daily operations to long-term sustainability.
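
A hedged sketch of one way to bridge Airflow and Slurm, not necessarily Meteosim’s implementation: submit the job over SSH with sbatch, capture the job ID, and poll its state with sacct; the SSH connection ID, script path, and polling cadence are placeholders:

```python
# Sketch: submit a Slurm job from Airflow and wait for it to finish.
# "hpc_login_node" is a hypothetical SSH connection to the cluster head node.
import time
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.ssh.hooks.ssh import SSHHook


def _run(hook: SSHHook, command: str) -> str:
    client = hook.get_conn()
    try:
        _, stdout, _ = client.exec_command(command)
        return stdout.read().decode().strip()
    finally:
        client.close()


@dag(start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False)
def air_quality_forecast():
    @task
    def submit_simulation() -> str:
        hook = SSHHook(ssh_conn_id="hpc_login_node")
        # --parsable makes sbatch print just the numeric job ID.
        return _run(hook, "sbatch --parsable /opt/models/run_forecast.sbatch")

    @task
    def wait_for_completion(job_id: str) -> None:
        hook = SSHHook(ssh_conn_id="hpc_login_node")
        while True:
            state = _run(hook, f"sacct -j {job_id} --format=State --noheader -X")
            if state.startswith("COMPLETED"):
                return
            if state.startswith(("FAILED", "CANCELLED", "TIMEOUT")):
                raise RuntimeError(f"Slurm job {job_id} ended in state {state}")
            time.sleep(60)

    wait_for_completion(submit_simulation())


air_quality_forecast()
```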