
Event

Airflow Summit 2025

2025-07-01 · Airflow Summit

Activities tracked

140

Airflow Summit 2025 program

Sessions & talks

Showing 101–125 of 140


Orchestrating AI Knowledge Bases with Apache Airflow

2025-07-01
session

In the age of Generative AI, knowledge bases are the backbone of intelligent systems, enabling them to deliver accurate and context-aware responses. But how do you ensure that these knowledge bases remain up-to-date and relevant in a rapidly changing world? Enter Apache Airflow, a robust orchestration tool that streamlines the automation of data workflows. This talk will explore how Airflow can be leveraged to manage and update AI knowledge bases across multiple data sources. We’ll dive into the architecture, demonstrate how Airflow enables efficient data extraction, transformation, and loading (ETL), and share insights on tackling challenges like data consistency, scheduling, and scalability. Whether you’re building your own AI-driven systems or looking to optimize existing workflows, this session will provide practical takeaways to make the most of Apache Airflow in orchestrating intelligent solutions.
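
To ground the idea, a knowledge-base refresh pipeline in Airflow often reduces to a small extract/embed/load DAG like the sketch below. This is a generic illustration rather than code from the talk; the work done inside each task is hypothetical.

    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
    def refresh_knowledge_base():
        @task
        def extract() -> list[str]:
            # Pull new or changed documents from your sources (wikis, tickets, docs).
            return ["doc-1 text", "doc-2 text"]  # placeholder payload

        @task
        def embed(docs: list[str]) -> list[list[float]]:
            # Call an embedding model of your choice here (hypothetical step).
            return [[0.1, 0.2] for _ in docs]

        @task
        def load(vectors: list[list[float]]) -> None:
            # Upsert embeddings into the vector store backing the knowledge base.
            print(f"loaded {len(vectors)} embeddings")

        load(embed(extract()))

    refresh_knowledge_base()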

Orchestrating Apache Airflow ML Workflows at Scale with SageMaker Unified Studio

2025-07-01
session

As organizations increasingly rely on data-driven applications, managing the diverse tools, data, and teams involved can create challenges. Amazon SageMaker Unified Studio addresses this by providing an integrated, governed platform to orchestrate end-to-end data and AI/ML workflows. In this workshop, we’ll explore how to leverage Amazon SageMaker Unified Studio to build and deploy scalable Apache Airflow workflows that span the data and AI/ML lifecycle. We’ll walk through real-world examples showcasing how this AWS service brings together familiar Airflow capabilities with SageMaker’s data processing, model training, and inference features - all within a unified, collaborative workspace. Key topics covered:
• Authoring and scheduling Airflow DAGs in SageMaker Unified Studio
• Understanding how Apache Airflow powers workflow orchestration under the hood
• Leveraging SageMaker capabilities like Notebooks, Data Wrangler, and Models
• Implementing centralized governance and workflow monitoring
• Enhancing productivity through unified development environments
Join us to transform your ML workflow experience from complex and fragmented to streamlined and efficient.
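
For readers who want a concrete picture of the first topic above, the sketch below shows the general shape of scheduling a SageMaker training job from an Airflow DAG using the Amazon provider’s SageMakerTrainingOperator. It is illustrative only, not workshop material; the image URI, role ARN, and bucket are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator

    # Minimal sketch: schedule a daily SageMaker training job from Airflow.
    with DAG(
        dag_id="sagemaker_training_example",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ):
        SageMakerTrainingOperator(
            task_id="train_model",
            # The config dict is passed through to SageMaker's CreateTrainingJob
            # API; all values below are placeholders.
            config={
                "TrainingJobName": "demo-job-{{ ds_nodash }}",
                "AlgorithmSpecification": {
                    "TrainingImage": "<ecr-image-uri>",
                    "TrainingInputMode": "File",
                },
                "RoleArn": "<sagemaker-execution-role-arn>",
                "OutputDataConfig": {"S3OutputPath": "s3://<bucket>/output/"},
                "ResourceConfig": {
                    "InstanceType": "ml.m5.xlarge",
                    "InstanceCount": 1,
                    "VolumeSizeInGB": 30,
                },
                "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
            },
            wait_for_completion=True,
        )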

Orchestrating Databricks with Airflow: Unlocking the Power of MVs, Streaming Tables, and AI

2025-07-01
session
Tahir Fayyaz (Google Cloud Platform team specialising in Data & Machine Learning, BigQuery expert), Shanelle Roman

As data workloads grow in complexity, teams need seamless orchestration to manage pipelines across batch, streaming, and AI/ML workflows. Apache Airflow provides a flexible and open-source way to orchestrate Databricks’ entire platform, from SQL analytics with Materialized Views (MVs) and Streaming Tables (STs) to AI/ML model training and deployment. In this session, we’ll showcase how Airflow can automate and optimize Databricks workflows, reducing costs and improving performance for large-scale data processing. We’ll highlight how MVs and STs eliminate manual incremental logic, enable real-time ingestion, and enhance query performance—all while maintaining governance and flexibility. Additionally, we’ll demonstrate how Airflow simplifies ML model lifecycle management by integrating Databricks’ AI/ML capabilities into end-to-end data pipelines. Whether you’re a dbt user seeking better performance, a data engineer managing streaming pipelines, or an ML practitioner scaling AI workloads, this session will provide actionable insights on using Airflow and Databricks together to build efficient, cost-effective, and future-proof data platforms.
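
As a rough illustration of the pattern (not code from the talk), a DAG might refresh a materialized view through the Databricks SQL operator and then launch a training run; the connection id, warehouse name, view, cluster spec, and notebook path below are all hypothetical.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
    from airflow.providers.databricks.operators.databricks_sql import DatabricksSqlOperator

    with DAG(
        dag_id="databricks_mv_and_training",
        start_date=datetime(2025, 1, 1),
        schedule="@hourly",
        catchup=False,
    ):
        # Refresh a materialized view; Databricks handles the incremental logic.
        refresh_mv = DatabricksSqlOperator(
            task_id="refresh_sales_mv",
            databricks_conn_id="databricks_default",
            sql_endpoint_name="<sql-warehouse-name>",
            sql="REFRESH MATERIALIZED VIEW analytics.sales_summary",
        )

        # Then launch a one-off model-training run on a jobs cluster.
        train = DatabricksSubmitRunOperator(
            task_id="train_model",
            databricks_conn_id="databricks_default",
            json={
                "new_cluster": {
                    "spark_version": "14.3.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 2,
                },
                "notebook_task": {"notebook_path": "/ML/train_model"},
            },
        )

        refresh_mv >> train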


Orchestrating Data Quality - Quality Data Brought To You By Airflow

2025-07-01
session

Ensuring high-quality data is essential for building user trust and enabling data teams to work efficiently. In this talk, we’ll explore how the Astronomer data team leverages Airflow to uphold data quality across complex pipelines, minimizing firefighting and maximizing confidence in reported metrics. Maintaining data quality requires a multi-faceted approach: safeguarding the integrity of source data, orchestrating pipelines reliably, writing robust code, and maintaining consistency in outputs. We’ve embedded data quality into the developer experience, so it’s always at the forefront instead of in the backlog of tech debt. We’ll share how we’ve operationalized:
• Implementing data contracts to define and enforce expectations
• Differentiating between critical (pipeline-blocking) and non-critical (soft) failures
• Exposing upstream data issues to domain owners
• Tracking metrics to measure our team’s overall data quality
Join us to learn practical strategies for building scalable, trustworthy data systems powered by Airflow.
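
The talk does not prescribe specific operators, but one common way to express a pipeline-blocking check in Airflow is the common.sql provider’s SQLColumnCheckOperator. A minimal sketch, with a hypothetical connection and table, and the soft-failure path reduced to a task that only logs:

    from datetime import datetime

    from airflow.decorators import dag, task
    from airflow.providers.common.sql.operators.sql import SQLColumnCheckOperator

    @dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
    def data_quality_example():
        # Critical check: the task (and everything downstream) fails if violated.
        blocking_check = SQLColumnCheckOperator(
            task_id="orders_not_null",
            conn_id="warehouse_default",  # hypothetical connection
            table="analytics.orders",
            column_mapping={"order_id": {"null_check": {"equal_to": 0}}},
        )

        @task
        def soft_check():
            # Non-critical check: warn and notify owners, but never block the DAG.
            import logging
            logging.warning("Soft check: row count drifted; notifying domain owner.")

        blocking_check >> soft_check()

    data_quality_example()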

Orchestrating MLOps and Data Transformation at EDB with Airflow

2025-07-01
session

This talk explores EDB’s journey from siloed reporting to a unified data platform, powered by Airflow. We’ll delve into the architectural evolution, showcasing how Airflow orchestrates a diverse range of use cases, from Analytics Engineering to complex MLOps pipelines. Learn how EDB leverages Airflow and Cosmos to integrate dbt for robust data transformations, ensuring data quality and consistency. We’ll provide a detailed case study of our MLOps implementation, demonstrating how Airflow manages training, inference, and model monitoring pipelines for Azure Machine Learning models. Discover the design considerations driven by our internal data governance framework and gain insights into our future plans for AIOps integration with Airflow.

Orchestrating Travel Insights: Priceline's MLOps with Airflow

2025-07-01
session

The journey from ML model development to production deployment and monitoring is often complex and fragmented. How can teams overcome the chaos of disparate tools and processes? This session dives into how Apache Airflow serves as a unifying force in MLOps. We’ll begin with a look at the broader MLOps trends observed by Google within the Airflow community, highlighting how Airflow is evolving to meet these challenges and showcasing diverse MLOps use cases – both current and future. Then, Priceline will present a deep-dive case study on their MLOps transformation. Learn how they leveraged Cloud Composer, Google Cloud’s managed Apache Airflow service, to orchestrate their entire ML pipeline end-to-end: ETL, data preprocessing, model building & training, Dockerization, Google Artifact Registry integration, deployment, model serving, and evaluation. Discover how using Cloud Composer on GCP enabled them to build a scalable, reliable, adaptable, and maintainable MLOps practice, moving decisively from chaos to coordination. Cloud Composer (Airflow) has served as a major backbone in transforming the whole ML experience at Priceline. Join us to learn how to harness Airflow, particularly within a managed environment like Cloud Composer, for robust MLOps workflows, drawing lessons from both industry trends and a concrete, successful implementation.

Orchestrator of Orchestrators: Uniting Airflow Pipelines with Business Applications in Production

2025-07-01
session

Airflow powers thousands of data and ML pipelines—but in the enterprise, these pipelines often need to interact with business-critical systems like ERPs, CRMs, and core banking platforms. In this demo-driven session we will connect Airflow with Control-M from BMC and showcase how Airflow can participate in end-to-end workflows that span not just data platforms but also transactional business applications. Session highlights:
• Trigger Airflow DAGs based on business events (e.g., invoice approvals, trade settlements)
• Feed Airflow pipeline outputs into ERP systems (e.g., SAP) or CRMs (e.g., Salesforce)
• Orchestrate multi-platform workflows from cloud to mainframe with SLA enforcement, dependency management, and centralized control
• Provide unified monitoring and auditing across data and application layers
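
Control-M’s side of the integration is proprietary, but the underlying mechanism for the first highlight, triggering a DAG run from an external system, is Airflow’s stable REST API. A minimal sketch, assuming basic auth and a hypothetical DAG id:

    import requests

    # Trigger a DAG run from an external system via Airflow's stable REST API.
    # Host, credentials, dag_id, and conf payload are placeholders.
    AIRFLOW = "http://localhost:8080"
    resp = requests.post(
        f"{AIRFLOW}/api/v1/dags/process_invoice/dagRuns",
        auth=("admin", "admin"),
        json={"conf": {"invoice_id": "INV-1234", "source": "erp"}},
    )
    resp.raise_for_status()
    print(resp.json()["dag_run_id"])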


Pittsburgh Goes With The Flow - Use Cases In Local Government

2025-07-01
session

The City of Pittsburgh utilizes Airflow (via Astronomer) for a wide variety of tasks. From employee-focused use cases, like time bank balancing and internal dashboards, to public-facing publication, the City’s data flows through our DAGs from many sources to many destinations. Airflow acts as a funnel point and is an essential tool for Pittsburgh’s Data Services team.

Productionising dbt-core with Airflow

2025-07-01
session

As a popular open-source library for analytics engineering, dbt is often combined with Airflow. Orchestrating and executing dbt models as DAGs ensures an additional layer of control over tasks and observability, and provides a reliable, scalable environment to run dbt models. This workshop will cover a step-by-step guide to Cosmos, a popular open-source package from Astronomer that helps you quickly run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:
• Running and visualising your dbt transformations
• Managing dependency conflicts
• Defining database credentials (profiles)
• Configuring source and test nodes
• Using dbt selectors
• Customising arguments per model
• Addressing performance challenges
• Leveraging deferrable operators
• Visualising dbt docs in the Airflow UI
• An example of how to deploy to production
• Troubleshooting
We encourage participants to bring their own dbt project to follow this step-by-step workshop.
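
To give a flavour of the “few lines of code” claim, a minimal Cosmos setup might look like the sketch below; the project path, profile, and schedule are placeholders, and exact argument names can vary between Cosmos versions.

    from datetime import datetime

    from cosmos import DbtDag, ProfileConfig, ProjectConfig

    # Render a dbt Core project as an Airflow DAG; each model becomes a task.
    dbt_dag = DbtDag(
        dag_id="my_dbt_project",
        project_config=ProjectConfig("/usr/local/airflow/dags/dbt/my_project"),
        profile_config=ProfileConfig(
            profile_name="my_profile",  # must match profiles.yml
            target_name="prod",
            profiles_yml_filepath="/usr/local/airflow/dags/dbt/profiles.yml",
        ),
        schedule="@daily",
        start_date=datetime(2025, 1, 1),
        catchup=False,
    )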

Purple is the new green: harnessing deferrable operators to improve performance & reduce costs

2025-07-01
session

Airflow’s traditional execution model often leads to wasted resources: worker nodes sitting idle, waiting on external systems. At Wix, we tackled this inefficiency head-on by refactoring our in-house operators to support Airflow’s deferrable execution model. Join us on a walk through Wix’s journey to a more efficient Airflow setup, from identifying bottlenecks to implementing deferrable operators and reaping their benefits. We’ll share the alternatives considered, the refactoring process, and how the team seamlessly integrated deferrable execution with no disruption to data engineers’ workflows. Attendees will get a practical introduction to deferrable operators: how they work, what they require, and how to implement them. We’ll also discuss the changes made to Wix’s Airflow environment, the process of prioritizing operators for modification, and lessons learned from testing and rollout. By the end of this talk, attendees will be ready to embrace more purple tasks in their Airflow UI, boosting efficiency, cutting costs, and making their workflows greener in every other way.
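
As background for the pattern the talk covers, here is a minimal deferrable-operator skeleton built on Airflow’s built-in TimeDeltaTrigger; it is a generic illustration, not Wix’s code.

    from datetime import timedelta

    from airflow.models.baseoperator import BaseOperator
    from airflow.triggers.temporal import TimeDeltaTrigger

    class WaitThenRunOperator(BaseOperator):
        """Frees its worker slot while waiting instead of blocking it."""

        def execute(self, context):
            # Hand the wait over to the triggerer; the worker slot is released
            # and the task shows as purple ("deferred") in the Airflow UI.
            self.defer(
                trigger=TimeDeltaTrigger(timedelta(minutes=10)),
                method_name="execute_complete",
            )

        def execute_complete(self, context, event=None):
            # Resumes on a (possibly different) worker once the trigger fires.
            self.log.info("External wait finished; doing the actual work now.")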

Run Airflow tasks on your coffee machine

2025-07-01
session

Airflow 3 comes with two new features: edge execution and the Task SDK. Powered by an HTTP API, these make it possible to write and execute Airflow tasks in any language from anywhere. In this session I will explain some of the APIs needed and show how to interact with them, based on an embedded toy worker written in Rust and running on an ESP32-C3. Furthermore, I will provide practical tips on writing your own edge worker and on developing against a running instance of Airflow.

Scaling Airflow at OpenAI

2025-07-01
session

This talk shares how we scaled and hardened OpenAI’s Airflow deployment to orchestrate thousands of workflows on Kubernetes. We’ll cover key architecture choices, scaling strategies, and reliability improvements - along with practical lessons learned.

Scaling and Unifying Multiple Airflow Instances with Orchestration Frederator

2025-07-01
session

In large organizations, multiple Apache Airflow instances often arise organically—driven by team-specific needs, distinct use cases, or tiered workloads. This fragmentation introduces complexity, operational overhead, and higher infrastructure costs. To address these challenges, we developed the “Orchestration Frederator,” a solution designed to unify and horizontally scale multiple Airflow deployments seamlessly. This session will detail our journey in implementing Orchestration Frederator, highlighting how we achieved:
• Horizontal scalability: seamlessly scaling Airflow across multiple instances without operational overhead
• End-to-end data lineage: constructing comprehensive data lineage across disparate Airflow deployments to simplify monitoring and debugging
• Multi-region support: introducing multi-region capabilities, enhancing reliability and disaster recovery
• Unified ecosystem: consolidating previously fragmented Airflow environments into a cohesive orchestration platform
Join us to explore practical strategies, technical challenges, lessons learned, and best practices for enhancing scalability, reliability, and maintainability in large-scale Airflow deployments.

Scaling ML Infrastructure: Lessons from Building Distributed Systems

2025-07-01
session

In today’s data-driven world, scalable ML infrastructure is mission-critical. As ML workloads grow, orchestration tools like Apache Airflow become essential for managing pipelines, training, deployment, and observability. In this talk, I’ll share lessons from building distributed ML systems across cloud platforms, including GPU-based training and AI-powered healthcare. We’ll cover patterns for scaling Airflow DAGs, integrating telemetry and auto-healing, and aligning cross-functional teams. Whether you’re launching your first pipeline or managing ML at scale, you’ll gain practical strategies to make Airflow the backbone of your ML infrastructure.

Seamless Airflow Upgrades: Migrating from 2.x to 3

2025-07-01
session

Airflow 3 has officially arrived! In this session, we’ll start by discussing prerequisites for a smooth upgrade from Airflow 2.x to Airflow 3, including Airflow version requirements, removing deprecated SubDAGs, and backing up and cleaning your metadata database prior to migration. We’ll then explore the new CLI utility, airflow config update [--fix], for auto-applying configuration changes. We’ll demo cleaning old XCom data to speed up schema migration. During this session, attendees will learn to verify and adapt their pipelines for Airflow 3 using a Ruff-based upgrade utility. I will demo running ruff check dag/ --select AIR301 to surface scheduling issues, inspecting fixes via ruff check dag/ --select AIR301 --show-fixes, and applying corrections with ruff check dag/ --select AIR301 --fix. We’ll also examine rules AIR302 for deprecated config and AIR303 for provider package migrations. By the end, your DAGs will pass all AIR3xx checks error-free. Join this session for live demos and practical examples that will empower you to confidently upgrade, minimise downtime, and achieve optimal performance in Airflow 3.
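
Collected in one place, the upgrade commands mentioned above look roughly like this; exact flags may vary between Airflow and Ruff releases.

    # Auto-apply Airflow 3 configuration changes (review first, then fix).
    airflow config update
    airflow config update --fix

    # Lint DAG files for Airflow 3 compatibility with Ruff's AIR rules.
    ruff check dag/ --select AIR301               # surface scheduling issues
    ruff check dag/ --select AIR301 --show-fixes  # inspect proposed fixes
    ruff check dag/ --select AIR301 --fix         # apply corrections
    ruff check dag/ --select AIR302               # deprecated config
    ruff check dag/ --select AIR303               # provider package migrations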

Seamless Integration: Building Applications That Leverage Airflow's Database Migration Framework

2025-07-01
session

This session presents a comprehensive guide to building applications that integrate with Apache Airflow’s database migration system. We’ll explore how to harness Airflow’s robust Alembic-based migration toolchain to maintain schema compatibility between Airflow and custom applications, enabling developers to create solutions that evolve alongside the Airflow ecosystem without disruption.
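
The abstract stays at the conceptual level, but the generic Alembic pattern it builds on looks like this sketch: the application keeps its own migration environment and runs it programmatically so its schema evolves alongside Airflow’s. The ini path and connection URL are placeholders.

    from alembic import command
    from alembic.config import Config

    # Run an application's own Alembic migrations against the database that
    # also hosts Airflow's metadata, keeping the two schemas in step.
    cfg = Config("alembic.ini")  # the app's migration environment
    cfg.set_main_option("sqlalchemy.url", "postgresql://user:pw@host/airflow_db")
    command.upgrade(cfg, "head")  # apply all pending revisions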

Seamless Migration: Leveraging Ruff for a Smooth Transition from Airflow 2 to Airflow 3

2025-07-01
session

Migrating from Airflow 2 to the newly released Airflow 3 may seem intimidating due to numerous breaking changes and the introduction of new features. Although a backward-compatibility layer has been implemented and most existing DAGs should work fine, some features—such as SubDAGs and execution_date—have been removed based on community consensus. To support this transition, we worked with Ruff to establish rules that automatically identify removed or deprecated features and even assist in fixing them. In this presentation, I will outline our current Ruff features, the migration rules from Airflow 2 to 3, and how this experience opens the door for us to promote best practices in Airflow through Ruff in the future. After this session, Airflow users will understand how Ruff can facilitate a smooth transition to Airflow 3. As Airflow developers, we will delve into the details of how these migration rules were implemented and discuss how we can leverage this knowledge to introduce linting rules that encourage Airflow users to adopt best practices.

Securing Airflow CLI with API

2025-07-01
session

This talk will explore the key changes introduced by AIP-81, focusing on security enhancements and user experience improvements across the entire software development lifecycle. We will break down the technical advancements from both a security and usability perspective, addressing key questions for Apache Airflow users of all levels. Topics include, but are not limited to: isolating CLI communication to enhance security by leveraging Role-Based Access Control (RBAC) within the API for secure database interactions, clearly defining local vs. remote command execution, and future improvements.

Security made us do it: Airflow’s new Task Execution Architecture

2025-07-01
session

Airflow v2 architecture has strong coupling between the Airflow core and the user code running in an Airflow task. This poses barriers in security, maintenance, and adoption. One such threat is that user code can access Airflow’s source of truth, the metadata DB, and run any query against it! From a scalability angle, ‘n’ tasks create ‘n’ DB connections, limiting Airflow’s ability to scale effectively. To address this we proposed AIP-72, a client-server model for task execution. The new architecture addresses several long-standing issues, including DB isolation from workers, dependency conflicts between the Airflow core and workers, and the ‘n’ number of DB connections. The new architecture has two parts:
• Execution API Server: tasks no longer have direct DB access; they use this new slim, secure API
• Task SDK: a lightweight toolkit that lets you write tasks without drowning in Airflow’s codebase
Beyond isolation and security, the redesign unlocks native multi-language task authoring support and secure Remote Execution. Join us to explore how AIP-72 transforms Airflow task execution, paving the way for more secure, flexible, and future-proof task orchestration!
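
For a feel of the Task SDK side, a minimal Airflow 3 task written against the airflow.sdk namespace looks like the sketch below; DAG and task names are illustrative. The point is that the task code never touches the metadata database, only the SDK surface.

    from airflow.sdk import dag, task

    @dag(schedule=None)
    def isolated_example():
        @task
        def fetch() -> dict:
            # Runs in an isolated process; any state it needs (XComs, variables,
            # connections) is mediated by the Execution API, never direct DB access.
            return {"rows": 42}

        @task
        def report(payload: dict):
            print(f"processed {payload['rows']} rows")

        report(fetch())

    isolated_example()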

Semiconductor (Chip) Design Workflow Orchestration with Airflow

2025-07-01
session

The design of Qualcomm’s Snapdragon System-On-Chip (SoCs) involves several hundred complex workflows orchestrated across multiple data centers, taking the design from RTL to GDS. In the Snapdragon Oryon Custom CPU team, we introduced Airflow about 2 years ago to orchestrate design, verification, emulation, CI/CD, and physical implementation of our CPUs. Use cases:
• Standardization and Templatization: We standardize and templatize common workflows, allowing designers to verify their designs by customizing YAML parameters.
• Custom Shell Operators: We created custom shell operators (tcshrc) to source project environments and work with internal tooling.
• Smart Retries: We use pre/post-execute hooks to trigger smart retries on failure.
• Dynamic Celery Workers: We auto-create Celery workers on the fly on our High-Performance Compute (HPC) clusters to launch and manage Electronic Design Automation (EDA) workloads.
• Hybrid Executor Strategy: We use a hybrid executor strategy (CeleryExecutor and EdgeExecutor) to orchestrate tasks across multiple data centers.
• EdgeExecutor for Remote Testing: We leverage EdgeExecutor to access post-silicon hardware in remote locations.
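
As a generic illustration of the pre/post-execute hook pattern mentioned above (not Qualcomm’s internal code), a custom operator can adjust resources per attempt and raise on bad results so Airflow’s retry machinery re-runs it; the HPC submission step is a placeholder.

    from airflow.exceptions import AirflowException
    from airflow.models.baseoperator import BaseOperator

    class SmartRetryShellOperator(BaseOperator):
        """Sketch: retry an EDA shell job with more memory on each attempt."""

        def __init__(self, command: str, **kwargs):
            super().__init__(**kwargs)
            self.command = command
            self.mem_gb = 16

        def pre_execute(self, context):
            # Scale the resource request with the attempt number (1 on first try).
            attempt = context["ti"].try_number
            self.mem_gb = 16 * attempt
            self.log.info("Attempt %s: requesting %s GB", attempt, self.mem_gb)

        def execute(self, context):
            return self._run_on_hpc(self.command, mem_gb=self.mem_gb)

        def post_execute(self, context, result=None):
            # A non-zero exit code fails the task, triggering a (smarter) retry.
            if result != 0:
                raise AirflowException(f"EDA job failed with exit code {result}")

        def _run_on_hpc(self, command: str, mem_gb: int) -> int:
            # Placeholder for submitting to an HPC scheduler (LSF, Slurm, ...).
            self.log.info("Would run %r with %s GB of memory", command, mem_gb)
            return 0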

Simplifying DAG creation with an AI-powered IDE for Airflow

2025-07-01
session

As the demand for data products grows, data engineering teams face mounting pressure to deliver more and even faster, often becoming bottlenecks. Astro IDE changes the game. Astro IDE is an AI-powered code editor built for Apache Airflow. It helps data teams go from idea to production in minutes—generating production-ready DAGs, enabling in-browser testing, and integrating directly with Git. In this session, see how Astro IDE accelerates DAG creation, debugging, and deployment so data engineering teams can deliver more, 10x faster.


Simplifying Data Lineage: How OpenLineage Empowers Airflow and Beyond

2025-07-01
session

OpenLineage has simplified collecting lineage metadata across the data ecosystem by standardizing its representation in an extensible model. It has enabled a whole ecosystem of tools that improve data pipeline reliability and ease of troubleshooting in production environments. In this talk, we’ll briefly introduce the OpenLineage model and explore how this metadata is collected from Airflow, Spark, dbt, and Flink. We’ll demonstrate how to extract valuable insights and outline practical benefits and common challenges when building ingestion, processing, and storage for OpenLineage data. We will also briefly show how OpenLineage events can be used to observe data pipelines exhaustively, and the benefits that brings.
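
To make the event model concrete, here is a minimal sketch of emitting a run-completion event with the openlineage-python client; the backend URL, namespace, job name, and producer are placeholders, and module paths vary slightly between client versions.

    from datetime import datetime, timezone
    from uuid import uuid4

    from openlineage.client import OpenLineageClient
    from openlineage.client.run import Job, Run, RunEvent, RunState

    client = OpenLineageClient(url="http://localhost:5000")  # lineage backend

    client.emit(
        RunEvent(
            eventType=RunState.COMPLETE,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=Run(runId=str(uuid4())),
            job=Job(namespace="my_namespace", name="daily_orders_etl"),
            producer="https://example.com/my-lineage-producer",
        )
    )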