
Event

Airflow Summit 2025

2025-07-01 · Airflow Summit

Activities tracked

24

Airflow Summit 2025 program

Filtering by: Cloud Computing

Sessions & talks

Showing 1–24 of 24 · Newest first

Airflow as an AI Agent's toolkit: Going beyond MCPs and unlocking Airflow's 1000+ Integrations

2025-07-01
session

What if your Airflow tasks could understand natural language AND adapt to schema changes automatically, while maintaining the deterministic, observable workflows we rely on? This talk introduces practical patterns for AI-native orchestration that preserve Airflow’s strengths while adding intelligence where it matters most. Through a real-world example, we’ll demonstrate AI-powered tasks that detect schema drift across multi-cloud systems and perform context-aware data quality checks that go beyond simple validation—understanding business rules, detecting anomalies, and generating validation queries from prompts like “check data quality across regions.” All within static DAG structures you can test and debug normally. We’ll show how AI becomes a first-class citizen by combining Airflow’s features (assets for schema context, Human-in-the-Loop for approvals, and AssetWatchers for automated triggers) with engines such as Apache DataFusion for high-performance query execution and support for cross-cloud data processing with unified access to multiple storage formats. These patterns apply directly to schema validation and similar cases where natural language can simplify complex operations. This isn’t about bolting AI onto Airflow. It’s about evolving how we build workflows, from brittle rules to intelligent adaptation, while keeping everything testable, auditable, and production-ready.
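For flavor, here is a minimal sketch of the "prompt to validation query" pattern the abstract describes, inside a static, testable DAG. The generate_validation_sql helper is hypothetical (a stand-in for whatever model call the speakers use), and the table name is illustrative.

```python
# A static, testable DAG whose quality check is generated from a prompt.
# `generate_validation_sql` is a hypothetical stand-in for an LLM call.
from datetime import datetime

from airflow.decorators import dag, task


def generate_validation_sql(prompt: str) -> str:
    """Hypothetical helper: turn a natural-language request into SQL."""
    # A real implementation would call a model; here we return a canned query.
    return "SELECT region, COUNT(*) AS row_count FROM orders GROUP BY region"


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def ai_quality_checks():
    @task
    def build_check(prompt: str) -> str:
        return generate_validation_sql(prompt)

    @task
    def run_check(sql: str) -> None:
        # Execute via your warehouse hook of choice and assert on the result.
        print(f"Would execute: {sql}")

    run_check(build_check("check data quality across regions"))


ai_quality_checks()
```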

Allegro's Airflow Journey: From On-Prem to Cloud Orchestration at Scale

2025-07-01
session
Piotr Dziuba (Allegro) , Marek Gawiński (Allegro)

This session will detail the Apache Airflow journey of Allegro, a leading e-commerce company in Poland. It will chart our evolution from a custom, on-premises Airflow-as-a-Service solution through a significant expansion to over 300 Cloud Composer instances in Google Cloud, culminating in Airflow becoming the core of our data processing. We orchestrate over 64,000 regular tasks spanning over 6,000 active DAGs on more than 200 Airflow instances, from feeding business-supporting dashboards to managing main data marts, handling ML pipelines, and more. We will share our practical experiences, lessons learned, and the strategies employed to manage and scale this critical infrastructure. Furthermore, we will introduce our innovative economy-of-share approach for providing ready-to-use Airflow environments, significantly enhancing both user productivity and cost efficiency.

Boosting dbt-core workflows performance with Airflow’s Deferrable capabilities

2025-07-01
session

Efficiently handling long-running workflows is crucial for scaling modern data pipelines. Apache Airflow’s deferrable operators help offload tasks during idle periods — freeing worker slots while tracking progress. This session explores how Cosmos 1.9 ( https://github.com/astronomer/astronomer-cosmos ) integrates Airflow’s deferrable capabilities to enhance orchestrating dbt ( https://github.com/dbt-labs/dbt-core ) in production, with insights from recent contributions that introduced this functionality. Key takeaways: Deferrable Operators: How they work and why they’re ideal for long-running dbt tasks. Integrating with Cosmos: Refactoring and enhancements to enable deferrable behaviour across platforms. Performance Gains: Resource savings and task throughput improvements from deferrable execution. Challenges & Future Enhancements: Lessons learned, compatibility, and ideas for broader support. Whether orchestrating dbt models on a cloud warehouse or managing large-scale transformations, this session offers practical strategies to reduce resource contention and boost pipeline performance.
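For readers new to the mechanism, here is a minimal sketch of the deferral API that deferrable operators (including Cosmos's) build on; it is not Cosmos's actual implementation, and the operator and wait time are illustrative.

```python
# The core deferral mechanism: execute() hands a trigger to the triggerer
# and frees its worker slot; execute_complete() resumes on a worker once
# the trigger fires. A real dbt integration would use a job-status trigger.
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class WaitForWarehouseJob(BaseOperator):
    def execute(self, context):
        # Release the worker slot while the long-running job proceeds.
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(minutes=10)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Runs on a fresh worker only after the trigger has fired.
        self.log.info("Job finished; continuing the dbt workflow.")
```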

Building Airflow 3 setups resilient to zonal/regional down events, ready for Disaster Recovery event

2025-07-01
session

Want to be resilient to any zonal/regional down events when building Airflow in a cloud environment? Unforeseen disruptions in cloud infrastructure, whether isolated to specific zones or impacting entire regions, pose a tangible threat to the continuous operation of critical data workflows managed by Airflow. These outages, though often technical in nature, translate directly into real-world consequences, potentially causing interruptions in essential services, delays in crucial information delivery, and ultimately impacting the reliability and efficiency of various operational processes that businesses and individuals depend upon daily. The inability to process data reliably due to infrastructure instability can cascade into tangible setbacks across diverse sectors, highlighting the urgent need for resilient and robust Airflow deployments. Let’s dive deep into strategies for building truly resilient Airflow setups that can withstand zonal and even regional down events. We’ll explore architectural patterns like multi-availability zone deployments, cross-region failover mechanisms, and robust data replication techniques to minimise downtime and ensure business continuity. Discover practical tips and best practices for having a resilient Airflow infrastructure. By attending this presentation, you’ll gain the knowledge and tools necessary to significantly improve the reliability and stability of your critical data pipelines, ultimately saving time, resources, and preventing costly disruptions.

Common provider abstractions: Key for multi-cloud data handling

2025-07-01
session
Vikram Koka (Astronomer)

Enterprises want the flexibility to operate across multiple clouds, whether to optimize costs, improve resiliency, avoid vendor lock-in, or meet data sovereignty requirements. But for developers, that flexibility usually comes at the cost of extra complexity and redundant code. The goal here is simple: write once, run anywhere, with minimum boilerplate. In Apache Airflow, we’ve already begun tackling this problem with abstractions like Common-SQL, which lets you write database queries once and run them on 20+ databases, from Snowflake to Postgres to SQLite to SAP HANA. Similarly, Common-IO standardizes cloud blob storage interactions across all public clouds. With Airflow 3.0, we are pushing this further by introducing a Common Message Bus provider, an abstraction initially supporting Amazon SQS and expanding to Google Pub/Sub and Apache Kafka soon after. We expect additional implementations such as Amazon Kinesis and Managed Kafka over time. This talk will dive into why these abstractions matter, how they reduce friction for developers while giving enterprises true multi-cloud optionality, and what’s next for Airflow’s evolving provider ecosystem.
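A minimal sketch of the write-once idea with Common-SQL: the same operator and statement run against different backends by swapping the connection id. Connection ids and the table are illustrative.

```python
# One operator, one statement, two backends: only conn_id changes.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG("common_sql_demo", start_date=datetime(2025, 1, 1),
         schedule=None, catchup=False):
    SQLExecuteQueryOperator(
        task_id="prune_snowflake",
        conn_id="snowflake_default",   # Snowflake connection (illustrative)
        sql="DELETE FROM staging.events WHERE ds < CURRENT_DATE - 30",
    )
    SQLExecuteQueryOperator(
        task_id="prune_postgres",
        conn_id="postgres_default",    # same statement, different backend
        sql="DELETE FROM staging.events WHERE ds < CURRENT_DATE - 30",
    )
```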

Data Quality and Observability with Airflow

2025-07-01
session

Tekmetric is the largest cloud-based auto shop management system in the United States. We process vast amounts of data from various integrations with internal and external systems. Data quality and governance are crucial for both our internal operations and the success of our customers. We leverage multi-step data processing pipelines using AWS services and Airflow. While we utilize traditional data pipeline workflows to manage and move data, we go beyond standard orchestration. After data is processed, we apply tailored quality checks for schema validation, record completeness, freshness, duplication, and more. In this talk, we’ll explore how Airflow allows us to enhance data observability. We’ll discuss how Airflow’s flexibility enables seamless integration and monitoring across different teams and datasets, ensuring reliable and accurate data at every stage. This session will highlight how Tekmetric uses data quality governance and observability practices to drive business success through trusted data.
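As an illustration of such post-processing gates, here is a minimal sketch using the Common-SQL check operators; the connection id, table, and check thresholds are illustrative, not Tekmetric's.

```python
# Post-processing quality gates: completeness, row count, and duplicates.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import (
    SQLColumnCheckOperator,
    SQLTableCheckOperator,
)

with DAG("dq_gates", start_date=datetime(2025, 1, 1),
         schedule="@daily", catchup=False):
    SQLColumnCheckOperator(
        task_id="completeness",
        conn_id="warehouse",
        table="repair_orders",
        column_mapping={"shop_id": {"null_check": {"equal_to": 0}}},
    )
    SQLTableCheckOperator(
        task_id="row_count_and_duplicates",
        conn_id="warehouse",
        table="repair_orders",
        checks={
            "row_count_check": {"check_statement": "COUNT(*) > 0"},
            "no_duplicates": {
                "check_statement": "COUNT(*) = COUNT(DISTINCT order_id)"
            },
        },
    )
```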

Dynamic DAGs and Data Quality using DAGFactory

2025-07-01
session

We run a similar pattern of DAGs for different data quality dimensions like accuracy, timeliness, and completeness. Building these by hand again and again would mean duplicating code, and copy-pasting (or making people write the same code again) invites human error. To solve this, we are doing a few things: running DAGs via DAG Factory, which dynamically generates DAGs from a small amount of YAML describing all the steps in our DQ checks; and hiding this behind a UI hooked into a GitHub PR-creation step, so the user just provides some inputs or selects from dropdowns and a YAML DAG is generated for them. This highlights the potential for DAG Factory to hide Airflow Python code from users, making it accessible to data analysts and business intelligence teams as well as software engineers, while reducing human error. YAML is the perfect format for generating code and creating a PR, and DAG Factory is the perfect fit for that. All of this runs in GCP Cloud Composer.
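A minimal sketch of the pattern, assuming dag-factory's documented loader API; the YAML content, file path, and task command are illustrative.

```python
# A loader script turns a YAML file (written by the UI's PR automation)
# into registered DAGs. The YAML shape follows dag-factory's format.
#
# dq_checks.yml might contain:
#   dq_accuracy_orders:
#     default_args:
#       owner: dq-platform
#       start_date: 2025-01-01
#     schedule_interval: "@daily"
#     tasks:
#       run_accuracy_check:
#         operator: airflow.operators.bash.BashOperator
#         bash_command: "dq-runner --dimension accuracy --table orders"
import dagfactory

dag_factory = dagfactory.DagFactory("/opt/airflow/dags/dq_checks.yml")
dag_factory.clean_dags(globals())     # drop DAGs removed from the YAML
dag_factory.generate_dags(globals())  # register one DAG per YAML entry
```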

EdgeExecutor / Edge Worker - The new option to run anywhere

2025-07-01
session

Airflow 3 extends the deployment options to run your workload anywhere. You don’t need to bring your data to Airflow; you can bring the execution to where it needs to be. You can connect any cloud and on-prem location together and build a hybrid workflow from one central Airflow instance; only an HTTP connection is needed. We will present the use cases and concepts of the Edge deployment and how it works in a hybrid setup alongside Celery or other executors.
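A minimal sketch of how work could be routed to a remote location under this model, assuming an EdgeExecutor deployment where an edge worker subscribes to the named queue; the queue name and command are illustrative.

```python
# Routing a task to an on-prem edge worker via its queue (Airflow 3 paths).
from datetime import datetime

from airflow import DAG
from airflow.providers.standard.operators.bash import BashOperator

with DAG("edge_demo", start_date=datetime(2025, 1, 1),
         schedule=None, catchup=False):
    BashOperator(
        task_id="extract_on_prem",
        queue="factory-site",  # only the edge worker subscribed here runs it
        bash_command="python /opt/jobs/extract.py",
    )
```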

ELT and Elections: Cloud-agnostic patterns for real-time analysis

2025-07-01
session

Discover how Apache Airflow powers scalable ELT pipelines, enabling seamless data ingestion, transformation, and machine learning-driven insights. This session will walk through: Automating Data Ingestion: Using Airflow to orchestrate raw data ingestion from third-party sources into your data lake (S3, GCP), ensuring a steady pipeline of high-quality training and prediction data. Optimizing Transformations with Serverless Computing: Offloading intensive transformations to serverless functions (GCP Cloud Run, AWS Lambda) and machine learning models (BigQuery ML, Sagemaker), integrating their outputs seamlessly into Airflow workflows. Real-World Impact: A case study on how INTRVL leveraged Airflow, BigQuery ML, and Cloud Run to analyze early voting data in near real-time, generating actionable insights on voter behavior across swing states. This talk not only provides a deep dive into the Political Tech space but also serves as a reference architecture for building robust, repeatable ELT pipelines. Attendees will gain insights into modern serverless technologies from AWS and GCP that enhance Airflow’s capabilities, helping data engineers design scalable, cloud-agnostic workflows.
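As a sketch of the serverless offload pattern described, the following assumes the Amazon provider's Lambda invoke operator; the function name and payload are illustrative.

```python
# Offloading a transformation to a serverless function from Airflow.
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator,
)

with DAG("elt_offload", start_date=datetime(2025, 1, 1),
         schedule="@hourly", catchup=False):
    LambdaInvokeFunctionOperator(
        task_id="transform_batch",
        function_name="transform-early-vote-batch",  # illustrative name
        payload=json.dumps({"date": "{{ ds }}"}),
    )
```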

Enterprise Auditing: "The Verifiable Data Pipeline"

2025-07-01
session

This session will dive deep into leveraging the robust logging and audit capabilities of Google Cloud Platform, Cloud Composer and Apache Airflow to establish a fully transparent and verifiable data orchestration layer. We’ll demonstrate how to track and attribute every change—from environment configuration to individual task execution—essential for meeting stringent enterprise governance, compliance, and auditing requirements.

Event-Driven Airflow 3.0: Real-Time Orchestration with Pub/Sub

2025-07-01
session

Traditional time-based scheduling in Airflow can lead to inefficiencies and delays. With Airflow 3.0, we can now leverage native event-driven DAG execution, enabling workflows to trigger instantly when data arrives—eliminating polling-based sensors and rigid schedules. This talk explores real-time orchestration using Airflow 3.0 and Google Cloud Pub/Sub. We’ll showcase how to build an event-driven pipeline where DAGs automatically trigger as new data lands, ensuring faster and more efficient processing. Through a live demo, we’ll demonstrate how Airflow listens to Pub/Sub messages and dynamically triggers dbt transformations only when fresh data is available. This approach improves scalability, reduces costs, and enhances orchestration efficiency. Key takeaways: how event-driven DAGs work vs. traditional scheduling, best practices for integrating Airflow with Pub/Sub, eliminating polling-based sensors for efficiency, and a live demo of an event-driven pipeline with Airflow 3.0, Pub/Sub & dbt. This session will showcase how Airflow 3.0 enables truly real-time orchestration.
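A minimal sketch of the event-driven pattern, based on Airflow 3's AssetWatcher and the common messaging trigger; the Pub/Sub queue URI is hypothetical, and the exact trigger class for Pub/Sub depends on provider support.

```python
# An asset watched by a message-queue trigger: the DAG runs whenever a
# message lands, instead of on a clock or a polling sensor.
from datetime import datetime

from airflow.providers.common.messaging.triggers.msg_queue import (
    MessageQueueTrigger,
)
from airflow.sdk import Asset, AssetWatcher, dag, task

# Hypothetical Pub/Sub subscription URI; the watcher fires on each message.
trigger = MessageQueueTrigger(queue="pubsub://my-project/raw-data-sub")
raw_data = Asset(
    "raw_data",
    watchers=[AssetWatcher(name="pubsub_watcher", trigger=trigger)],
)


@dag(schedule=[raw_data], start_date=datetime(2025, 1, 1), catchup=False)
def event_driven_dbt():
    @task
    def run_dbt_models():
        # Fresh data has landed; kick off only the affected dbt models.
        print("dbt run --select fresh_models")

    run_dbt_models()


event_driven_dbt()
```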

From Complexity to Simplicity with TaskHarbor: Trendyol's Path to a Unified Orchestration Platform

2025-07-01
session

At Trendyol, Turkey’s leading e-commerce company, Apache Airflow powers our task orchestration, handling DAGs with 500+ tasks, complex interdependencies, and diverse environments. Managing on-prem Airflow instances posed challenges in scalability, maintenance, and deployment. To address these, we built TaskHarbor, a fully managed orchestration platform with a hybrid architecture—combining Airflow on GKE with on-prem resources for optimal performance and efficiency. This talk covers how we: Enabled seamless DAG synchronization across environments using GCS Fuse. Optimized workload distribution via GCP’s HTTPS & TCP Load Balancers. Automated infrastructure provisioning (GKE, CloudSQL, Kubernetes) using Terraform. Simplified Airflow deployments by replacing Helm YAML files with a custom templating tool, reducing configurations to 10-15 lines. Built a fully automated deployment pipeline, ensuring zero developer intervention. We enhanced efficiency, reliability, and automation in hybrid orchestration by embracing a scalable, maintainable, and cloud-native strategy. Attendees will obtain practical insights into architecting Airflow at scale and optimizing deployments.

From Legacy to Leading Edge: How Airflow Migration Unlocked Cross-Team Business Value

2025-07-01
session

At TrueCar, migrating hundreds of legacy workflows from in-house orchestration tools to Apache Airflow required key technical decisions that transformed our data platform architecture and organizational capabilities. We consolidated individual chained tasks into optimized DAGs leveraging native Airflow functionality to trigger compute across cloud environments. A crucial breakthrough was developing DAG generators to scale migration—essential for efficiently migrating hundreds of workflows while maintaining consistency. By decoupling orchestration from compute, we gained flexibility to select optimal tools for specific outcomes—programmatic processing, analytics, batch jobs, or AI/ML pipelines. This resulted in cost reductions, performance improvements, and team agility. We also gained unprecedented visibility into DAG performance and dependency patterns previously invisible across fragmented systems. Attendees will learn how we redesigned complex workflows into efficient DAGs using dynamic task generation, architectural decisions that enabled platform innovation and the decision framework that made our migration transformational.
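A minimal sketch of the DAG-generator idea: one template plus a list of workflow configs yields many consistent DAGs. The configs and job runner are illustrative, not TrueCar's.

```python
# Generating consistent DAGs from config: one loop, one template.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

WORKFLOWS = [
    {"name": "daily_sales", "schedule": "@daily", "job": "sales_rollup"},
    {"name": "hourly_leads", "schedule": "@hourly", "job": "lead_scoring"},
]

for cfg in WORKFLOWS:
    with DAG(
        dag_id=f"migrated_{cfg['name']}",
        schedule=cfg["schedule"],
        start_date=datetime(2025, 1, 1),
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="run_job",
            bash_command=f"runner --job {cfg['job']} --ds {{{{ ds }}}}",
        )
    # Register under a unique module-level name so Airflow discovers it.
    globals()[dag.dag_id] = dag
```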

From Oops to Secure Ops: Self-Hosted AI for Airflow Failure Diagnosis

2025-07-01
session

Last year, ‘From Oops to Ops’ showed how AI-powered failure analysis could help diagnose why Airflow tasks fail. But do we really need large, expensive cloud-based AI models to answer simple diagnostic questions? Relying on external AI APIs introduces privacy risks, unpredictable costs, and latency, often without clear benefits for this use case. With the rise of distilled, open-source models, self-hosted failure analysis is now a practical alternative. This talk will explore how to deploy an AI service on infrastructure you control, compare cost, speed, and accuracy between OpenAI’s API and self-hosted models, and showcase a live demo of AI-powered task failure diagnosis using DeepSeek and Llama—running without external dependencies to keep data private and costs predictable.
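A minimal sketch of the self-hosted approach, assuming an Ollama-style endpoint on internal infrastructure; the URL, model name, and payload shape are assumptions, not the speakers' setup.

```python
# An on_failure_callback that asks a self-hosted model to diagnose the
# failure. Endpoint, model, and payload are hypothetical (Ollama-style).
import requests


def diagnose_failure(context):
    ti = context["task_instance"]
    prompt = (
        f"Airflow task {ti.task_id} in DAG {ti.dag_id} failed on "
        f"{context.get('ds')}. Exception: {context.get('exception')}. "
        "Suggest likely causes."
    )
    resp = requests.post(
        "http://llm.internal:11434/api/generate",  # self-hosted endpoint
        json={"model": "deepseek-r1:8b", "prompt": prompt, "stream": False},
        timeout=60,
    )
    ti.log.info("AI diagnosis: %s", resp.json().get("response"))


# Attach to any operator, or DAG-wide via:
# default_args = {"on_failure_callback": diagnose_failure}
```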

GitHub's Airflow Journey: Lessons, Mistakes, and Insights

2025-07-01
session

This session explores how GitHub uses Apache Airflow for efficient data engineering. We will share nearly 9 years of experiences, including lessons learnt, mistakes made, and the ways we reduced our on-call and engineering burden. We’ll demonstrate how we keep data flowing smoothly while continuously evolving Airflow and other components of our data platform, ensuring safety and reliability. The session will touch on how we migrate Airflow between clouds without user impact. We’ll also cover how we cut down the time from idea to running a DAG in production, despite our Airflow repo being among the top 15 by number of PRs within GitHub. We’ll dive into specific techniques such as testing connections and operators, relying on dag-sync, providing short-lived development environments to let developers test their DAG runs, and creating reusable patterns for DAGs. By the end of this session, you will gain practical insights and actionable strategies to improve your own data engineering processes.
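One of the techniques mentioned (testing DAGs before they ship) can be as simple as a CI test that fails a PR when any DAG stops importing cleanly; this sketch assumes a pytest setup and an illustrative dags/ path.

```python
# CI guardrail: fail the build if any DAG no longer imports cleanly.
from airflow.models import DagBag


def test_dags_import_cleanly():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not dag_bag.import_errors, (
        f"DAG import failures: {dag_bag.import_errors}"
    )
```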

Learn from Deutsche Bank: Using Apache Airflow in Regulated Environments

2025-07-01
session
Christian Foernges (Deutsche Bank)

Operating within the stringent regulatory landscape of Corporate Banking, Deutsche Bank relies heavily on robust data orchestration. This session explores how Deutsche Bank’s Corporate Bank leverages Apache Airflow across diverse environments, including both on-premises infrastructure and cloud platforms. Discover their approach to managing critical data & analytics workflows, encompassing areas like regulatory reporting, data integration and complex data processing pipelines. Gain insights into the architectural patterns and operational best practices employed to ensure compliance, security, and scalability when running Airflow at scale in a highly regulated, hybrid setting.

Lessons learned for scaling up Airflow 3 in Public Cloud

2025-07-01
session

Apache Airflow 3 is the new state-of-the-art version of Airflow. For many users planning to adopt Airflow 3, it is important to understand how it behaves from a performance perspective compared to Airflow 2. This presentation shares performance results for various Airflow 3 configurations, giving potential adopters a good understanding of its performance. The reference Airflow 3 configuration uses a Kubernetes cluster as the compute layer and PostgreSQL as the Airflow database, and runs on Google Cloud Platform. Performance tests are performed using the community version of the performance-test framework, and there may be references to Cloud Composer (a managed service for Apache Airflow). The tests are done in production-grade configurations that can serve as good references for Airflow community users. Users will be provided with a comparison of Airflow 3 and Airflow 2 from a performance standpoint, and will learn how to optimize Airflow scheduler performance by understanding DAG file processing and task scheduling, and by configuring the scheduler to run tens of thousands of DAGs/tasks in Airflow 3.

Modernizing Automation in Secure, Regulated Environments: Lessons from Deploying Airflow

2025-07-01
session

This session details practical strategies for introducing Apache Airflow in strict, compliance-heavy organizations. Learn how on-premises deployment and hybrid tooling can help modernize legacy workflows when public cloud solutions and container technologies are restricted. Discover how cross-platform engineering teams can collaborate securely using CI/CD bridges, and what it takes to meet rigorous security and governance standards. Key lessons address navigating resistance to change, achieving production sign-off, and avoiding common compliance pitfalls, relevant to anyone automating in public sector settings.

Orchestrating Travel Insights: Priceline's MLOps with Airflow

2025-07-01
session

The journey from ML model development to production deployment and monitoring is often complex and fragmented. How can teams overcome the chaos of disparate tools and processes? This session dives into how Apache Airflow serves as a unifying force in MLOps. We’ll begin with a look at the broader MLOps trends observed by Google within the Airflow community, highlighting how Airflow is evolving to meet these challenges and showcasing diverse MLOps use cases – both current and future. Then, Priceline will present a deep-dive case study on their MLOps transformation. Learn how they leveraged Cloud Composer, Google Cloud’s managed Apache Airflow service, to orchestrate their entire ML pipeline end-to-end: ETL, data preprocessing, model building & training, Dockerization, Google Artifact Registry integration, deployment, model serving, and evaluation. Discover how using Cloud Composer on GCP enabled them to build a scalable, reliable, adaptable, and maintainable MLOps practice, moving decisively from chaos to coordination. Cloud Composer (Airflow) has served as a major backbone in transforming the whole ML experience at Priceline. Join us to learn how to harness Airflow, particularly within a managed environment like Cloud Composer, for robust MLOps workflows, drawing lessons from both industry trends and a concrete, successful implementation.
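A minimal sketch of the end-to-end shape described (ETL, preprocessing, training, deployment, evaluation) using the TaskFlow API; the stage bodies are stubs, since Priceline's actual tasks involve Dockerization and Artifact Registry integration.

```python
# The end-to-end ML pipeline shape as a single orchestrated DAG.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@weekly", start_date=datetime(2025, 1, 1), catchup=False)
def ml_pipeline():
    @task
    def extract_and_preprocess() -> str:
        return "gs://bucket/features/latest"   # illustrative feature path

    @task
    def train(features_uri: str) -> str:
        return "model-v42"                     # illustrative model id

    @task
    def deploy_and_evaluate(model_id: str) -> None:
        print(f"Serving and evaluating {model_id}")

    deploy_and_evaluate(train(extract_and_preprocess()))


ml_pipeline()
```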

Orchestrator of Orchestrators: Uniting Airflow Pipelines with Business Applications in Production

2025-07-01
session

Airflow powers thousands of data and ML pipelines—but in the enterprise, these pipelines often need to interact with business-critical systems like ERPs, CRMs, and core banking platforms. In this demo-driven session, we will connect Airflow with Control-M from BMC and showcase how Airflow can participate in end-to-end workflows that span not just data platforms but also transactional business applications. Session highlights: trigger Airflow DAGs based on business events (e.g., invoice approvals, trade settlements); feed Airflow pipeline outputs into ERP systems (e.g., SAP) or CRMs (e.g., Salesforce); orchestrate multi-platform workflows from cloud to mainframe with SLA enforcement, dependency management, and centralized control; and provide unified monitoring and auditing across data and application layers.

Scaling ML Infrastructure: Lessons from Building Distributed Systems

2025-07-01
session

In today’s data-driven world, scalable ML infrastructure is mission-critical. As ML workloads grow, orchestration tools like Apache Airflow become essential for managing pipelines, training, deployment, and observability. In this talk, I’ll share lessons from building distributed ML systems across cloud platforms, including GPU-based training and AI-powered healthcare. We’ll cover patterns for scaling Airflow DAGs, integrating telemetry and auto-healing, and aligning cross-functional teams. Whether you’re launching your first pipeline or managing ML at scale, you’ll gain practical strategies to make Airflow the backbone of your ML infrastructure.