
Topic: Apache Airflow
Tags: workflow_management, data_orchestration, etl
139 tagged activities

Activity Trend: peak of 157 activities/quarter (2020-Q1 to 2026-Q1)

Activities

Showing filtered results (filtering by: Airflow Summit 2025)

Enterprises want the flexibility to operate across multiple clouds, whether to optimize costs, improve resiliency, avoid vendor lock-in, or meet data sovereignty requirements. But for developers, that flexibility usually comes at the cost of extra complexity and redundant code. The goal here is simple: write once, run anywhere, with minimal boilerplate. In Apache Airflow, we’ve already begun tackling this problem with abstractions like Common-SQL, which lets you write database queries once and run them on 20+ databases, from Snowflake to Postgres to SQLite to SAP HANA. Similarly, Common-IO standardizes cloud blob storage interactions across all public clouds. With Airflow 3.0, we are pushing this further by introducing a Common Message Bus provider, an abstraction initially supporting Amazon SQS and expanding to Google Pub/Sub and Apache Kafka soon after. We expect additional implementations such as Amazon Kinesis and Managed Kafka over time. This talk will dive into why these abstractions matter, how they reduce friction for developers while giving enterprises true multi-cloud optionality, and what’s next for Airflow’s evolving provider ecosystem.
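As a flavor of the write-once idea, here is a minimal sketch using the Common-SQL provider's generic SQLExecuteQueryOperator; the connection IDs and the query are illustrative assumptions, not from the talk.

```python
# Minimal sketch: the same query runs on different databases by swapping
# conn_id. Connection names and the query are assumptions for illustration.
import pendulum
from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="common_sql_demo",
    schedule=None,
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
) as dag:
    daily_rollup = "SELECT order_date, SUM(amount) FROM orders GROUP BY order_date"

    # Identical SQL, different backends: only conn_id changes.
    on_postgres = SQLExecuteQueryOperator(
        task_id="rollup_postgres", conn_id="postgres_default", sql=daily_rollup
    )
    on_snowflake = SQLExecuteQueryOperator(
        task_id="rollup_snowflake", conn_id="snowflake_default", sql=daily_rollup
    )
```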

Duolingo has built an internal tool, DuoFactory, to orchestrate AI-generated content using Airflow. The tool has been used to generate example sentences per lesson, math exercises, and Duoradio lessons. The ecosystem is flexible enough for a variety of company needs. Some of these use cases involve end-to-end generation, where a single button click generates content in the app. We have also created a Workflow Builder to orchestrate and iterate on generative AI workflows by creating one-time DAG instances, with a UI easy enough for non-engineers to use.

Custom operators are the secret weapon for solving Airflow’s unique and challenging orchestration problems. This session will cover:
- When to build custom operators vs. using existing solutions
- Architecture patterns for creating maintainable, reusable operators
- Live coding demonstration: building a custom operator from scratch
- Real-world examples: how custom operators solve specific business challenges
Through practical code examples and architecture patterns, attendees will walk away with the knowledge to implement custom operators that enhance their Airflow deployments. This session is ideal for experienced Airflow users looking to extend functionality beyond out-of-the-box solutions.
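As a minimal sketch of the pattern (not the session's actual demo), a custom operator subclasses BaseOperator and puts its business logic in execute(); the operator name and fields below are hypothetical.

```python
# Hypothetical custom operator: name, fields, and logic are illustrative.
from airflow.models.baseoperator import BaseOperator


class S3ToWebhookOperator(BaseOperator):
    # Templated fields let DAG authors pass Jinja expressions like "{{ ds }}".
    template_fields = ("s3_key",)

    def __init__(self, *, s3_key: str, endpoint: str, **kwargs):
        super().__init__(**kwargs)
        self.s3_key = s3_key
        self.endpoint = endpoint

    def execute(self, context):
        # Real logic would use an Airflow hook (e.g. S3Hook) so that
        # credentials stay in Connections rather than in DAG code.
        self.log.info("Posting %s to %s", self.s3_key, self.endpoint)
        return {"key": self.s3_key, "status": "sent"}
```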

Maintaining consistency, code quality, and best practices for writing Airflow DAGs across teams and individual developers can be a significant challenge. Trying to achieve it through manual code reviews is both time-consuming and error-prone. To solve this at Next, we decided to build a custom, internally developed linting tool for Airflow DAGs to help us evaluate their quality and uniformity; we call it DAGLint. In this talk I am going to share why we chose to implement it, how we built it, and how we use it to elevate our code quality and standards throughout the entire data engineering group. This tool supports our day-to-day development process, provides us with a visual analysis of the state of our entire code base, and allows our code reviews to focus on other code quality aspects. We can now easily identify deviations from our defined standards, promote consistency throughout our DAGs repository, and extend the tool with additional new standards introduced to our group. The talk will cover how you can implement a similar solution in your own organization. We have also published a blog post on it: https://medium.com/apache-airflow/mastering-airflow-dag-standardization-with-pythons-ast-a-deep-dive-into-linting-at-scale-1396771a9b90
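The blog post's title points at Python's ast module; here is a minimal sketch of that style of check, with the enforced rules (required DAG kwargs) being assumptions rather than Next's actual standards.

```python
# Minimal AST-based DAG lint sketch; the required kwargs are assumptions.
import ast
import sys

REQUIRED_DAG_KWARGS = {"tags", "default_args"}


def lint_dag_file(path: str) -> list[str]:
    """Flag DAG(...) calls that are missing required keyword arguments."""
    with open(path) as f:
        tree = ast.parse(f.read(), filename=path)
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and getattr(node.func, "id", None) == "DAG":
            present = {kw.arg for kw in node.keywords}
            for missing in sorted(REQUIRED_DAG_KWARGS - present):
                problems.append(f"{path}:{node.lineno}: DAG missing '{missing}'")
    return problems


if __name__ == "__main__":
    for issue in (p for f in sys.argv[1:] for p in lint_dag_file(f)):
        print(issue)
```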

DAGnostics seamlessly integrates Airflow Cluster Policy hooks to enforce governance from local DAG authoring through CI pipelines to production runtime. Learn how it closes validation gaps, collapses feedback loops from hours to seconds, and ensures consistent policies across stages. We examine current runtime-only enforcement and fractured CI checks, then unveil our architecture: a pluggable policy registry via Airflow entry points, local static analysis for pre-commit validation, GitHub Actions CI integration, and runtime hook enforcement. See real-world use cases: alerting standards, resource quotas, naming conventions, and exemption handling. Next, dive into implementation: authoring policies in Python, auto-discovery, cross-environment enforcement, upstream contribution, and testing strategies. We share LinkedIn’s metrics (2,000+ DAG repos and 10,000+ daily executions supporting trunk-based development across isolated teams and use cases, with 78% fewer runtime violations) and lessons learned scaling policy-as-code across the enterprise. Leave with a blueprint to adopt DAGnostics and strengthen your Airflow governance while preserving full compatibility with existing systems.
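For context, Airflow's built-in cluster policy hooks look like the sketch below (placed in airflow_local_settings.py); the specific rules are illustrative assumptions, not LinkedIn's actual policies.

```python
# airflow_local_settings.py -- sketch of Airflow cluster policy hooks.
# The rules themselves are illustrative assumptions.
from airflow.exceptions import AirflowClusterPolicyViolation


def dag_policy(dag):
    """Called for each DAG at parse time; reject DAGs that break house rules."""
    if not dag.tags:
        raise AirflowClusterPolicyViolation(
            f"DAG {dag.dag_id} must declare at least one tag"
        )


def task_policy(task):
    """Called per task; policies can also mutate defaults instead of rejecting."""
    if task.retries < 1:
        task.retries = 1
```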

As the adoption of Airflow increases within large enterprises to orchestrate their data pipelines, more than one team needs to create, manage, and run their workflows in isolation. With multi-tenancy not yet supported natively in Airflow, customers are adopting alternate ways to enable multiple teams to share infrastructure. In this session, we will explore how GoDaddy uses MWAA to build a single-pane Airflow setup for multiple teams with a common observability platform, and how this foundation enables orchestration expansion beyond data workflows to AI workflows as well. We’ll discuss our roadmap for leveraging upcoming Airflow 3 features, including the task execution API for enhanced workflow management and DAG versioning capabilities for comprehensive auditing and governance. This session will help attendees gain insights into the use case, the solution architecture, implementation challenges and benefits, and our strategic vision for unified orchestration across data and AI workloads. Outline:
- About GoDaddy
- GoDaddy Data & AI Orchestration Vision
- Current State & Airflow Usage
- Airflow Monitoring & Observability
- Lessons Learned & Best Practices
- Airflow 3 Adoption

Tekmetric is the largest cloud-based auto shop management system in the United States. We process vast amounts of data from various integrations with internal and external systems. Data quality and governance are crucial for both our internal operations and the success of our customers. We leverage multi-step data processing pipelines using AWS services and Airflow. While we utilize traditional data pipeline workflows to manage and move data, we go beyond standard orchestration. After data is processed, we apply tailored quality checks for schema validation, record completeness, freshness, duplication, and more. In this talk, we’ll explore how Airflow allows us to enhance data observability. We’ll discuss how Airflow’s flexibility enables seamless integration and monitoring across different teams and datasets, ensuring reliable and accurate data at every stage. This session will highlight how Tekmetric uses data quality governance and observability practices to drive business success through trusted data.
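Checks like these can be expressed with the Common-SQL provider's check operators; the sketch below is a generic example, with the connection, table, and thresholds assumed rather than taken from Tekmetric's pipelines.

```python
# Generic data quality checks; conn_id, table, and rules are assumptions.
import pendulum
from airflow import DAG
from airflow.providers.common.sql.operators.sql import (
    SQLColumnCheckOperator,
    SQLTableCheckOperator,
)

with DAG(
    dag_id="dq_checks_demo",
    schedule="@daily",
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    completeness = SQLColumnCheckOperator(
        task_id="column_checks",
        conn_id="warehouse",
        table="repair_orders",
        column_mapping={
            "order_id": {"null_check": {"equal_to": 0}},  # no missing keys
            "total_amount": {"min": {"geq_to": 0}},       # no negative totals
        },
    )
    row_presence = SQLTableCheckOperator(
        task_id="table_checks",
        conn_id="warehouse",
        table="repair_orders",
        checks={"row_count_check": {"check_statement": "COUNT(*) > 0"}},
    )
```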

Do you have a DAG that needs to be done by a certain time? Have you tried to use Airflow 2’s SLA feature and found it restrictive or complicated? You aren’t alone! Come learn about the all-new Deadline Alerts feature in Airflow 3.1, which replaces SLAs. We will discuss how Deadline Alerts work and how they improve on the retired SLA feature. Then we will look at some examples of workflows you can build with the new feature, including some of the callback options and how they work, and finally we will look ahead to some future use cases for Deadlines on Tasks and even Assets.

At SAP Business AI, we’ve transformed Retrieval-Augmented Generation (RAG) pipelines into enterprise-grade powerhouses using Apache Airflow. Our Generative AI Foundations Team developed a cutting-edge system that effectively grounds Large Language Models (LLMs) with rich SAP enterprise data. Powering Joule for Consultants, our innovative AI copilot, this pipeline manages the seamless ingestion, sophisticated metadata enrichment, and efficient lifecycle management of over a million structured and unstructured documents. By leveraging Airflow’s Dynamic DAGs, TaskFlow API, XCom, and Kubernetes Event-Driven Autoscaling (KEDA), we achieved unprecedented scalability and flexibility. Join our session to discover actionable insights, innovative scaling strategies, and a forward-looking vision for Pipeline-as-a-Service, empowering seamless integration of customer-generated content into scalable AI workflows.
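To illustrate the Airflow features named above (TaskFlow, XCom, and the dynamic fan-out behind Dynamic DAG patterns), here is a minimal ingestion sketch; the task names and document flow are assumptions, not SAP's pipeline.

```python
# Minimal TaskFlow sketch with dynamic task mapping; names are assumptions.
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2025, 1, 1, tz="UTC"))
def rag_ingest_demo():
    @task
    def list_documents() -> list[str]:
        # A real pipeline would page through a document store here.
        return ["doc_a.pdf", "doc_b.pdf"]

    @task
    def enrich(doc: str) -> dict:
        # Metadata enrichment; return values travel downstream via XCom.
        return {"doc": doc, "lang": "en"}

    @task
    def index(meta: dict) -> None:
        print(f"indexing {meta['doc']}")

    # expand() fans out one enrich/index task instance per document.
    index.expand(meta=enrich.expand(doc=list_documents()))


rag_ingest_demo()
```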

Airflow is wonderfully, frustratingly complex - and so is global finance! Stripe has very specific needs all over the planet, and we have customized Airflow to adapt to the variety and rigor that we need to grow the GDP of the internet. In this talk, you’ll learn:
- How we support independent DAG change management for over 500 different teams running over 150k tasks.
- How we’ve customized Airflow’s Kubernetes integration to comply with Stripe’s unique compliance requirements.
- How we’ve built on Airflow to support no-code data pipelines.

In this talk, I’ll walk through how we built an end-to-end analytics pipeline using open-source tools (Airbyte, dbt, Airflow, and Metabase). At WirePick, we extract data from multiple sources using Airbyte OSS into PostgreSQL, transform it into business-specific data marts with dbt, and automate the entire workflow using Airflow. Our Metabase dashboards provide real-time insights, and we integrate Slack notifications to alert stakeholders when key business metrics change. This session will cover:
- Data extraction: using Airbyte OSS to pull data from multiple sources
- Transformation & modeling: how dbt helps create reusable data marts
- Automation & orchestration: managing the workflow with Airflow
- Data-driven decision-making: delivering insights through Metabase & Slack alerts
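A condensed sketch of the orchestration layer for such a stack might look like the following; the Airbyte connection UUID, dbt project path, and schedule are placeholders, not WirePick's actual configuration.

```python
# Sketch: trigger an Airbyte sync, then run dbt. IDs and paths are placeholders.
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="analytics_pipeline_demo",
    schedule="@daily",
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    extract = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",
        connection_id="00000000-0000-0000-0000-000000000000",  # Airbyte connection UUID
    )
    transform = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir /opt/dbt",
    )
    extract >> transform
```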

We have a similar pattern of DAGs running for different data quality dimensions like accuracy, timeliness, and completeness. Building these again and again would mean duplicating code and potentially introducing human error through copy-pasting, or making people write the same code repeatedly. To solve this, we are doing a few things:
- Running DAGs via DagFactory to dynamically generate DAGs from a bit of YAML for all the steps we want to run in our DQ checks.
- Hiding this behind a UI hooked into a GitHub PR-creation step: the user just provides some inputs or selects from dropdowns in the UI, and a YAML DAG is generated for them.
This highlights the potential of DagFactory to hide Airflow Python code from users and make it accessible to data analysts and business intelligence teams as well as software engineers, while also reducing human error. YAML is the perfect format for generating code and opening a PR, and DagFactory is the perfect fit for that. All of this runs in GCP Cloud Composer.
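For reference, the open-source dag-factory library is typically wired up with a small loader file like the sketch below; the YAML file name and its contents are assumptions.

```python
# Sketch of a dag-factory loader; file name and YAML contents are assumptions.
# dq_checks.yml might look like (shape only):
#   dq_accuracy:
#     default_args: {owner: "dq-team", start_date: 2025-01-01}
#     tasks:
#       run_check:
#         operator: airflow.operators.bash.BashOperator
#         bash_command: "python run_check.py --dimension accuracy"
from pathlib import Path

import dagfactory

config_file = Path(__file__).parent / "dq_checks.yml"
factory = dagfactory.DagFactory(str(config_file))
factory.clean_dags(globals())     # remove DAGs dropped from the YAML
factory.generate_dags(globals())  # register the YAML-defined DAGs
```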

This session showcases Okta’s innovative approach to data pipeline orchestration with dbt and Airflow. We’ll show how we’ve implemented dynamically generated Airflow DAGs based on dbt’s dependency graph. This allows us to enforce strict data quality standards by automatically executing downstream model tests before upstream model deployments, effectively preventing error cascades. The entire CI/CD pipeline, from dbt model changes to production DAG deployment, is fully automated. The result? Accelerated development cycles, reduced operational overhead, and bulletproof data reliability.
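The core of that pattern is reading dbt's manifest.json and mirroring its dependency edges as Airflow task dependencies. The sketch below shows a simple run-plus-test variant under assumed paths; Okta's actual generator (which runs downstream tests before upstream deployments) is more involved.

```python
# Sketch: build one Airflow task per dbt model from manifest.json and mirror
# dbt's dependency edges. Paths and scheduling are assumptions.
import json

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_graph_demo",
    schedule="@daily",
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    with open("/opt/dbt/target/manifest.json") as f:
        manifest = json.load(f)

    models = {
        node_id: node
        for node_id, node in manifest["nodes"].items()
        if node["resource_type"] == "model"
    }
    tasks = {
        node_id: BashOperator(
            task_id=node["name"],
            bash_command=f"dbt build --select {node['name']}",  # run + test
        )
        for node_id, node in models.items()
    }
    for node_id, node in models.items():
        for parent in node["depends_on"]["nodes"]:
            if parent in tasks:  # skip sources/seeds outside the model set
                tasks[parent] >> tasks[node_id]
```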

Airflow 3 extends the deployment options to run your workload anywhere. You don’t need to bring your data to Airflow; you can bring the execution to where it needs to be. You can connect any cloud and on-prem location together and run a hybrid workflow from one central Airflow instance, with only an HTTP connection needed. We will present the use cases and concepts of the Edge deployment and how it also works in a hybrid setup with Celery or other executors.
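In a hybrid setup, work is directed to remote workers through Airflow's standard queue routing, as in the sketch below; the queue name and the edge-worker invocation shown in the comment are assumptions.

```python
# Sketch: route one task to a remote edge worker via its queue.
# The queue name and the worker command in the comment are assumptions.
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hybrid_edge_demo",
    schedule=None,
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
) as dag:
    # Runs on the default (e.g. Celery) workers in the central deployment.
    central = BashOperator(task_id="central_step", bash_command="echo central")

    # Picked up only by a remote worker listening on this queue,
    # e.g. one started with: airflow edge worker --queues onprem-dc1
    on_prem = BashOperator(
        task_id="onprem_step", bash_command="echo on-prem", queue="onprem-dc1"
    )

    central >> on_prem
```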

Discover how Apache Airflow powers scalable ELT pipelines, enabling seamless data ingestion, transformation, and machine learning-driven insights. This session will walk through:
- Automating data ingestion: using Airflow to orchestrate raw data ingestion from third-party sources into your data lake (S3, GCP), ensuring a steady pipeline of high-quality training and prediction data.
- Optimizing transformations with serverless computing: offloading intensive transformations to serverless functions (GCP Cloud Run, AWS Lambda) and machine learning models (BigQuery ML, SageMaker), integrating their outputs seamlessly into Airflow workflows.
- Real-world impact: a case study on how INTRVL leveraged Airflow, BigQuery ML, and Cloud Run to analyze early voting data in near real-time, generating actionable insights on voter behavior across swing states.
This talk not only provides a deep dive into the political tech space but also serves as a reference architecture for building robust, repeatable ELT pipelines. Attendees will gain insights into modern serverless technologies from AWS and GCP that enhance Airflow’s capabilities, helping data engineers design scalable, cloud-agnostic workflows.
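As one concrete flavor of the serverless offload, the Amazon provider ships an operator for invoking Lambda functions; the function name and payload below are assumptions.

```python
# Sketch: offload a transformation step to AWS Lambda from a DAG.
# Function name and payload are assumptions for illustration.
import json

import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator,
)

with DAG(
    dag_id="serverless_transform_demo",
    schedule="@hourly",
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    transform = LambdaInvokeFunctionOperator(
        task_id="transform_chunk",
        function_name="transform-early-votes",
        payload=json.dumps({"s3_key": "raw/latest.csv"}),
    )
```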

iKang Healthcare Group, serving nearly 10 million patients annually, built a centralized healthcare data hub powered by Apache Airflow to support its large-scale, real-time clinical operations. The platform integrates batch and streaming data in a lakehouse architecture, orchestrating complex workflows from data ingestion (HL7/FHIR) to clinical decision support. Healthcare data’s inherent complexity—spanning structured lab results to unstructured clinical notes—requires dynamic, reliable orchestration. iKang uses Airflow’s DAGs, extensibility, and workflow-as-code capabilities to address challenges like multi-system coordination, semantic data linking, and fault-tolerant automation. iKang extended Airflow with cross-DAG event triggers, task priority weights, LLM-driven clinical text processing, and a visual drag-and-drop DAG builder for medical teams. These innovations improved diagnostic turnaround, patient safety, and cross-system workflow visibility. iKang’s work demonstrates Airflow’s power in transforming healthcare data infrastructure and advancing intelligent, scalable patient care.
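Cross-DAG event triggers of the kind described can be approximated with Airflow's built-in Dataset scheduling, sketched below; the dataset URI, DAG ids, and priority value are assumptions, not iKang's implementation.

```python
# Sketch: dataset-driven cross-DAG triggering plus a task priority weight.
# URIs, DAG ids, and weights are assumptions.
import pendulum
from airflow import DAG, Dataset
from airflow.operators.bash import BashOperator

lab_results = Dataset("warehouse://clinical/lab_results")

with DAG(
    dag_id="ingest_hl7_demo",
    schedule="@hourly",
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    catchup=False,
):
    # Updating the outlet dataset signals every DAG scheduled on it.
    BashOperator(task_id="load", bash_command="echo load", outlets=[lab_results])

with DAG(
    dag_id="decision_support_demo",
    schedule=[lab_results],  # runs whenever the dataset above is updated
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    catchup=False,
):
    BashOperator(
        task_id="score",
        bash_command="echo score",
        priority_weight=10,  # higher weight is scheduled ahead of peers
    )
```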

This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
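A stripped-down sketch of that idea: a Pydantic model acts as the table's schema contract and as a factory for mock rows injected in place of the base table. The model, columns, and rendering below are assumptions, not the actual library.

```python
# Sketch: a Pydantic model as schema contract + mock-row factory for SQL tests.
# Model name, columns, and the VALUES rendering are assumptions.
from datetime import date

from pydantic import BaseModel


class Order(BaseModel):
    order_id: int
    customer_id: int
    order_date: date
    amount: float


def mock_rows() -> list[Order]:
    """Deterministic fixtures for a SQL unit test."""
    return [
        Order(order_id=1, customer_id=10, order_date=date(2025, 1, 1), amount=9.5),
        Order(order_id=2, customer_id=10, order_date=date(2025, 1, 2), amount=12.0),
    ]


def as_inline_table(rows: list[Order]) -> str:
    """Render fixtures as an inline VALUES table to substitute for the base table."""
    values = ", ".join(
        f"({r.order_id}, {r.customer_id}, DATE '{r.order_date}', {r.amount})"
        for r in rows
    )
    return f"(VALUES {values}) AS orders(order_id, customer_id, order_date, amount)"
```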

Apache Airflow’s REST API has evolved to support diverse orchestration needs, with managed services like MWAA introducing custom enhancements. One such feature, InvokeRestApi, enables dynamic interactions with external services while maintaining Airflow’s core orchestration capabilities. In this talk, we will explore the architectural design behind InvokeRestApi, detailing how it enhances API-driven workflows. Beyond the architecture, we’ll share key challenges and learnings from implementing and scaling Airflow’s REST API in production environments. Topics include authentication, performance considerations, error handling, and best practices for integrating external APIs efficiently. Attendees will gain a deeper understanding of Airflow’s API extensibility, its implications for workflow automation, and actionable insights for building robust, API-driven orchestration solutions. Whether you’re an Airflow user or an architect, this session will provide valuable takeaways for simplifying API interactions across Airflow environments.
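From the AWS side, InvokeRestApi is exposed through the MWAA API in boto3; a minimal sketch follows, with the environment name, region, and endpoint path as assumptions.

```python
# Sketch: call the Airflow REST API through MWAA's InvokeRestApi.
# Environment name, region, and path are assumptions.
import boto3

mwaa = boto3.client("mwaa", region_name="us-east-1")

response = mwaa.invoke_rest_api(
    Name="my-mwaa-environment",
    Method="POST",
    Path="/dags/daily_rollup/dagRuns",  # forwarded to the Airflow REST API
    Body={},
)
print(response["RestApiStatusCode"], response["RestApiResponse"])
```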

In this talk, we will introduce the DAG Management Service (DMS), developed to address critical challenges in managing Airflow clusters. With over 10,000 active DAGs, a single Airflow cluster faces scaling limits and noisy neighbor issues, impacting task scheduling SLAs. DMS enhances reliability by distributing DAGs across multiple clusters and enforcing proper configurations. We will also discuss how DMS streamlines Airflow version upgrades. Upgrading from an old Airflow version to the latest requires sequential updates and code modifications for over 10,000 DAGs. DMS proposes an efficient upgrade method, reducing dependency on users. Key functions of DMS include:
- DAG Deployment: selectively deploys DAG files from GitHub to Airflow clusters via an event-driven pipeline.
- DAG Migration: facilitates seamless DAG migration between clusters, supporting both cluster upgrades and team-specific deployments.
- Connections and Variables Management: centralizes management of connection IDs and variables, ensuring consistency and smooth migrations.
Join us to explore how DMS can revolutionize your Airflow DAG management, enhancing scalability, reliability, and efficiency.

Small retailers often lack the data visibility that larger companies rely on for decision-making. In this session, we’ll dive into how Apache Airflow powers end-to-end machine learning pipelines that process inventory and sales data, enabling retailers and suppliers to gain valuable industry insights. We’ll cover feature engineering, model training, and automated inference workflows, along with strategies for handling messy, incomplete retail data. We will discuss how Airflow enables scalable ML-driven insights that improve demand forecasting, product categorization, and supply chain optimization.
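A skeletal version of such a pipeline, expressed with Airflow's TaskFlow API, might look like the sketch below; the task boundaries and artifact paths are assumptions, not the presenters' implementation.

```python
# Skeleton of a feature-engineering -> training -> inference pipeline.
# Task boundaries and artifact paths are assumptions.
import pendulum
from airflow.decorators import dag, task


@dag(
    schedule="@weekly",
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    catchup=False,
)
def retail_ml_demo():
    @task
    def build_features() -> str:
        # Clean and impute messy point-of-sale data; return a features path.
        return "s3://example-bucket/features/latest.parquet"

    @task
    def train(features_path: str) -> str:
        # Fit the demand-forecasting model; return a model artifact path.
        return "s3://example-bucket/models/demand.pkl"

    @task
    def batch_inference(model_path: str) -> None:
        print(f"scoring with {model_path}")

    batch_inference(train(build_features()))


retail_ml_demo()
```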