talk-data.com

Topic

GitHub

version_control collaboration code_hosting

Activity Trend

Peak: 79 activities per quarter (2020-Q1 to 2026-Q1)

Top Events

ADSP: Algorithms + Data Structures = Programs · 154
Data Engineering Podcast · 123
DataTalks.Club · 104
Microsoft Ignite 2025 · 50
Data Skeptic · 13
O'Reilly Data Engineering Books · 12
DataTopics: All Things Data, AI & Tech · 11
O'Reilly Data Science Books · 11
Data + AI Summit 2025 · 10
SciPy 2025 · 10
Airflow Summit 2025 · 8
The Pragmatic Engineer · 8

Top Speakers

Conor Hoekstra · 154
Bryce Adelstein Lelbach (NVIDIA) · 148
Tobias Macey · 123
Ben Deane · 40
Sean Parent (Adobe) · 15
Kyle Polich · 13
Tristan Brindle (C++ London Uni) · 11
Gergely Orosz · 7
Zach Laine · 6
Mukundan Sankar · 6
Paulo Vasconcellos · 5
Michael YenChi Ho (Microsoft) · 5

Activities

Filtering by: Airflow Summit 2025

Benchmarking the Performance of Dynamically Generated DAGs

2025-07-01 · Airflow Summit 2025

session

by Rahul Vats, Tatiana Al-Chueyr Martins (Astronomer)

Airflow Astronomer Cosmos Kubernetes

As teams scale their Airflow workflows, a common question is: “My DAG has 5,000 tasks. How long will it take to run in Airflow?” Beyond execution time, users often face challenges with dynamically generated DAGs, such as:
- Delayed visualization in the Airflow UI after deployment.
- High resource consumption, leading to Kubernetes pod evictions and out-of-memory errors.
While estimating resource utilization in a distributed data platform is complex, benchmarking can provide crucial insights. In this talk, we’ll share our approach to benchmarking dynamically generated DAGs with Astronomer Cosmos (https://github.com/astronomer/astronomer-cosmos), covering:
- Designing representative and extensible baseline tests.
- Setting up an isolated, distributed infrastructure for benchmarking.
- Running reproducible performance tests.
- Measuring DAG run times and task throughput.
- Evaluating CPU & memory consumption to optimize deployments.
By the end of this session, you will have practical benchmarks and strategies for making informed decisions about DAG performance in Airflow.
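The kind of measurement described above (run time, throughput, memory) can be illustrated with a small, Airflow-free sketch. This is not the speakers' benchmark or any Cosmos API: `generate_tasks`, `benchmark`, and `BenchmarkResult` are hypothetical names, and the workload only builds an in-memory chain of task definitions rather than running a real DAG.

```python
import time
import tracemalloc
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    n_tasks: int
    seconds: float
    peak_kib: float

    @property
    def tasks_per_second(self) -> float:
        # Guard against a zero-length timing on very small inputs.
        return self.n_tasks / self.seconds if self.seconds > 0 else float("inf")


def generate_tasks(n: int) -> list:
    """Hypothetical workload: a linear chain of n task definitions."""
    return [
        {"task_id": f"task_{i}", "upstream": [f"task_{i - 1}"] if i else []}
        for i in range(n)
    ]


def benchmark(n_tasks: int) -> BenchmarkResult:
    """Time the workload and record peak traced memory in KiB."""
    tracemalloc.start()
    start = time.perf_counter()
    tasks = generate_tasks(n_tasks)
    seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return BenchmarkResult(len(tasks), seconds, peak_bytes / 1024)


if __name__ == "__main__":
    # Repeat over several sizes to see how cost grows with task count.
    for size in (100, 1_000, 5_000):
        result = benchmark(size)
        print(size, f"{result.seconds:.4f}s", f"{result.peak_kib:.0f} KiB")
```

Running the same function over increasing sizes gives a baseline curve; a real benchmark would replace `generate_tasks` with actual DAG parsing and execution on isolated infrastructure.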

Boosting dbt-core workflows performance with Airflow’s Deferrable capabilities

2025-07-01 · Airflow Summit 2025

session

by Pankaj Singh, Pankaj Koti, Tatiana Al-Chueyr Martins (Astronomer)

Airflow Astronomer Cloud Computing Cosmos dbt

Efficiently handling long-running workflows is crucial for scaling modern data pipelines. Apache Airflow’s deferrable operators help offload tasks during idle periods, freeing worker slots while still tracking progress. This session explores how Cosmos 1.9 (https://github.com/astronomer/astronomer-cosmos) integrates Airflow’s deferrable capabilities to improve dbt (https://github.com/dbt-labs/dbt-core) orchestration in production, with insights from the recent contributions that introduced this functionality. Key takeaways:
- Deferrable Operators: How they work and why they’re ideal for long-running dbt tasks.
- Integrating with Cosmos: Refactoring and enhancements to enable deferrable behaviour across platforms.
- Performance Gains: Resource savings and task throughput improvements from deferrable execution.
- Challenges & Future Enhancements: Lessons learned, compatibility, and ideas for broader support.
Whether orchestrating dbt models on a cloud warehouse or managing large-scale transformations, this session offers practical strategies to reduce resource contention and boost pipeline performance.
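The idea behind deferral can be illustrated without Airflow. The following is a minimal sketch, assuming a hypothetical `wait_for_job` coroutine that stands in for polling a long-running dbt job. It is not Cosmos or Airflow code; it only shows why waiting inside one event loop is cheaper than holding one blocked worker slot per waiting task.

```python
import asyncio
import time


async def wait_for_job(job_id: int, duration: float) -> str:
    """Hypothetical stand-in for polling an external job until it finishes."""
    await asyncio.sleep(duration)  # yields control while the job is idle
    return f"job-{job_id} done"


async def run_deferred(durations: list) -> list:
    """Wait on all jobs concurrently in a single event loop."""
    waits = [wait_for_job(i, d) for i, d in enumerate(durations)]
    return await asyncio.gather(*waits)  # results keep input order


def run_blocking(durations: list) -> list:
    """Contrast case: a single slot that blocks on each job in turn."""
    results = []
    for i, d in enumerate(durations):
        time.sleep(d)  # the slot is occupied for the whole wait
        results.append(f"job-{i} done")
    return results


def timed(fn, *args):
    """Return (result, elapsed seconds) for a call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    durations = [0.2, 0.2, 0.2]
    _, blocking_s = timed(run_blocking, durations)
    _, deferred_s = timed(lambda d: asyncio.run(run_deferred(d)), durations)
    print(f"blocking: {blocking_s:.2f}s, deferred: {deferred_s:.2f}s")
```

In the blocking version the total wait grows with the number of jobs; in the deferred version the waits overlap, which is the resource saving the abstract refers to.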

DAGnostics: Shift-Left Airflow Governance with Policy Enforcement Framework

2025-07-01 · Airflow Summit 2025

session

by Yifan (Stefan) Wang (LinkedIn)

Airflow CI/CD Python

DAGnostics seamlessly integrates Airflow Cluster Policy hooks to enforce governance from local DAG authoring through CI pipelines to production runtime. Learn how it closes validation gaps, collapses feedback loops from hours to seconds, and ensures consistent policies across stages. We examine current runtime-only enforcement and fractured CI checks, then unveil our architecture: a pluggable policy registry via Airflow entry points, local static analysis for pre-commit validation, GitHub Actions CI integration, and runtime hook enforcement. See real-world use cases: alerting standards, resource quotas, naming conventions, and exemption handling. Next, dive into implementation: authoring policies in Python, auto-discovery, cross-environment enforcement, upstream contribution, and testing strategies. We share LinkedIn’s metrics—2,000+ DAG repos, 10,000+ daily executions supporting trunk-based development across isolated teams/use-cases, and 78% fewer runtime violations—and lessons learned scaling policy-as-code at enterprise scale. Leave with a blueprint to adopt DAGnostics and strengthen your Airflow governance while preserving full compatibility with existing systems.
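The policy-as-code idea can be sketched in a few lines, assuming DAG metadata is available as a plain dictionary. The registry, decorator, and policy names below are hypothetical illustrations, not the DAGnostics or Airflow API; the point is only that the same Python checks could run in pre-commit, in CI, and at runtime.

```python
from typing import Callable, Dict, List, Optional

# A policy takes DAG metadata and returns a violation message, or None if it passes.
Policy = Callable[[dict], Optional[str]]

# Hypothetical in-process registry; a real system might discover entries via entry points.
POLICY_REGISTRY: Dict[str, Policy] = {}


def policy(name: str) -> Callable[[Policy], Policy]:
    """Decorator that registers a policy function under a name."""
    def register(fn: Policy) -> Policy:
        POLICY_REGISTRY[name] = fn
        return fn
    return register


@policy("naming-convention")
def check_naming(dag: dict) -> Optional[str]:
    dag_id = dag.get("dag_id", "")
    if not dag_id.islower() or " " in dag_id:
        return f"dag_id {dag_id!r} must be lowercase with no spaces"
    return None


@policy("alerting-standard")
def check_alerting(dag: dict) -> Optional[str]:
    if not dag.get("on_failure_callback"):
        return "DAG must define an on_failure_callback"
    return None


def evaluate(dag: dict, exemptions: frozenset = frozenset()) -> List[str]:
    """Run every registered policy that is not exempted and collect violations."""
    violations = []
    for name, check in POLICY_REGISTRY.items():
        if name in exemptions:
            continue
        message = check(dag)
        if message:
            violations.append(f"{name}: {message}")
    return violations
```

Calling `evaluate` from a pre-commit hook, a CI job, and a runtime hook would give the same verdict at each stage, which is the consistency the abstract describes.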

Dynamic DAGs and Data Quality using DAGFactory

2025-07-01 · Airflow Summit 2025

session

by Gangfeng Huang, Ashir Alam

Airflow BI Cloud Computing Data Quality GCP Cloud Composer Python YAML

We run a similar pattern of DAGs for different data quality dimensions such as accuracy, timeliness, and completeness. Building each one by hand means duplicating code and risking human error through copy-and-paste, or making people write the same code again. To solve this, we do two things:
- Generate DAGs dynamically with DAGFactory, using only YAML to describe every step we want to run in our DQ checks.
- Hide this behind a UI hooked to the GitHub PR-open step: the user provides a few inputs or selects from dropdowns, and a YAML DAG is generated for them.
This highlights DAGFactory's potential to hide Airflow Python code from users, making Airflow more accessible to data analysts and business intelligence teams as well as software engineers, while reducing human error. YAML is well suited to generating code and creating a PR, and DAGFactory is the perfect fit for that. All of this runs in GCP Cloud Composer.
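The step from UI inputs to a YAML DAG definition can be sketched as follows. This is an assumption-laden illustration, not the speakers' implementation: `build_dq_dag` and `to_yaml` are hypothetical helpers, the key names (`default_args`, `schedule_interval`, `tasks`, `operator`, `dependencies`) are modeled loosely on a dag-factory-style layout and may differ between versions, and the commands are placeholders.

```python
import json


def to_yaml(value: dict, indent: int = 0) -> str:
    """Tiny YAML emitter for nested dicts whose leaves are scalars or flat lists."""
    pad = "  " * indent
    lines = []
    for key, item in value.items():
        if isinstance(item, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(item, indent + 1))
        else:
            # JSON scalars and flat lists are also valid YAML flow syntax.
            lines.append(f"{pad}{key}: {json.dumps(item)}")
    return "\n".join(lines)


def build_dq_dag(table: str, dimension: str, schedule: str) -> dict:
    """Turn three UI inputs into a DAG config (key names are assumptions)."""
    dag_id = f"dq_{dimension}_{table}"
    operator = "airflow.operators.bash.BashOperator"
    return {
        dag_id: {
            "default_args": {"owner": "data-quality", "start_date": "2025-01-01"},
            "schedule_interval": schedule,
            "tasks": {
                "run_check": {
                    "operator": operator,
                    "bash_command": f"echo checking {dimension} of {table}",
                },
                "publish_result": {
                    "operator": operator,
                    "bash_command": "echo publishing result",
                    "dependencies": ["run_check"],
                },
            },
        }
    }


if __name__ == "__main__":
    # The printed text is what a UI could commit in a pull request.
    print(to_yaml(build_dq_dag("orders", "completeness", "@daily")))
```

Because the user only supplies the table, dimension, and schedule, every generated file has the same shape, which is where the reduction in copy-and-paste errors would come from.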

Enhancing DAG Management with DMS: A Scalable Solution for Airflow

2025-07-01 · Airflow Summit 2025

session

by DaeHoon Song, Sungji Yang

Airflow

In this talk, we will introduce the DAG Management Service (DMS), developed to address critical challenges in managing Airflow clusters. With over 10,000 active DAGs, a single Airflow cluster faces scaling limits and noisy neighbor issues, impacting task scheduling SLAs. DMS enhances reliability by distributing DAGs across multiple clusters and enforcing proper configurations. We will also discuss how DMS streamlines Airflow version upgrades. Upgrading from an old Airflow version to the latest requires sequential updates and code modifications for over 10,000 DAGs. DMS proposes an efficient upgrade method, reducing dependency on users. Key functions of DMS include:
- DAG Deployment: Selectively deploys DAG files from GitHub to Airflow clusters via an event-driven pipeline.
- DAG Migration: Facilitates seamless DAG migration between clusters, supporting both cluster upgrades and team-specific deployments.
- Connections and Variables Management: Centralizes management of connection IDs and variables, ensuring consistency and smooth migrations.
Join us to explore how DMS can revolutionize your Airflow DAG management, enhancing scalability, reliability, and efficiency.

From DAGs to Insights: Business-Driven Airflow Use Cases

2025-07-01 · Airflow Summit 2025

session

by Tala Karadsheh

Airflow

Airflow is integral to GitHub’s data and insight generation. This session dives into GitHub use cases in which key business decisions are driven, at their root, by Airflow. The session will also highlight how both GitHub and Airflow celebrate, promote, and nurture OSS innovations in their own ways.

GitHub's Airflow Journey: Lessons, Mistakes, and Insights

2025-07-01 · Airflow Summit 2025

session

by Oleksandr Slynko

Airflow Cloud Computing Data Engineering

This session explores how GitHub uses Apache Airflow for efficient data engineering. We will share nearly 9 years of experience, including lessons learnt, mistakes made, and the ways we reduced our on-call and engineering burden. We’ll demonstrate how we keep data flowing smoothly while continuously evolving Airflow and other components of our data platform, ensuring safety and reliability. The session will touch on how we migrate Airflow between clouds without user impact. We’ll also cover how we cut down the time from idea to running a DAG in production, despite our Airflow repo being among the top 15 by number of PRs within GitHub. We’ll dive into specific techniques such as testing connections and operators, relying on dag-sync, providing short-lived development environments to let developers test their DAG runs, and creating reusable patterns for DAGs. By the end of this session, you will gain practical insights and actionable strategies to improve your own data engineering processes.

The Secret to Airflow's Evergreen Build: CI/CD magic

2025-07-01 · Airflow Summit 2025

session

by Jarek Potiuk (Apache Software Foundation), Pavan kumar Gopidesu, Amogh Rajesh Desai

Airflow API CI/CD

Have you ever wondered why Apache Airflow builds are asymptotically(*) green? That drive for a “perennial green build” is not magic; it’s the result of continuous, often unseen engineering effort within our CI/CD pipelines & dev environments. This dedication ensures that maintainers can work efficiently & contributors can onboard smoothly. To keep up with the ever-growing contributor base, a volunteer-run CI/CD team puts significant work into the foundational tooling. In this talk, we reveal some of the innovative solutions we have implemented:
- Handling GitHub Actions pull_request_target challenges
- Restructuring the repo for better clarity
- Slack bot for CI failure alerts
- A cherry picker workflow for releases
- Pre-commit hooks
- Faster website and image builds
- Tackling the new GitHub API rate limits
- Solving chicken-and-egg build issues during releases
Join us to understand the “why” & “how” behind these infra components. You’ll gain insights into the continuous effort required to support a thriving open-source project like Airflow and, hopefully, be inspired to contribute to these areas.
(*) asymptotically = we fix failures as quickly as we can when they happen