As teams scale their Airflow workflows, a common question is: “My DAG has 5,000 tasks. How long will it take to run in Airflow?” Beyond execution time, users often face challenges with dynamically generated DAGs, such as delayed visualization in the Airflow UI after deployment and high resource consumption, leading to Kubernetes pod evictions and out-of-memory errors. While estimating resource utilization in a distributed data platform is complex, benchmarking can provide crucial insights. In this talk, we’ll share our approach to benchmarking dynamically generated DAGs with Astronomer Cosmos (https://github.com/astronomer/astronomer-cosmos), covering:
- Designing representative and extensible baseline tests
- Setting up an isolated, distributed infrastructure for benchmarking
- Running reproducible performance tests
- Measuring DAG run times and task throughput
- Evaluating CPU and memory consumption to optimize deployments
By the end of this session, you will have practical benchmarks and strategies for making informed decisions when evaluating the performance of DAGs in Airflow.
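To make the idea of a baseline test concrete, here is a minimal sketch of a dynamically generated benchmark DAG. It is not taken from the talk; the dag_id, task count, and use of dynamic task mapping are illustrative assumptions, assuming Airflow 2.4+ with the TaskFlow API.

```python
"""A minimal baseline DAG for benchmarking: thousands of lightweight tasks
generated dynamically. The dag_id and task count are illustrative."""
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="benchmark_baseline_5000_tasks",
    start_date=datetime(2025, 1, 1),
    schedule=None,  # triggered manually so runs are reproducible
    catchup=False,
    tags=["benchmark"],
)
def benchmark_baseline():
    @task
    def noop(i: int) -> int:
        # Deliberately trivial: isolates scheduler/executor overhead
        # from actual task runtime.
        return i

    # Dynamic task mapping keeps the DAG file small while producing
    # thousands of task instances at run time.
    noop.expand(i=list(range(5000)))


benchmark_baseline()
```

Triggering this DAG manually and recording run duration, task throughput, and worker CPU/memory gives a repeatable baseline against which Cosmos-generated DAGs can be compared.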
Efficiently handling long-running workflows is crucial for scaling modern data pipelines. Apache Airflow’s deferrable operators help offload tasks during idle periods, freeing worker slots while still tracking progress. This session explores how Cosmos 1.9 (https://github.com/astronomer/astronomer-cosmos) integrates Airflow’s deferrable capabilities to enhance orchestrating dbt (https://github.com/dbt-labs/dbt-core) in production, with insights from the recent contributions that introduced this functionality. Key takeaways:
- Deferrable operators: how they work and why they’re ideal for long-running dbt tasks
- Integrating with Cosmos: refactoring and enhancements to enable deferrable behaviour across platforms
- Performance gains: resource savings and task throughput improvements from deferrable execution
- Challenges and future enhancements: lessons learned, compatibility, and ideas for broader support
Whether orchestrating dbt models on a cloud warehouse or managing large-scale transformations, this session offers practical strategies to reduce resource contention and boost pipeline performance.
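For readers unfamiliar with the defer/resume mechanics the session builds on, here is a minimal, generic deferrable-operator sketch. It is not Cosmos's actual implementation; the operator name and polling interval are hypothetical, and only standard Airflow APIs are used.

```python
"""A generic deferrable operator sketch: the worker slot is released while
the trigger waits in the triggerer process."""
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger
from airflow.utils.context import Context


class WaitForRemoteJobOperator(BaseOperator):  # hypothetical example name
    def __init__(self, poll_interval: int = 300, **kwargs):
        super().__init__(**kwargs)
        self.poll_interval = poll_interval

    def execute(self, context: Context):
        # Instead of sleeping on a worker, hand control to the triggerer
        # and free the worker slot until the trigger fires.
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(seconds=self.poll_interval)),
            method_name="execute_complete",
        )

    def execute_complete(self, context: Context, event=None):
        # Runs on a (possibly different) worker once the trigger fires;
        # a real operator would check the remote job's status here.
        self.log.info("Remote job finished, event=%s", event)
```

The resource savings come from the fact that thousands of long-running dbt tasks can wait in a single triggerer process instead of each occupying a worker slot.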
As organizations scale their data infrastructure, Apache Airflow becomes a mission-critical component for orchestrating workflows efficiently. But scaling Airflow successfully isn’t just about running pipelines—it’s about building a Center of Excellence (CoE) that empowers teams with the right strategy, best practices, and long-term enablement. Join Jon Leek and Michelle Winters as they share their experiences helping customers design and implement Airflow Centers of Excellence. They’ll walk through real-world challenges, best practices, and the structured approach Astronomer takes to ensure teams have the right plan, resources, and support to succeed. Whether you’re just starting with Airflow or looking to optimize and scale your workflows, this session will give you a proven framework to build a sustainable Airflow Center of Excellence within your organization. 🚀
We’re excited to offer Airflow Summit 2025 attendees an exclusive opportunity to earn their DAG Authoring certification in person, now updated to include all the latest Airflow 3.0 features. This certification workshop comes at no additional cost to summit attendees. The DAG Authoring for Apache Airflow certification validates your expertise in advanced Airflow concepts and demonstrates your ability to build production-grade data pipelines. It covers the TaskFlow API, dynamic task mapping, templating, asset-driven scheduling, best practices for production DAGs, and new Airflow 3.0 features and optimizations. The certification session includes:
- 20-minute preparation period with expert guidance
- Live Q&A session with Marc Lamberti from Astronomer
- 60-minute examination period
- Real-time results and immediate feedback
To prepare for the Airflow Certification, visit the Astronomer Academy (https://academy.astronomer.io/page/astronomer-certification).
Airflow 3.0 is the most significant release in the project’s history, bringing a better user experience, stronger security, and the ability to run tasks anywhere, at any time. In this workshop, you’ll get hands-on experience with the new release and learn how to leverage new features like DAG versioning, backfills, data assets, and a new React-based UI. Whether you’re writing traditional ELT/ETL pipelines or complex ML and GenAI workflows, you’ll learn how Airflow 3 will make your day-to-day work smoother and your pipelines even more flexible. This workshop is suitable for intermediate to advanced Airflow users. Beginning users should consider taking the Airflow fundamentals course on the Astronomer Academy before attending this workshop.
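As a taste of the data-assets feature mentioned above, here is a minimal asset-driven scheduling sketch. It assumes Airflow 3's `airflow.sdk` interface (in Airflow 2.x the analogous concept is `Dataset` from `airflow.datasets`); the asset URI and DAG names are placeholders, not workshop material.

```python
"""A minimal asset-driven scheduling sketch: one DAG produces an asset,
another runs whenever that asset is updated."""
from datetime import datetime

from airflow.sdk import Asset, dag, task

raw_orders = Asset("s3://example-bucket/raw/orders")  # illustrative URI


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def produce_orders():
    @task(outlets=[raw_orders])
    def extract():
        # Emitting the asset event tells the scheduler that downstream
        # consumers may now run.
        ...

    extract()


@dag(schedule=[raw_orders], start_date=datetime(2025, 1, 1), catchup=False)
def transform_orders():
    @task
    def build_models():
        ...

    build_models()


produce_orders()
transform_orders()
```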
Airflow 3 brings several exciting new features that better support MLOps:
- Native, intuitive backfills
- Removal of the unique execution date for DAG runs
- Native support for event-driven scheduling
These features, combined with the Airflow AI SDK, enable DAG authors to easily build scalable, maintainable, and performant LLMOps pipelines. In this talk, we’ll go through a series of workflows that use the Airflow AI SDK to empower Astronomer’s support staff to more quickly resolve problems faced by Astronomer’s customers.
Ensuring high-quality data is essential for building user trust and enabling data teams to work efficiently. In this talk, we’ll explore how the Astronomer data team leverages Airflow to uphold data quality across complex pipelines, minimizing firefighting and maximizing confidence in reported metrics. Maintaining data quality requires a multi-faceted approach: safeguarding the integrity of source data, orchestrating pipelines reliably, writing robust code, and maintaining consistency in outputs. We’ve embedded data quality into the developer experience (DevEx), so it’s always at the forefront instead of sitting in the backlog of tech debt. We’ll share how we’ve operationalized:
- Implementing data contracts to define and enforce expectations
- Differentiating between critical (pipeline-blocking) and non-critical (soft) failures
- Exposing upstream data issues to domain owners
- Tracking metrics to measure our team’s overall data quality
Join us to learn practical strategies for building scalable, trustworthy data systems powered by Airflow.
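To illustrate the critical-versus-soft distinction above, here is a generic sketch of how such checks can be expressed in an Airflow DAG. It is not the Astronomer data team's actual implementation; the check logic, thresholds, and task names are placeholders.

```python
"""A generic sketch of critical (pipeline-blocking) vs. non-critical (soft)
data-quality checks. Check logic and thresholds are placeholders."""
from datetime import datetime

from airflow.decorators import dag, task
from airflow.exceptions import AirflowFailException


@dag(start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False)
def orders_quality_pipeline():
    @task
    def load_orders():
        ...

    @task
    def critical_check():
        row_count = 0  # placeholder: query the warehouse here
        if row_count == 0:
            # Blocking failure: downstream tasks must not run.
            raise AirflowFailException("orders table is empty")

    @task
    def soft_check():
        null_ratio = 0.02  # placeholder: computed from the warehouse
        if null_ratio > 0.01:
            # Non-blocking: surface the issue (e.g. notify the domain
            # owner) but let the pipeline continue.
            print(f"WARNING: null ratio {null_ratio:.1%} exceeds threshold")

    @task
    def publish_metrics():
        ...

    load_orders() >> [critical_check(), soft_check()] >> publish_metrics()


orders_quality_pipeline()
```

With the default trigger rule, a critical-check failure blocks `publish_metrics`, while the soft check only logs a warning and lets the run complete.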
The City of Pittsburgh utilizes Airflow (via Astronomer) for a wide variety of tasks. From employee-focused use cases, like time bank balancing and internal dashboards, to public-facing data publication, the City’s data flows through our DAGs from many sources to many destinations. Airflow acts as a funnel point and is an essential tool for Pittsburgh’s Data Services team.
As a popular open-source library for analytics engineering, dbt is often combined with Airflow. Orchestrating and executing dbt models as DAGs adds a layer of control over tasks and observability, and provides a reliable, scalable environment to run dbt models. This workshop is a step-by-step guide to Cosmos, a popular open-source package from Astronomer that helps you quickly run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code (a minimal configuration sketch follows the list below). We’ll walk through:
- Running and visualising your dbt transformations
- Managing dependency conflicts
- Defining database credentials (profiles)
- Configuring source and test nodes
- Using dbt selectors
- Customising arguments per model
- Addressing performance challenges
- Leveraging deferrable operators
- Visualising dbt docs in the Airflow UI
- An example of how to deploy to production
- Troubleshooting
We encourage participants to bring their own dbt project to follow along with this step-by-step workshop.
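Here is the kind of "few lines of code" the workshop refers to, sketched with Cosmos's public classes. The project path, connection ID, profile names, and schedule are assumptions for illustration, not workshop-specific values.

```python
"""A minimal Cosmos sketch that renders a dbt project as an Airflow DAG.
Paths, connection ID, and profile names are placeholders."""
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

jaffle_shop = DbtDag(
    dag_id="jaffle_shop",
    project_config=ProjectConfig("/usr/local/airflow/dbt/jaffle_shop"),
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="dev",
        # Builds profiles.yml from an Airflow connection at runtime.
        profile_mapping=PostgresUserPasswordProfileMapping(
            conn_id="postgres_default",
            profile_args={"schema": "public"},
        ),
    ),
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
)
```

Each dbt model becomes its own Airflow task (with an accompanying test task where tests are defined), which is what gives the per-model observability and retry control described above.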
The role of data teams and data engineers is evolving. No longer just pipeline builders or dashboard creators, today’s data teams must evolve to drive business strategy, enable automation, and scale with growing demands. Best practices from the DevOps movement in software engineering (Agile development, CI/CD, and infrastructure-as-code) are gradually making their way into data engineering. We believe these changes have led to the rise of DataOps and a new wave of best practices that will transform the discipline of data engineering. But how do you transform a reactive team into a proactive force for innovation? We’ll explore the key principles for building a resilient, high-impact data team, from structuring for collaboration, testing, and automation to leveraging modern orchestration tools. Whether you’re leading a team or looking to future-proof your career, you’ll walk away with actionable insights on how to stay ahead in the rapidly changing data landscape.
Airflow 3 is here, bringing a new era of flexibility, scalability, and security to data orchestration. This release makes building, running, and managing data pipelines easier than ever. In this session, we will cover the key benefits of Airflow 3, including:
(1) Ease of use: Airflow 3 rethinks the user experience, from an intuitive, upgraded UI to DAG versioning and scheduler-integrated backfills that let teams manage pipelines more effectively than ever before.
(2) Stronger security: By decoupling task execution from direct database connections, Airflow 3 enforces task isolation and minimal-privilege access. This meets stringent compliance standards while reducing the risk of unauthorized data exposure.
(3) Ultimate flexibility: Run tasks anywhere, anytime with remote execution and event-driven scheduling. Airflow 3 is designed for global, heterogeneous modern data environments, with an architecture that supports everything from edge and hybrid-cloud deployments to GPU-based workloads.
Ray is an open-source framework for scaling Python applications, particularly machine learning and AI workloads. It provides the layer for parallel processing and distributed computing. Many large language models (LLMs), including OpenAI's GPT models, are trained using Ray.
Apache Airflow, on the other hand, is a mature, widely adopted data orchestration framework, downloaded more than 20 million times per month.
This talk presents the Airflow Ray provider package that allows users to interact with Ray from an Airflow workflow. In this talk, I'll show how to use the package to create Ray clusters and how Airflow can trigger Ray pipelines in those clusters.
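As background for the pattern the talk demonstrates, here is a generic sketch of triggering a Ray pipeline from an Airflow task using Ray's core API directly. It does not use the provider package's operators; the cluster address and workload are illustrative assumptions.

```python
"""A generic sketch of running a Ray workload from an Airflow task using
Ray's core API (not the Ray provider's operators). The cluster address
is a placeholder."""
from datetime import datetime

import ray
from airflow.decorators import dag, task


@dag(start_date=datetime(2025, 1, 1), schedule=None, catchup=False)
def ray_pipeline():
    @task
    def run_on_ray() -> int:
        # Connect to an existing Ray cluster (address is illustrative).
        ray.init(address="ray://ray-head.example.internal:10001")

        @ray.remote
        def square(x: int) -> int:
            return x * x

        # Fan the work out across the cluster and gather the results.
        results = ray.get([square.remote(i) for i in range(100)])
        ray.shutdown()
        return sum(results)

    run_on_ray()


ray_pipeline()
```

The provider package discussed in the talk wraps this kind of interaction in dedicated operators and can also provision the Ray cluster itself.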
Summary: In this episode of the Data Engineering Podcast, Pete DeJoy, co-founder and product lead at Astronomer, talks about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow 3. Pete shares his journey into data engineering, discusses Astronomer's contributions to the Airflow project, and highlights the critical role of Airflow in powering operational data products. He covers the evolution of Airflow, its position in the data ecosystem, and the challenges faced by data engineers, including infrastructure management and observability. The conversation also touches on the upcoming Airflow 3 release, which introduces data awareness, architectural improvements, and multi-language support, and Astronomer's observability suite, Astro Observe, which provides insights and proactive recommendations for Airflow users.
Announcements:
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey, and today I'm interviewing Pete DeJoy about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow 3.
Interview:
- Introduction
- Can you describe what Astronomer is and the story behind it?
- How would you characterize the relationship between Airflow and Astronomer?
- Astronomer just released your State of Airflow 2025 Report yesterday, and it is the largest data engineering survey ever, with over 5,000 respondents. Can you talk a bit about the top-level findings in the report?
- What about the overall growth of the Airflow project over time?
- How have the focus and features of Astronomer changed since it was last featured on the show in 2017?
- Astro Observe went GA in early February; what does the addition of pipeline observability mean for your customers? What other capabilities similar in scope to observability is Astronomer looking at adding to the platform?
- Why is Airflow so critical in providing an elevated observability (or cataloging, or something similar) experience in a DataOps platform?
- What are the notable evolutions in the Airflow project and ecosystem in that time?
- What are the core improvements that are planned for Airflow 3.0?
- What are the most interesting, innovative, or unexpected ways that you have seen Astro used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airflow and Astro?
- What do you have planned for the future of Astro/Astronomer/Airflow?
Contact Info: LinkedIn
Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements:
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links: Astronomer, Airflow, Maxime Beauchemin, MongoDB, Databricks, Confluent, Spark, Kafka, Dagster, Podcast Episode, Prefect, Airflow 3, The Rise of the Data Engineer blog post, dbt, Jupyter Notebook, Zapier, cosmos library for dbt in Airflow, Ruff, Airflow Custom Operator, Snowflake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.
Ford Motor Company operates extensively across many nations. The Data Operations (DataOps) team for Advanced Driver Assistance Systems (ADAS) at Ford is tasked with processing terabyte-scale daily data from lidar, radar, and video. To manage this, the DataOps team must orchestrate diverse, compute-intensive pipelines across both on-premises infrastructure and GCP, and handle sensitive customer data across both environments. The team is also responsible for facilitating the execution of on-demand, compute-intensive algorithms at scale. To achieve these objectives, the team employs Astronomer/Airflow at the core of its strategic approach. This involves various deployments of Astronomer/Airflow that integrate seamlessly and securely (via Apigee) to initiate batch data processing and ML jobs in the cloud, as well as compute-intensive computer vision tasks on-premises, with essential alerting provided through the ELK stack. This presentation will delve into the architecture and strategic planning surrounding the hybrid batch router, highlighting its pivotal role in promoting rapid innovation and scalability in the development of ADAS features.
Astronomer’s data team recently underwent a major shift in how we work with Airflow. We’ll dive deep into the challenges that prompted that change, how we addressed them, and where we are now. This re-architecture included:
- Switching to dataset scheduling and micro-pipelines to minimize failures and increase reliability
- Implementing a Control DAG for complex dependency management and full end-to-end pipeline visibility (a sketch of this pattern follows the list)
- Standardizing Task Groups for quick onboarding and scalability
With Airflow managing itself, we can once again focus on the data rather than the operational overhead. As proof, we’ll share our favorite statistics from the terabyte of data we process daily, revealing insights into how the world’s data teams use Airflow.
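One possible reading of the Control DAG pattern, not necessarily the team's exact implementation, is a coordinating DAG that triggers each micro-pipeline and waits for it to complete, so the whole chain is visible in one place. The dag_ids below are placeholders, and the imports assume Airflow 2.x (in Airflow 3 the operator ships with the standard provider).

```python
"""A sketch of a "Control DAG": a coordinating DAG that triggers
micro-pipelines and waits for each to finish. dag_ids are placeholders."""
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.trigger_dagrun import TriggerDagRunOperator


@dag(start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False)
def control_dag():
    ingest = TriggerDagRunOperator(
        task_id="trigger_ingest",
        trigger_dag_id="micro_pipeline_ingest",
        wait_for_completion=True,   # block until the child run finishes
        poke_interval=60,
        deferrable=True,            # wait in the triggerer, not a worker
    )
    transform = TriggerDagRunOperator(
        task_id="trigger_transform",
        trigger_dag_id="micro_pipeline_transform",
        wait_for_completion=True,
        poke_interval=60,
        deferrable=True,
    )
    ingest >> transform


control_dag()
```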
Data engineers have shifted from delivering data for internal analytics applications to customer-facing data products. And with that shift comes a whole new level of operational rigor necessary to instill trust and confidence in the data. How do you hold data pipelines to the same standards as traditional software applications? Can you apply principles learned from the field of SRE to the world of data? In this talk, we’ll explore how we’ve seen this evolve in Astronomer’s customer base and highlight best practices learned from the most critical data product applications we’ve seen. We’ll hear from Astronomer’s own data team as they went through the transformation from analytics to data products. And we’ll showcase a new product we’re building to help data teams around the world solve exactly this problem!
Airflow operators are a core feature of Apache Airflow, so it is extremely important that we maintain a high quality of operators and prevent regressions. At the same time, we want automated test results to help developers double-check that introduced changes don’t cause regressions or backward-incompatible changes, and to give Airflow release managers the information they need to decide whether a given version of a provider is ready to be released. Recently, a new approach to assuring production quality was implemented for AWS, Google, and Astronomer-provided operators: standalone Continuous Integration processes were configured for them, and test results dashboards show the results of the latest test runs. What has been working well for these operator providers might be a pattern for others to follow. During this presentation, AWS, Google, and Astronomer engineers will share information about the internals of the test dashboards implemented for AWS, Google, and Astronomer-provided operators. This approach might serve as a ‘blueprint’ for other providers.
This talk will cover how we use Airflow at the heart of our Workflow Management Platform (WFM) at Booking.com, enabling our internal users to orchestrate big data workflows on the Booking Data Exchange (BDX). High-level overview of the talk:
- Adapting the open-source Airflow Helm chart to spin up an Airflow installation in the Booking Kubernetes Service (BKS)
- Coming up with a workflow definition format (YAML)
- Converting workflow.yaml files into workflow.py DAGs (a sketch of this conversion follows the list)
- Using deferrable operators to provide standard step templates to users
- Workspaces (collections of workflows), used to ensure role-based access to DAG permissions for users
- Using Okta for authentication
- Alerting, monitoring, and logging
- Plans to shift to Astronomer
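To illustrate the YAML-to-DAG conversion step, here is a hedged sketch of generating DAGs from workflow.yaml files at parse time. The YAML schema, field names, and directory layout are assumptions for illustration, not Booking.com's actual format.

```python
"""A sketch of converting workflow.yaml files into Airflow DAGs at parse
time. The YAML schema and directory layout are illustrative assumptions."""
from datetime import datetime
from pathlib import Path

import yaml
from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator

WORKFLOWS_DIR = Path("/opt/airflow/workflows")  # illustrative location


def build_dag(spec: dict):
    @dag(
        dag_id=spec["name"],
        schedule=spec.get("schedule"),
        start_date=datetime(2025, 1, 1),
        catchup=False,
    )
    def generated():
        # One task per declared step (placeholders here; a real platform
        # would map step types to concrete or deferrable operators).
        tasks = {
            step["name"]: EmptyOperator(task_id=step["name"])
            for step in spec["steps"]
        }
        # Wire the dependencies declared in the YAML.
        for step in spec["steps"]:
            for upstream in step.get("depends_on", []):
                tasks[upstream] >> tasks[step["name"]]

    return generated()


# One DAG per workflow.yaml; each is registered in module globals so the
# Airflow DAG processor can discover it.
for workflow_file in WORKFLOWS_DIR.glob("*/workflow.yaml"):
    spec = yaml.safe_load(workflow_file.read_text())
    globals()[spec["name"]] = build_dag(spec)
```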
The integration between dbt and Airflow is a popular topic in the community, both in previous editions of Airflow Summit, at Coalesce, and in the #airflow-dbt Slack channel. Astronomer Cosmos (https://github.com/astronomer/astronomer-cosmos/) stands out as one of the libraries that strives to enhance this integration, with over 300k downloads per month. During its development, we’ve encountered various performance challenges in terms of scheduling and task execution. While we’ve managed to address some, others remain to be resolved. This talk describes how Cosmos works, the improvements made over the last 1.5 years, and the roadmap. It also aims to collect feedback from the community on how we can further improve the experience of running dbt in Airflow.
In today’s data-driven era, ensuring data reliability and enhancing our testing and development capabilities are paramount. Local unit testing has its merits but falls short when dealing with the volume of big data. One major challenge is running Spark jobs pre-deployment to ensure they produce expected results and handle production-level data volumes. In this talk, we will discuss how Autodesk leveraged Astronomer to improve pipeline development. We’ll explore how it addresses challenges with sensitive and large data sets that cannot be transferred to local machines or non-production environments. Additionally, we’ll cover how this approach supports over 10 engineers working simultaneously on different feature branches within the same repo. We will highlight the benefits, such as conflict-free development and testing, and eliminating concerns about data corruption when running DAGs on production Airflow servers. Join me to discover how solutions like Astronomer empower developers to work with increased efficiency and reliability. This talk is perfect for those interested in big data, cloud solutions, and innovative development practices.