talk-data.com

Topic: Data Engineering

Tags: etl, data_pipelines, big_data (16 tagged)

Activity Trend: 127 peak/qtr (2020-Q1 to 2026-Q1)

Activities

Filtering by: Airflow Summit 2025

Want to take your DAGs in Apache Airflow to the next level? In this insightful session we’ll uncover five transformative strategies to enhance your data workflows. Whether you’re a data engineering pro or just getting started, this presentation is packed with practical tips and actionable insights you can apply right away. We’ll dive into the magic of powerful libraries like Pandas, share techniques to trim down data volumes for faster processing, and highlight the importance of modularizing your code for easier maintenance. Plus, you’ll discover efficient ways to monitor and debug your DAGs, and how to make the most of Airflow’s built-in features. By the end of this session, you’ll have a toolkit of strategies to boost the efficiency and performance of your DAGs, making your data processing tasks smoother and more effective. Don’t miss this opportunity to elevate your Airflow DAGs!
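One of the volume-trimming techniques the abstract alludes to can be sketched with pandas: read only the columns downstream tasks actually need, then downcast the default 64-bit numeric dtypes. The column names and data below are illustrative, not taken from the talk.

```python
import io

import pandas as pd

# Illustrative in-memory CSV; in a real pipeline this would be a large extract.
raw = io.StringIO("user_id,score,region\n1,10,eu\n2,12,us\n3,11,eu\n")

# Read only the columns the downstream tasks need...
df = pd.read_csv(raw, usecols=["user_id", "score"])

# ...then downcast the 64-bit defaults to the smallest dtype that fits.
df["user_id"] = pd.to_numeric(df["user_id"], downcast="unsigned")
df["score"] = pd.to_numeric(df["score"], downcast="integer")

print(df.dtypes.to_dict())
print(df.memory_usage(deep=True).sum(), "bytes")
```

On wide tables with millions of rows, dropping unused columns at read time and shrinking dtypes can cut memory use substantially before any task logic runs.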

There was a post on the data engineering subreddit recently discussing how difficult it is to keep up with the data engineering world. Did you learn Hadoop? Great, we’re on Snowflake, BigQuery, and Databricks now. Just learned Airflow? Well, now we have Airflow 3.0. And the list goes on. But what doesn’t change, and what have the lessons been over the past decade? That’s what I’ll be covering in this talk: real lessons and realities that come up time and time again, whether you’re working for a start-up or a large enterprise.

Airflow has been used by many companies as a core part of their internal data platform. Would you be interested in finding out how Airflow could play a pivotal role in achieving data engineering excellence and efficiency using modern data architecture? The best practices, tools, and setup to achieve a stable yet cost-effective way of running small or big workloads: let’s find out! In this workshop we will review how an organisation can set up a data platform architecture around Airflow and the necessary requirements, covering:

- Airflow and its role in the data platform
- Different ways to organise the Airflow environment, enabling scalability and stability
- Useful open-source libraries and custom plugins allowing efficiency
- How to manage multi-tenancy and cost savings
- Challenges and factors to keep in mind, using the Success Matrix!

This workshop should be suitable for any architect, data engineer, or DevOps engineer aiming to build or enhance their internal data platform. By the end of this workshop you will have a solid understanding of the initial setup and of ways to optimise further, getting the most out of the tool for your own organisation.

How a Complete Beginner in Data Engineering / Junior Computer Science Student Became an Apache Airflow Committer in Just 5 Months, With 70+ PRs and 300 Hours of Contributions. This talk is aimed at those who are still hesitant about contributing to Apache Airflow. I hope to inspire and encourage anyone to take the first step and start their journey in open source. Let’s build together!

Maintaining consistency, code quality, and best practices for writing Airflow DAGs across teams and individual developers can be a significant challenge. Trying to achieve this through manual code reviews is both time-consuming and error-prone. To solve this at Next, we decided to build a custom, internally developed linting tool for Airflow DAGs to help us evaluate their quality and uniformity; we call it DAGLint. In this talk I am going to share why we chose to implement it, how we built it, and how we use it to elevate our code quality and standards throughout the entire data engineering group. This tool supports our day-to-day development process, provides us with a visual analysis of the state of our entire code base, and allows our code reviews to focus on other code-quality aspects. We can now easily identify deviations from our defined standards, promote consistency throughout our DAGs repository, and extend the tool with additional standards as they are introduced to our group. The talk will cover how you can implement a similar solution in your own organization; we have also published a blog post on it: https://medium.com/apache-airflow/mastering-airflow-dag-standardization-with-pythons-ast-a-deep-dive-into-linting-at-scale-1396771a9b90
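The general shape of an AST-based DAG linter can be sketched with Python’s standard-library `ast` module. This is not Next’s actual DAGLint implementation; the two rules checked here (heavy top-level imports, which slow DAG parsing, and `DAG(...)` calls without an explicit `dag_id`) are invented for illustration.

```python
import ast


def lint_dag_source(source: str) -> list[str]:
    """Return a list of findings for two illustrative lint rules."""
    findings = []
    tree = ast.parse(source)
    heavy = {"pandas", "numpy"}  # hypothetical "heavy import" rule set
    for node in ast.walk(tree):
        # Rule 1: flag top-level imports of heavy libraries.
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in heavy:
                    findings.append(f"heavy top-level import: {alias.name}")
        # Rule 2: flag DAG() calls missing an explicit dag_id.
        if isinstance(node, ast.Call):
            func = node.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", "")
            if name == "DAG":
                kwargs = {kw.arg for kw in node.keywords}
                if not node.args and "dag_id" not in kwargs:
                    findings.append("DAG() call without an explicit dag_id")
    return findings


example = """
import pandas
dag = DAG(schedule=None)
"""
findings = lint_dag_source(example)
print(findings)
```

Because the linter inspects source text statically, it can run in CI on every pull request without importing the DAGs or an Airflow environment at all.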

At the enterprise level, managing Airflow deployments across multiple teams can become complex, leading to bottlenecks and slowed development cycles. We will share our journey of decentralizing Airflow repositories to empower data engineering teams with multi-tenancy, clean folder structures, and streamlined DevOps processes. We dive into how restructuring our Airflow architecture and utilizing repository templates allowed teams to generate new data pipelines effortlessly. This approach enables engineers to focus on business logic without worrying about underlying Airflow configurations. By automating deployments and reducing manual errors through CI/CD pipelines, we minimized operational overhead. However, this transformation wasn’t without challenges. We’ll discuss obstacles we faced, such as maintaining consistency of code, variables, and utility functions across decentralized repositories; ensuring compliance in a multi-tenant environment; and managing the learning curve associated with new workflows. Join us to discover practical insights on how decentralizing Airflow repositories can boost team productivity and adapt to evolving business needs with minimal effort.
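A per-team deployment pipeline of the kind described might look like the following GitHub Actions sketch. The workflow name, file paths, test file, and sync target are assumptions for illustration, not the presenters’ actual configuration:

```yaml
# .github/workflows/deploy-dags.yml (hypothetical template repo workflow)
name: deploy-dags
on:
  push:
    branches: [main]
    paths: ["dags/**"]
jobs:
  validate-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Validate DAGs parse cleanly
        run: python -m pytest tests/test_dag_integrity.py
      - name: Sync DAGs to the Airflow bucket  # placeholder target
        run: aws s3 sync dags/ s3://example-airflow-bucket/dags/
```

Stamping a workflow like this into every repository generated from the template keeps deployment behavior uniform across tenants while each team owns only its own `dags/` folder.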

We will explore how Apache Airflow 3 unlocks new possibilities for smarter, more flexible DAG design. We’ll start by breaking down common anti-patterns in early DAG implementations, such as hardcoded operators, duplicated task logic, and rigid sequencing, that lead to brittle, unscalable workflows. From there, we’ll show how refactoring with the D.R.Y. (Don’t Repeat Yourself) principle, using techniques like task factories, parameterization, dynamic task mapping, and modular DAG construction, transforms these workflows into clean, reusable patterns. With Airflow 3, these strategies go further: enabling DAGs that are reusable across both batch pipelines and streaming/event-driven workloads, while also supporting ad-hoc runs for testing, one-off jobs, or backfills. The result is not just more concise code, but workflows that can flexibly serve different data processing modes without duplication. Attendees will leave with concrete patterns and best practices for building maintainable, production-grade DAGs that are scalable, observable, and aligned with modern data engineering standards.
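The task-factory idea above can be illustrated in plain Python, with no Airflow dependency; in a real DAG each returned callable would be wrapped with `@task` or an operator, and the config-driven loop mirrors what dynamic task mapping does at runtime. All names here are hypothetical.

```python
from typing import Callable


def make_load_task(source: str, table: str) -> Callable[[], str]:
    """Hypothetical task factory: returns a configured callable instead of
    duplicating near-identical task bodies for every source/table pair."""
    def load() -> str:
        # Real extract/validate/load logic would go here.
        return f"loaded {source} into {table}"
    load.__name__ = f"load_{source}_{table}"  # unique task name per instance
    return load


# One declarative table drives all task instances (D.R.Y.): adding a pipeline
# becomes a one-line config change rather than a copy-pasted task definition.
configs = [("crm", "customers"), ("erp", "orders"), ("web", "events")]
tasks = [make_load_task(src, tbl) for src, tbl in configs]
print([t() for t in tasks])
```

The same `configs` list could feed a batch DAG, an event-driven trigger, or an ad-hoc backfill run, which is the reuse across processing modes the session describes.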

This session explores how GitHub uses Apache Airflow for efficient data engineering. We will share nearly 9 years of experiences, including lessons learnt, mistakes made, and the ways we reduced our on-call and engineering burden. We’ll demonstrate how we keep data flowing smoothly while continuously evolving Airflow and other components of our data platform, ensuring safety and reliability. The session will touch on how we migrate Airflow between clouds without user impact. We’ll also cover how we cut down the time from idea to running a DAG in production, despite our Airflow repo being among the top 15 by number of PRs within GitHub. We’ll dive into specific techniques such as testing connections and operators, relying on dag-sync, providing short-lived development environments to let developers test their DAG runs, and creating reusable patterns for DAGs. By the end of this session, you will gain practical insights and actionable strategies to improve your own data engineering processes.

Metadata management is a cornerstone of effective data governance, yet it presents unique challenges distinct from traditional data engineering. At scale, efficiently extracting metadata from relational and NoSQL databases demands specialized solutions. To address this, our team has developed custom Airflow operators that scan and extract metadata across various database technologies, orchestrating 100+ production jobs to ensure continuous and reliable metadata collection. Now, we’re expanding beyond databases to tackle non-traditional data sources such as file repositories and message queues. This shift introduces new complexities, including processing structured and unstructured files, managing schema evolution in streaming data, and maintaining metadata consistency across heterogeneous sources. In this session, we’ll share our approach to building scalable metadata scanners, optimizing performance, and ensuring adaptability across diverse data environments. Attendees will gain insights into designing efficient metadata pipelines, overcoming common pitfalls, and leveraging Airflow to drive metadata governance at scale.
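The core of such a scanner, extracting column-level metadata from a relational source, can be sketched against SQLite’s catalog; real deployments would target each engine’s information schema through a custom operator, and the table created here is invented for the example.

```python
import sqlite3


def scan_table_metadata(conn: sqlite3.Connection) -> dict[str, list[tuple[str, str]]]:
    """Return {table: [(column, declared_type), ...]} for every user table."""
    metadata = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        metadata[table] = [(c[1], c[2]) for c in cols]
    return metadata


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, body TEXT)")
print(scan_table_metadata(conn))
```

Wrapping this per-engine query logic behind a common return shape is what lets one orchestration layer schedule 100+ heterogeneous scan jobs uniformly.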

A real-world journey of how my small team at Xena Intelligence built robust data pipelines for our enterprise customers using Airflow. If you’re a data engineer, or part of a small team, this talk is for you. Learn how we orchestrated a complex workflow to process millions of public reviews. What you’ll learn:

- Cost-efficient DAG design: decomposing complex processes into atomic tasks using TaskFlow, XComs, mapped tasks, and task groups, diving into one of our DAGs as a concrete example of how our approach optimizes parallelism, error handling, delivery speed, and reliability.
- Integrating LLM analysis: explore how we integrated LLM-based analysis into our pipeline, and learn how we designed the database, queries, and ingestion to Postgres.
- Extending the Airflow UI: we developed a custom Airflow UI plugin that filters and visualizes DAG runs by customer, product, and marketplace, delivering clear insights for faster troubleshooting.
- Leveraging the Airflow REST API: discover how we leveraged the API to trigger DAGs on demand, elevating the UX by tracking mapped DAG progress and computing ETAs.
- CI/CD and cost management: practical tips for deploying DAGs with CI/CD.
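Triggering a run on demand through Airflow’s stable REST API boils down to a POST to `/api/v1/dags/{dag_id}/dagRuns`. A minimal request builder is sketched below; the base URL, DAG id, and conf payload are illustrative, authentication headers are omitted, and this is not the team’s actual integration code.

```python
import json
from urllib import request


def build_trigger_request(base_url: str, dag_id: str, conf: dict) -> request.Request:
    """Build (but do not send) a POST to the stable REST API's dagRuns endpoint."""
    url = f"{base_url}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode()
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_trigger_request(
    "http://localhost:8080",  # placeholder Airflow webserver
    "process_reviews",        # hypothetical DAG id
    {"customer": "acme", "marketplace": "us"},
)
print(req.full_url)
```

Sending `req` with `urllib.request.urlopen` (plus an auth header appropriate to your deployment) returns the created DAG run, whose id can then be polled to track mapped-task progress.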

MWAA is an AWS-managed service that simplifies the deployment and maintenance of the open-source Apache Airflow data orchestration platform. MWAA has recently introduced several new features to enhance the experience for data engineering teams. Features such as the Graceful Worker Replacement strategy, which enables seamless MWAA environment updates with zero downtime, IPv6 support, and in-place minor Airflow version downgrades are some of the many improvements MWAA has brought to its users in 2025. Last but not least, the release of Airflow 3.0 support brings the latest open-source features, introducing a new web-server UI and better isolation and security for environments. These enhancements demonstrate Amazon’s continued investment in making Airflow more accessible and scalable for enterprises through the MWAA service.

As the demand for data products grows, data engineering teams face mounting pressure to deliver more and even faster, often becoming bottlenecks. Astro IDE changes the game. Astro IDE is an AI-powered code editor built for Apache Airflow. It helps data teams go from idea to production in minutes—generating production-ready DAGs, enabling in-browser testing, and integrating directly with Git. In this session, see how Astro IDE accelerates DAG creation, debugging, and deployment so data engineering teams can deliver more, 10x faster.

In the rapidly evolving field of data engineering and data science, efficiency and ease of use are crucial. Our innovative solution offers a user-friendly interface to manage and schedule custom PySpark, PySQL, Python, and SQL code, streamlining the process from development to production. Using Airflow at the backend, this tool eliminates the complexities of infrastructure management, version control, CI/CD processes, and workflow orchestration. The intuitive UI allows users to upload code, configure job parameters, and set schedules effortlessly, without the need for additional scripting or coding. Additionally, users have the flexibility to bring their own custom artifactory solution and run their code. In summary, our solution significantly enhances the orchestration and scheduling of custom code, breaking down traditional barriers and empowering organizations to maximize their data’s potential and drive innovation efficiently. Whether you are an individual data scientist or part of a large data engineering team, this tool provides the resources needed to streamline your workflow and achieve your goals faster than ever before.