talk-data.com

Topic: Apache Airflow

Tags: workflow_management, data_orchestration, etl

682 activities tagged

Activity Trend: peak of 157 activities per quarter, 2020-Q1 through 2026-Q1

Activities

682 activities · Newest first

Open Source doc edits provide a low-stakes way for new users to make a first contribution. Ideally, new users find opportunities and feel welcome to fix docs as they learn, engaging with the community from the start. But I found that contributing docs to Airflow had some surprising obstacles. In this talk, I’ll share my first docs contribution journey, including problems and fixes. For example, you must understand how Airflow uses Sphinx and know when to edit in the GitHub UI versus locally. It wasn’t documented that GitHub previews only Markdown, and since Sphinx uses its own markup you must build the docs locally to check formatting; that became an opportunity for me to add to the Contributor Guide for docs. In addition to examples of reducing obstacles, this talk covers the importance of docs for community and the resources available to start writing. If you already contribute and want to create opportunities for others, I’ll also share the characteristics of good first issues and docs projects.

High-scale orchestration of genomic algorithms using Airflow workflows, AWS Elastic Container Service (ECS), and Docker. Genomic algorithms are highly demanding of CPU, RAM, and storage, and our data science team requires a platform that facilitates the development and validation of proprietary algorithms. The data engineering team builds a research data platform that lets data scientists publish Docker images to AWS ECR and run them through Airflow DAGs that provision ECS compute on EC2 and Fargate. We will describe a research platform that allows our data science team to check their algorithms on ~1,000 cases in parallel, using the Airflow UI and dynamic DAG generation to utilize EC2 machines, auto-scaling groups, and ECS clusters across multiple AWS regions.
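A minimal sketch of what this fan-out pattern can look like with the Amazon provider’s EcsRunTaskOperator, assuming a Fargate-backed ECS cluster; the cluster, task definition, container name, and case IDs are placeholders, and networking options are omitted for brevity.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Placeholder list of genomic cases; a real platform would pull these from a registry.
CASE_IDS = [f"case-{i:04d}" for i in range(1000)]

with DAG(
    dag_id="genomic_batch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    for case_id in CASE_IDS:
        # One ECS task per case, all runnable in parallel (subject to pool/parallelism limits).
        EcsRunTaskOperator(
            task_id=f"run_{case_id}",
            cluster="research-ecs-cluster",         # placeholder cluster
            task_definition="genomic-algorithm:1",  # placeholder task definition
            launch_type="FARGATE",
            overrides={
                "containerOverrides": [
                    {"name": "algorithm", "command": ["--case-id", case_id]}
                ]
            },
        )
```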

Airflow DAGs are Python code (which can do pretty much anything you want), and Airflow has hundreds of configuration options (which can dramatically change its behavior). Those two facts allow endless combinations that can run the same workloads, but only a precious few are efficient. The rest result in failed tasks and excessive compute usage, costing time and money. This talk will demonstrate how small changes can yield big dividends, and reveal code improvements and Airflow configurations that can reduce costs and maximize performance.
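As one hypothetical illustration of the kind of small code change involved (not necessarily one covered in the talk): the scheduler re-parses DAG files continually, so expensive work at module level runs far more often than the task itself. The URL below is made up.

```python
from airflow.decorators import task

# Costly pattern: executed on every DAG-file parse, not just at task runtime.
# config = requests.get("https://internal.example.com/pipeline-config").json()


@task
def load_config():
    # Cheaper pattern: the import and the network call happen only when the task runs.
    import requests

    return requests.get("https://internal.example.com/pipeline-config").json()
```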

As a team that has built a Time-Series Data Lakehouse at Bloomberg, we looked for a workflow orchestration tool that could address our growing scheduling requirements. We needed a tool that was reliable and scalable, and that could alert on failures and delays so users could recover from them quickly. From using triggers instead of simple sensors to implementing custom SLA monitoring operators, we explore the choices we made in designing Airflow DAGs to create a reliable data delivery pipeline optimized for failure detection and remediation.
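A minimal sketch of the general pattern (placeholder names and timings, not Bloomberg’s implementation): a deferrable sensor hands the wait to the triggerer instead of occupying a worker slot, and an explicit check task fails loudly when a run is executing past its expected delivery window.

```python
from datetime import datetime, timedelta

import pendulum
from airflow.decorators import dag, task
from airflow.sensors.time_delta import TimeDeltaSensorAsync


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def delivery_pipeline():
    # Deferrable sensor: waits on the triggerer rather than holding a worker slot.
    wait = TimeDeltaSensorAsync(task_id="wait_for_upstream_window", delta=timedelta(hours=2))

    @task
    def check_delivery_deadline(**context):
        # Fail (and therefore alert) if this run is executing well past its window.
        deadline = context["data_interval_end"] + timedelta(hours=4)
        if pendulum.now("UTC") > deadline:
            raise RuntimeError("Data delivery missed its deadline; escalate to on-call.")

    wait >> check_delivery_deadline()


delivery_pipeline()
```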

The ability to create DAGs programmatically opens up new possibilities for collaboration between Data Science and Data Engineering. Engineering and DevOps are typically incentivized by stability, whereas Data Science is typically incentivized by fast iteration and experimentation. With Airflow, it becomes possible for engineers to create tools that allow Data Scientists and Analysts to build robust no-code/low-code data pipelines for feature stores. We will discuss Airflow as a means of bridging the gap between data infrastructure and modeling iteration, and examine how a Qbiz customer did just this by creating a tool that lets Data Scientists build features, train models, and measure performance in parallel using cloud services.
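A hypothetical sketch of the low-code bridge idea (not the Qbiz customer’s actual tool): analysts describe feature pipelines declaratively, and engineering-owned factory code turns each spec into a DAG. The spec format and task body are invented for illustration.

```python
from datetime import datetime

from airflow.decorators import dag, task

# Analyst-authored specs; in practice these might live in YAML files or a database.
FEATURE_SPECS = [
    {"name": "user_7d_spend", "source": "transactions", "window_days": 7},
    {"name": "user_30d_logins", "source": "events", "window_days": 30},
]


def build_feature_dag(spec):
    @dag(
        dag_id=f"feature_{spec['name']}",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    )
    def feature_pipeline():
        @task
        def compute_feature():
            # Placeholder for the engineering-provided computation, e.g. a
            # parameterized SQL or Spark job keyed on spec["source"].
            print(f"Computing {spec['name']} over {spec['window_days']} days")

        compute_feature()

    return feature_pipeline()


# Register one DAG per spec at parse time.
for spec in FEATURE_SPECS:
    globals()[f"feature_{spec['name']}"] = build_feature_dag(spec)
```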

Cloudera Data Engineering (CDE) is a serverless service for Cloudera Data Platform that allows you to submit various Spark jobs and Airflow DAGs to an auto-scaling cluster. Running your workloads as Python DAG files may be the usual way, but it is not the most convenient one for some users, as it requires a lot of background: DAG syntax, the Python programming language, the conventions of Airflow, and so on. The DAG Authoring UI is a tool built on top of Airflow APIs that lets you use a graphical user interface to create, manage, and destroy complex DAGs. It gives you the ability to perform tasks on Airflow without really having to know DAG structure, the Python programming language, or the internals of Airflow. CDE has identified multiple operators for performing various tasks on Airflow by carefully categorising the use cases. The operators range from BashOperator and PythonOperator to CDEJobRunOperator and CDWJobRunOperator, and most use cases can be run as combinations of the operators provided.

Are you tired of spending countless hours testing your data pipelines, only to find that they don’t work as expected? Do you wish there was a better way to manage your data versions and streamline your testing processes? If so, this presentation is for you! Join us as we explore the problem domain of testing environments for data pipelines and take a deep dive into the tools currently in use. We’ll introduce you to the game-changing concepts of data versioning and lakeFS and show you how to integrate these tools with Airflow to revolutionize your testing workflows. But don’t just take our word for it: witness this power firsthand with a live demo of testing an Airflow DAG on a snapshot of production data, putting the tools and techniques covered in the presentation into practice. So don’t miss out on this opportunity to supercharge your testing processes and take your data pipelines to the next level.
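A minimal sketch of the branch-test-discard pattern described above, assuming the lakeFS CLI (lakectl) is available on the workers; the repository name, paths, and exact lakectl flags are illustrative and should be checked against your lakeFS version.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="test_on_prod_snapshot",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Zero-copy branch of production data, named after the logical date.
    create_branch = BashOperator(
        task_id="create_branch",
        bash_command=(
            "lakectl branch create lakefs://analytics/test-{{ ds_nodash }} "
            "--source lakefs://analytics/main"
        ),
    )

    # The job under test reads and writes the branch prefix instead of main.
    run_transform = BashOperator(
        task_id="run_transform",
        bash_command="python transform.py --prefix s3://analytics/test-{{ ds_nodash }}/",
    )

    # Clean up whether or not the test run succeeded.
    delete_branch = BashOperator(
        task_id="delete_branch",
        bash_command="lakectl branch delete lakefs://analytics/test-{{ ds_nodash }} --yes",
        trigger_rule="all_done",
    )

    create_branch >> run_transform >> delete_branch
```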

Astronomer has hosted over 100 Airflow webinars designed to educate and inform the community on best practices, use cases, and new features. The goal of these events is to increase Airflow’s adoption and ensure everybody, from new users to experienced power users, can keep up with a project that is evolving faster than ever. When new releases come out every few months, it can be easy to get stuck on past versions of Airflow. Instead, we want existing users to know how new features can make their lives easier, new users to know that Airflow can support their use case, and everybody to know how to implement the features they need and get them to production. This talk will cover some of the key learnings we’ve gathered from 2.5 years of conducting webinars aimed at supporting the community in growing their Airflow use, including how best to tailor DevRel efforts to the many different types of Airflow users and how to effectively push for the adoption of new Airflow features.

For the DAG owner, testing Airflow DAGs can be complicated and tedious. kubectl cp your DAG from your local machine to a pod, exec into the pod, and run a command? Install Breeze? Why pull the Airflow image and start up the webserver, scheduler, and triggerer if all we want is to test the addition of a new task? It doesn’t have to be this hard. At Etsy, we’ve simplified DAG testing with dagtest, a Python package that we house on our internal PyPI. It is a small client binary that makes HTTP requests to a test API. The test API is a simple Flask server that receives these requests and builds pods to run airflow dags backfill commands based on the options provided via dagtest. The simplest of these is a dry run; typically, users run tests where the DAG executes end-to-end for a single ds. Equally important is the environment setup: we use an ad hoc Airflow instance in a separate GCP environment with a service account that cannot write to production buckets. This talk will discuss both.
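A hypothetical miniature of the described architecture (not Etsy’s actual dagtest): a small Flask API that accepts a dag_id plus options and invokes airflow dags backfill for a single date, optionally as a dry run. In the real system this would launch a pod in the ad hoc environment rather than a local subprocess.

```python
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/test", methods=["POST"])
def run_dag_test():
    body = request.get_json()
    cmd = [
        "airflow", "dags", "backfill",
        "--start-date", body["ds"],
        "--end-date", body["ds"],
    ]
    if body.get("dry_run", True):
        cmd.append("--dry-run")  # render each task instance without executing it
    cmd.append(body["dag_id"])
    result = subprocess.run(cmd, capture_output=True, text=True)
    return jsonify({"returncode": result.returncode, "output": result.stdout[-2000:]})


if __name__ == "__main__":
    app.run(port=8081)
```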

Today, all major cloud service providers and third-party providers include Apache Airflow as a managed service offering in their portfolios. While these cloud-based solutions help with the undifferentiated heavy lifting of environment management, some data teams also choose to operate self-managed Airflow instances to satisfy specific differentiated capabilities. In this session, we will talk about: why you might need to run self-managed Airflow; the available deployment options (with emphasis on Airflow on Kubernetes); how to deploy Airflow on Kubernetes using automation (Helm charts and Terraform); the developer experience (syncing DAGs using automation); the operator experience (observability); and owned responsibilities and tradeoffs. This will give you an end-to-end perspective on operating a highly available and scalable self-managed Airflow environment to meet your ever-growing workflow needs.

Data platform teams often find themselves in a situation where they have to provide Airflow as a service to downstream teams, as more users and use cases in their organization require an orchestrator. In these situations, giving each team its own Airflow environment can unlock velocity and actually be lower overhead to maintain than a monolithic environment. This talk covers things to keep in mind when building an Airflow service that supports several environments, personas of users, and use cases. Namely, we’ll discuss principles for balancing centralized control over the data platform with decentralized teams using Airflow in the way they need. This includes observability, developer productivity, security, and infrastructure. We’ll also talk about day-2 concerns around overhead, infrastructure maintenance, and other tradeoffs to consider.

As much as we love Airflow, local development has been a bit of a white whale through much of its history. Until recently, Airflow’s local development experience was hindered by the need to spin up a scheduler and webserver. In this talk, we will explore the latest innovation in Airflow local development, namely the “dag.test()” functionality introduced in Airflow 2.5. We will delve into practical applications of “dag.test()”, which empowers users to locally run and debug Airflow DAGs in a single Python process. This new functionality significantly improves the development experience, enabling faster iteration and deployment. In this presentation, we will discuss: how to leverage IDE support for code completion, linting, and debugging; techniques for inspecting and debugging DAG output; and best practices for unit testing DAGs and their underlying functions. This session is accessible to Airflow users of all levels, so join us as we explore the future of Airflow local development!
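A minimal sketch of the dag.test() workflow: the whole DAG runs in a single local Python process, so breakpoints, print statements, and your IDE’s debugger work as usual. The DAG itself is a toy example.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def example_local_dev():
    @task
    def extract():
        return [1, 2, 3]

    @task
    def summarize(values):
        print(f"sum = {sum(values)}")

    summarize(extract())


example_dag = example_local_dev()

if __name__ == "__main__":
    # Runs every task in-process; no scheduler, webserver, or triggerer required.
    example_dag.test()
```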

Reliability is a complex and important topic. I will focus on both the definition of reliability and best practices, beginning with a review of the Apache Airflow components that impact reliability. I will then examine those components, showing the single points of failure, mitigations, and tradeoffs. The journey starts with the scheduling process: I will focus on the aspects of Scheduler configuration that improve reliability, and since the Scheduler doesn’t run in a vacuum, I will also share my observations on the reliability of its infrastructure. We recommend that tasks be idempotent, but that is not always possible, so I will share the challenges of running users’ code in the distributed architecture of Cloud Composer. I will also cover the volatility of some cloud resources and mitigation methods in various scenarios. Deferrability plays an important part in reliability, but there are other elements we shouldn’t ignore either.

In this session, we’ll explore the inner workings of our warehouse allocation service and its many benefits. We’ll discuss how you can integrate these principles into your own workflow and provide real-world examples of how this technology has improved our operations. From reducing queue times to making smart decisions about warehouse costs, warehouse allocation has helped us streamline our operations and drive growth. With its seamless integration with Airflow, building an in-house warehouse allocation pipeline is simple and can easily fit into your existing workflow. Join us for this session to unlock the full potential of this in-house service and take your operations to the next level. Whether you’re a data engineer or a business owner, this technology can help you improve your bottom line and streamline your operations. Don’t miss out on this opportunity to learn more and optimize your workflow with the warehouse allocation service.

Airflow, traditionally used by Data Engineers, is now popular among Analytics Engineers who aim to provide analysts with high-quality tooling while adhering to software engineering best practices. dbt, an open-source project that uses SQL to create data transformation pipelines, is one such tool. One approach to orchestrating dbt using Airflow is using dynamic task mapping to automatically create a task for each sub-directory inside dbt’s staging, intermediate, and marts directories. This enables analysts to write SQL code that is automatically added as a dedicated task in Airflow at runtime. Combining this new Airflow feature with dbt best practices offers several benefits, such as analysts not needing to make Airflow changes and engineers being able to re-run subsets of dbt models should errors occur. In this talk, I would like to share some lessons I have learned while successfully implementing this approach for several clients.
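A hedged sketch of this approach (directory layout and paths are placeholders): a task discovers the dbt sub-directories at runtime, and dynamic task mapping expands one dbt run task per directory, so a newly added sub-directory shows up as its own Airflow task with no DAG change.

```python
from datetime import datetime
from pathlib import Path

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/opt/airflow/dbt"  # placeholder project location


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def dbt_by_directory():
    @task
    def list_dbt_commands():
        project = Path(DBT_PROJECT_DIR)
        subdirs = sorted(
            str(p.relative_to(project))  # e.g. "models/staging/stripe"
            for layer in ("staging", "intermediate", "marts")
            for p in (project / "models" / layer).glob("*")
            if p.is_dir()
        )
        return [f"cd {DBT_PROJECT_DIR} && dbt run --select {s}" for s in subdirs]

    # One mapped task instance per sub-directory, created at runtime.
    BashOperator.partial(task_id="dbt_run").expand(bash_command=list_dbt_commands())


dbt_by_directory()
```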

Airflow is a powerful tool for orchestrating complex data workflows, and it has undergone significant changes over the past two years. Since the Airflow release cycle has accelerated, you may struggle to keep up with the continuous flow of new features and improvements, which can lead to missed opportunities for addressing new use cases or solving your existing ones more efficiently. This presentation is intended to give you a solid update on the possibilities of Airflow and address misconceptions you may have heard or still believe, ones that used to be valid but no longer are. At the end of this session, you will be able to use essential features of Airflow, such as the Taskflow API, Datasets, and Dynamic Task Mapping, and you will know precisely what Airflow can and can’t do today. Fasten your seatbelt, take a deep breath, and let’s go 🚀
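A small illustration of two of the features mentioned above (the dataset URI and table names are made up): the TaskFlow API for defining tasks as plain Python functions, and Datasets for scheduling one DAG off another DAG’s output.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders_dataset = Dataset("s3://warehouse/orders/")  # placeholder URI


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def produce_orders():
    @task(outlets=[orders_dataset])
    def publish_orders():
        print("writing the daily orders partition")

    publish_orders()


# Runs whenever produce_orders updates the dataset, not on a cron schedule.
@dag(start_date=datetime(2024, 1, 1), schedule=[orders_dataset], catchup=False)
def build_orders_report():
    @task
    def build_report():
        print("orders updated; rebuilding the report")

    build_report()


produce_orders()
build_orders_report()
```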

Summary

Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers, it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.
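SQLMesh builds on SQLGlot (linked below) for parsing and transpiling SQL, which is what makes analyses like column-level lineage possible. This minimal example, in the style of SQLGlot’s documentation, only illustrates that foundation; it is not SQLMesh’s own API.

```python
import sqlglot

# Translate a DuckDB expression into Spark SQL using SQLGlot's transpiler.
print(sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="spark")[0])
```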

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey and today I'm interviewing Toby Mao about SQLMesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what SQLMesh is and the story behind it?

DataOps is a term that has been co-opted and overloaded. What are the concepts that you are trying to convey with that term in the context of SQLMesh?

What are the rough edges in existing toolchains/workflows that you are trying to address with SQLMesh?

How do those rough edges impact the productivity and effectiveness of teams using those tools?

Can you describe how SQLMesh is implemented?

How have the design and goals evolved since you first started working on it?

What are the lessons that you have learned from dbt which have informed the design and functionality of SQLMesh?
For teams who have already invested in dbt, what is the migration path from or integration with dbt?
You have some built-in integration with/awareness of orchestrators (currently Airflow). What are the benefits of making the transformation tool aware of the orchestrator?
What do you see as the potential benefits of integration with e.g. data-diff?
What are the second-order benefits of using a tool such as SQLMesh that addresses the more mechanical aspects of managing transformation workflows and the associated dependency chains?
What are the most interesting, innovative, or unexpected ways that you have seen SQLMesh used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLMesh?
When is SQLMesh the wrong choice?
What do you have planned for the future of SQLMesh?

Contact Info

tobymao on GitHub
@captaintobs on Twitter
Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

SQLMesh
Tobiko Data
SAS
AirBnB
Minerva
SQLGlot
Cron
AST == Abstract Syntax Tree
Pandas
Terraform
dbt

Podcast Episode

SQLFluff

Podcast.init Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra