talk-data.com

Topic

Airflow (Apache Airflow)

Tags: workflow_management, data_orchestration, etl

682 tagged activities

Activity Trend

Peak of 157 activities per quarter (2020-Q1 to 2026-Q1)

Activities

682 activities · Newest first

Airflow 2.2 introduced Deferrable Tasks (sometimes called “async operators”), a new mechanism to efficiently run tasks that depend on external activity. But when should you use them, how do they work, what do you need to do to make Operators support them, and what else could we do in Airflow with this model?
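
A minimal sketch of the mechanism, assuming Airflow 2.2+: the operator name and fixed delay below are invented for illustration, but `self.defer()` and `TimeDeltaTrigger` are the real deferral APIs.

```python
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class WaitForExternalJob(BaseOperator):
    """Hypothetical deferrable operator: waits without holding a worker slot."""

    def __init__(self, delay: timedelta = timedelta(minutes=10), **kwargs):
        super().__init__(**kwargs)
        self.delay = delay

    def execute(self, context):
        # Suspend this task and hand the wait over to the triggerer process;
        # the worker slot is freed while the trigger runs asynchronously.
        self.defer(
            trigger=TimeDeltaTrigger(self.delay),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Resumed in a fresh worker once the trigger fires.
        self.log.info("External activity finished; continuing.")
```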

The task logging subsystem is one of the most flexible, yet most complex and misunderstood, components of Airflow. In this talk, we will take a look at the various task log handlers that are part of the core Airflow distribution, dig a bit deeper into the interfaces they implement, and discuss how those interfaces can be used to roll your own logging implementation.
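
As a hedged taste of what rolling your own handler involves, here is a sketch that follows the conventions of the built-in remote handlers (subclass `FileTaskHandler`, hook `set_context` and `close`); the shipping step is left as a placeholder, and the exact `FileTaskHandler` signature has varied between Airflow versions.

```python
from airflow.utils.log.file_task_handler import FileTaskHandler


class ShippingTaskHandler(FileTaskHandler):
    """Hypothetical handler: logs to a local file, then 'ships' it on close."""

    def __init__(self, base_log_folder, filename_template):
        # Mirror the built-in remote handlers (S3, GCS, Elasticsearch, ...):
        # delegate local file handling to FileTaskHandler.
        super().__init__(base_log_folder, filename_template)
        self.closed = False

    def set_context(self, ti):
        # Called once per task instance so log paths can be computed from it.
        super().set_context(ti)

    def close(self):
        # Called when the task finishes; the built-in remote handlers upload
        # the local log file here, so a real implementation would ship it now.
        if self.closed:
            return
        super().close()
        self.closed = True
```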

Data lineage might seem like a complicated and unapproachable topic, but that’s only because data pipelines are complicated. The core concept is straightforward: trace and record the journey of datasets as they travel through a data pipeline. Marquez, a lineage metadata server, is a simple thing designed to watch complex things. It tracks the movement of data through complex pipelines using a straightforward, clear object model of Jobs, Datasets, and Runs. The information it gathers can be used to help you more effectively understand, communicate, and solve problems. The interactive UI allows you to see exactly where any inefficiencies have developed or datasets have become compromised. In this workshop, you will learn how to collect and visualize lineage from a basic Airflow pipeline using Marquez. You will need to understand the basics of Airflow, but no experience with lineage is required.
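
To give a flavor of that object model, here is a hedged sketch that emits a single lineage event to a local Marquez instance using the openlineage-python client; the namespace, job, and dataset names are invented, and the client API shown is the one from early releases.

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # Marquez API endpoint

event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="workshop", name="daily_sales_etl"),  # invented names
    producer="https://example.com/hypothetical-producer",
    inputs=[Dataset(namespace="workshop", name="raw.sales")],
    outputs=[Dataset(namespace="workshop", name="analytics.daily_sales")],
)
client.emit(event)  # Marquez records the Job, the Run, and both Datasets
```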

Automatic Speech Recognition is quite a compute-intensive task that depends on complex deep learning models. To do it at scale, we leveraged the power of TensorFlow, Kubernetes, and Airflow. In this session, you will learn about our journey to tackle this problem, the main challenges, and how Airflow made it possible to create a solution that is powerful, yet simple and flexible.

For a data engineer, backfilling data is an important part of day-to-day work. But backfilling interdependent DAGs is time-consuming and often an unpleasant experience. For example, let’s say you were tasked with backfilling a few months’ worth of data. You’re given the start and end date for the backfill, to be fed into an ad-hoc backfilling script that you have painstakingly crafted locally on your machine. As you sip your morning coffee, you kick off the script, hoping it’ll work, and think to yourself: there must be a better way. Yes, there is, and collecting DAG lineage metadata would be a great start! In this talk, Willy Lulciuc will briefly introduce you to how backfills are handled in Airflow, then discuss how DAG lineage metadata stored in Marquez can be used to automate backfilling DAGs with complex upstream and downstream dependencies.
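
As a toy sketch of the idea: once lineage metadata gives you DAG-level dependencies, backfills can be kicked off in topological order with Airflow's standard `airflow dags backfill` command. The dependency map below is hardcoded where Marquez would supply it.

```python
import subprocess
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical DAG-level dependencies, as they might be recovered from
# lineage metadata: each DAG maps to the set of DAGs it depends on.
lineage = {
    "raw_ingest": set(),
    "cleanse": {"raw_ingest"},
    "aggregate": {"cleanse"},
}

# Backfill upstream DAGs before their dependents.
for dag_id in TopologicalSorter(lineage).static_order():
    subprocess.run(
        ["airflow", "dags", "backfill",
         "--start-date", "2022-01-01",
         "--end-date", "2022-03-01",
         dag_id],
        check=True,
    )
```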

Testing is an important part of the DataOps life cycle, giving teams confidence in the integrity of their data as it moves downstream to production systems. But what happens when testing doesn’t catch all of your bad data and “unknown unknown” data quality issues fall through the cracks? Fortunately, data engineers can apply a thing or two from DevOps best practices to tackle data quality at scale with circuit breakers, a novel approach to stopping bad data from actually entering your pipelines in the first place. In this talk, Prateek Chawla, Founding Team Member and Technical Lead at Monte Carlo, will discuss what circuit breakers are, how to integrate them with your Airflow DAGs, and what this looks like in practice. Time permitting, Prateek will also walk through how to build and automate Airflow circuit breakers across multiple cascading pipelines with Python and other common tools.
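
One simple way to wire a breaker into a DAG is sketched below using Airflow's built-in `ShortCircuitOperator`; the quality check itself is a stand-in for a real warehouse query or observability API call, not Monte Carlo's actual integration.

```python
import pendulum

from airflow.decorators import dag
from airflow.operators.python import PythonOperator, ShortCircuitOperator


def data_quality_ok() -> bool:
    # Placeholder: a real breaker would query the warehouse or a data
    # observability API and return False when anomalies are detected.
    staged_row_count = 1_000
    return staged_row_count > 0


def load_to_prod():
    print("loading validated data downstream")


@dag(start_date=pendulum.datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def circuit_breaker_demo():
    gate = ShortCircuitOperator(task_id="quality_gate", python_callable=data_quality_ok)
    load = PythonOperator(task_id="load_to_prod", python_callable=load_to_prod)
    # If the gate returns False, downstream tasks are skipped: the circuit opens.
    gate >> load


circuit_breaker_demo()
```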

Organizations need to effectively manage large volumes of complex, business-critical workloads across multiple applications and platforms. Choosing the right workflow orchestration tool matters, as it can help teams automate the configuration, coordination, integration, and data management processes across many applications and systems. There are currently a lot of tools, both open-source and proprietary, for orchestrating tasks and data workflows with automation features, and each claims to deliver centralized, repeatable, reproducible, and efficient workflow coordination. Choosing among them is an arduous task, as it requires an in-depth understanding of how the capabilities these tools offer translate to your specific engineering needs. Apache Airflow is a powerful and widely used open-source workflow management system (WMS) designed to programmatically author, schedule, orchestrate, and monitor data pipelines and workflows. In this talk, learn how Apache Airflow compares with other popular orchestration tools in terms of architecture, scalability, management, observability, automation, native features, cost, available integrations, and more. Get a head-to-head comparison of what’s possible as we dissect the capabilities of each tool against the others. This comparative analysis will help you in your decision-making process, whether you are planning to migrate an existing system or evaluating your first enterprise orchestration platform.

Data within today’s organizations has become increasingly distributed and heterogeneous. It can’t be contained within a single brain, a single team, or a single platform…but it still needs to be comprehensible, especially when something unexpected happens. Data lineage can help by tracing the relationships between datasets and providing a cohesive graph that places them in context. OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow and Apache Spark. In this session, Michael Collado from Datakin will show how to trace data lineage and useful operational metadata in Apache Spark and Airflow pipelines, and talk about how OpenLineage fits in the context of data pipeline operations and provides insight into the larger data ecosystem.
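
On the Spark side, collection is typically enabled by attaching the OpenLineage listener to the session. A hedged sketch follows; the package version, host, and namespace are assumptions, and the `spark.openlineage.*` config keys have varied across openlineage-spark releases.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lineage_demo")
    # Pull in the OpenLineage integration and register its listener.
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.9.0")
    .config("spark.extraListeners",
            "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Where to send lineage events (e.g. a Marquez instance) and under
    # which namespace; both values here are assumptions.
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "spark_jobs")
    .getOrCreate()
)
```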

In this talk, we explain how Apache Airflow sits at the center of our Kubernetes-based Data Science Platform at PlayStation. We talk about how we built a flexible development environment for data scientists to interact with Apache Airflow, and explain the tools and processes we built to help them promote their DAGs from development to production. We will also talk about the impact of containerization, the usage of the KubernetesOperator and the new SparkKubernetesOperator, and the benefits of deploying Airflow on Kubernetes using the KubernetesExecutor across multiple environments.
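
For context, submitting a Spark job this way looks roughly like the sketch below, assuming the cncf.kubernetes provider and a `SparkApplication` manifest named spark-app.yaml; the DAG id and namespace are placeholders.

```python
import pendulum

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)

with DAG(
    dag_id="spark_on_k8s_demo",
    start_date=pendulum.datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
):
    # Submits a SparkApplication custom resource to the cluster; the
    # manifest file and namespace are placeholders.
    SparkKubernetesOperator(
        task_id="run_spark_job",
        namespace="data-science",
        application_file="spark-app.yaml",
        kubernetes_conn_id="kubernetes_default",
    )
```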

In Airflow 2.3, the ability to change the number of tasks dynamically at runtime opens up some exciting new ways of building DAGs and lets us create patterns that just weren’t possible before. In this session, I will cover a little bit about AIP-42 and the interface for Dynamic Task Mapping, and walk through some common use cases and patterns.
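
A minimal sketch of the interface, assuming Airflow 2.3+: the file list is hardcoded where a real DAG would query an external system, and one mapped task instance is created per returned element.

```python
import pendulum

from airflow.decorators import dag, task


@dag(start_date=pendulum.datetime(2022, 4, 1), schedule_interval=None, catchup=False)
def mapping_demo():
    @task
    def list_files():
        # Placeholder: a real DAG might list S3 keys or table partitions here.
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(path):
        print(f"processing {path}")

    # One `process` task instance is created per element, at runtime.
    process.expand(path=list_files())


mapping_demo()
```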

Sneak peek at the future of the Airflow UI. In Airflow 2.3, with the Tree -> Grid view changes, we began to swap out parts of the Flask app with React. This was one step towards AIP-38, building a fully modern UI for Airflow. Come check out what is in store after Grid view in the current UI, and discuss the possibilities of rethinking Airflow with a brand new UI down the line, such as: integrating all DAG visualizations into each other and removing constant page reloads; more live data; greater cross-DAG visualizations (i.e., the DAG Dependencies view from 2.1); improved user settings (dark mode, color-blind support, language, date format); and more!

The use of version control and continuous deployment in a data pipeline is one of the biggest features unlocked by the modern data stack. In this talk, I’ll demonstrate how to use Airbyte to pull data into your data warehouse, dbt to generate insights from your data, and Airflow to orchestrate every step of the pipeline. The complete project will be managed by version control and continuously deployed via GitHub. This talk will share how to achieve a more secure, scalable, and manageable workflow for your data projects.
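
A hedged sketch of what the orchestration layer can look like, using the real `AirbyteTriggerSyncOperator` from the Airbyte provider; the connection IDs and dbt project path are placeholders.

```python
import pendulum

from airflow.decorators import dag
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator


@dag(start_date=pendulum.datetime(2022, 1, 1), schedule_interval="@daily", catchup=False)
def elt_pipeline():
    extract_load = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",  # assumed Airflow connection
        connection_id="00000000-0000-0000-0000-000000000000",  # Airbyte sync UUID
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt",  # assumed project path
    )
    extract_load >> transform  # load raw data first, then build dbt models


elt_pipeline()
```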

This talk is a walk through a number of ways maintainers of open-source projects (for example Airflow) can improve communication with their users by exercising empathy. This subject is often overlooked in the curriculum of the average developer and contributor, but it is one that can make or break the product you develop, simply because good communication makes it more approachable for users. Maintainers often forget, or simply do not realize, how many assumptions they carry in their heads, and there are a number of techniques they can use to counter that. This talk will walk through a number of examples (from Airflow and other projects), the reasoning behind them, and ways communication between maintainers and users can be improved: in the code, in documentation, and in direct communication, but also by involving and engaging the users themselves. More often than not, users can be of great help with communication, if only they are asked. This talk is for both maintainers and users, as I consider communication between them a two-way street.

Nothing is perfect, but that doesn’t mean we shouldn’t seek perfection. After some time spent with Airflow system tests, we recognized numerous places where we could make significant improvements, and we decided to redesign them. The new design started with the establishment of goals. Tests need to: be easy to write, read, run, and maintain; be as close as possible to how Airflow runs in practice; be fast, reliable, and verifiable; and assure high quality of Airflow Operators. With these principles in mind, we prepared an Airflow Improvement Proposal (AIP-47), and once it was approved, we started the implementation. The results of our work were better than we expected when we started this initiative. This session will walk you through the story of how we struggled to run system tests before, how we came up with the improvements, and how we put them into a working solution.
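
The resulting shape, as implemented in the Airflow source tree, makes each system test a plain DAG file that pytest can also collect. A hedged sketch, with a trivial task standing in for a real operator under test:

```python
import pendulum

from airflow.decorators import dag, task


@dag(start_date=pendulum.datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def example_system_test():
    @task
    def exercise_operator():
        # A real system test would run an operator against live services
        # and clean up any resources it created afterwards.
        print("operator under test ran")

    exercise_operator()


dag_object = example_system_test()

# Inside the Airflow repo, AIP-47 tests expose the DAG to pytest like this:
# from tests.system.utils import get_test_run
# test_run = get_test_run(dag_object)
```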

Have data quality issues? What about reliability problems? You may be hearing a lot of these terms, along with many others, that describe issues you face with your data. What’s the difference, which are you suffering from, and how do you tackle both? Knowing that your Airflow DAGs are green is not enough. It’s time to focus on data reliability and quality measurements to build trust in your data platform. Join this session to learn how Databand’s proactive data observability platform makes it easy to achieve trusted data in your pipelines. In this session we’ll cover: the core differences between data reliability and data quality; what role the data platform team, analysts, and scientists play in guaranteeing data quality; and a hands-on demo of setting up Airflow reliability tracking.

Managing Airflow in large-scale environments is tough. You know this, and I know this. But what if you had a guide to make development, testing, and production lifecycles more manageable? In this presentation, I will share how we manage Airflow for large-scale environments with friendly deployments at every step. After attending the session, Airflow engineers will: understand the advantages of each kind of deployment; know the differences between Deployment and Airflow Executor; and learn how to incorporate all kinds of deployments into their day-to-day needs.

Needing to trigger DAGs based on external criteria is a common use case for data engineers, data scientists, and data analysts. Most Airflow users are probably aware of the concept of sensors and how they can be used to run your DAGs off of a standard schedule, but sensors are only one of multiple methods available to implement event-based DAGs. In this session, we’ll discuss different ways of implementing event-based DAGs using Airflow 2 features like the API and deferrable operators, with a focus on how to determine which method is the most efficient, scalable, and cost-friendly for your use case.
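
For example, the API route can be sketched with Airflow 2's stable REST API, where an external system POSTs a DAG run when its event fires; the URL, DAG id, and basic-auth credentials below are assumptions.

```python
import requests

# POST a new DAG run for `process_upload` (a hypothetical DAG) through the
# stable REST API; URL and basic-auth credentials are assumptions.
resp = requests.post(
    "http://localhost:8080/api/v1/dags/process_upload/dagRuns",
    json={"conf": {"object_key": "incoming/file.csv"}},  # handed to the DAG
    auth=("admin", "admin"),
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```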

We, the Data Engineering Team at WB Games, implemented internal Redshift loader DAGs on Airflow that allow us to ingest data into Redshift in near real-time at scale, taking into account variable load on the DB and quickly catching up data loads after DB outages or high-usage scenarios. Highlights: handles any type of Redshift outage and system delay dynamically between multiple sources (S3) and sinks (Redshift); auto-tunes data copies for faster backfill after delays without overwhelming the commit queue; supports schema evolution on game data dynamically; maintains data quality to ensure we do not create data gaps or duplicates; provides embedded custom metrics for deeper insights and anomaly detection; and uses a declarative, Airflow-config-based DAG implementation.
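
A single load step of this kind can be sketched with the Amazon provider's real `S3ToRedshiftOperator`; the bucket, prefix, schema, and table names below are placeholders, and the real loader adds the tuning and quality logic described above.

```python
import pendulum

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import (
    S3ToRedshiftOperator,
)

with DAG(
    dag_id="redshift_loader_demo",
    start_date=pendulum.datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
):
    # COPY one templated S3 partition into a Redshift table.
    S3ToRedshiftOperator(
        task_id="load_game_events",
        schema="analytics",
        table="game_events",
        s3_bucket="game-telemetry",
        s3_key="events/{{ ds }}/",  # templated partition prefix
        redshift_conn_id="redshift_default",
        copy_options=["FORMAT AS JSON 'auto'"],
    )
```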

Imagine if you could chain together SQL models using nothing but Python, and write functions that treat Snowflake tables like dataframes and dataframes like SQL tables. Imagine if you could write a SQL Airflow DAG using only Python, or without using any Python at all. With the Astro SDK, we at Astronomer have gone back to the drawing board on fundamental questions of what DAG writing could look like. Our goal is to empower data engineers, data scientists, and even business analysts to write Airflow DAGs with code that reflects the data movement, instead of the system configuration. Astro will allow each group to focus on producing value in their respective fields with minimal knowledge of Airflow and a high amount of flexibility between SQL- and Python-based systems. This is way beyond just a new way of writing DAGs: it is a universal, database-agnostic data transfer system. Users can run the exact same code against different databases (Snowflake, BigQuery, etc.) and datastores (GCS, S3, etc.) with no changes except to the connection IDs. Users will be able to promote a SQL flow from their dev Postgres to their prod Snowflake with a single variable change. We are ecstatic to reveal over eight months of work on a new open-source project that will significantly improve your DAG authoring experience!
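
A hedged sketch of that authoring style, assuming the astro-sdk-python package's `@aql.transform` and `@aql.dataframe` decorators as documented around its launch; the table and column names are invented.

```python
import pandas as pd

from astro import sql as aql
from astro.sql.table import Table


@aql.transform
def top_customers(orders: Table):
    # The returned SQL runs inside the warehouse; the table bound to the
    # `orders` parameter is injected into the {{ orders }} placeholder.
    return """
        SELECT customer_id, SUM(amount) AS total
        FROM {{ orders }}
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
    """


@aql.dataframe
def add_rank(df: pd.DataFrame):
    # The SQL result arrives as a pandas DataFrame, with no glue code.
    return df.assign(rank=range(1, len(df) + 1))
```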