talk-data.com

Topic: Airflow (Apache Airflow)

Tags: workflow_management, data_orchestration, etl

81 tagged

Activity Trend: peak 157 activities/quarter (2020-Q1 to 2026-Q1)

Activities

Showing results filtered by: Airflow Summit 2023

Apache Airflow is a popular workflow platform, but it often faces critiques that may not paint the whole picture. In this talk, we will unpack the critiques of Apache Airflow and provide a balanced analysis. We will highlight the areas where these critiques correctly point out Airflow’s weaknesses, debunk common myths, and showcase where competitors like Dagster and Prefect are excelling. By understanding the pros and cons of Apache Airflow, attendees will be better equipped to make informed decisions about whether Airflow is the right choice for their use cases. This talk will provide a comprehensive and objective assessment of Apache Airflow and its place in the workflow management ecosystem. Notes:
- What Critics Get Right About Airflow’s Weaknesses
- Debunking Myths and Misconceptions About Airflow
- Competitor Analysis
- Real-World Use Cases: When Airflow Shines
- Making Informed Decisions: Choosing the Right Workflow Platform

At King, data is fundamental in helping us deliver the best possible experiences for the players of our games while continually bringing them new, innovative and evolving gameplay features. Data has to be “always-on”: downtime and accuracy are treated with the same level of diligence as any of our games, and success is measured against internal SLAs. How is King using ‘data reliability engineering as code’ tools such as SodaCore within Airflow pipelines to detect, diagnose and inform about data issues, creating coverage, improving quality and accuracy, and helping eliminate data downtime?
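
A minimal sketch of the pattern, assuming the soda-core package and hypothetical configuration/check file names; this illustrates the general idea rather than King's actual pipelines:

    from airflow.decorators import dag, task
    import pendulum

    @dag(schedule="@daily", start_date=pendulum.datetime(2023, 1, 1), catchup=False)
    def data_reliability_checks():
        @task
        def soda_scan():
            from soda.scan import Scan  # provided by the soda-core package
            scan = Scan()
            scan.set_data_source_name("warehouse")                 # hypothetical data source
            scan.add_configuration_yaml_file("configuration.yml")  # hypothetical path
            scan.add_sodacl_yaml_file("checks.yml")                # hypothetical path
            exit_code = scan.execute()
            if exit_code != 0:  # non-zero means checks failed or errored
                raise ValueError(scan.get_logs_text())  # fail the task so on-call is informed

        soda_scan()

    data_reliability_checks()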

Discover PepsiCo’s dynamic data quality strategy in a multi-cloud landscape. Join me, the Director of Data Engineering, as I unveil our Airflow utilization, custom operator integration, and the power of Great Expectations. Learn how we’ve harmonized Data Mesh principles with our decentralized development process for seamless data integration. Explore our journey to maintain quality and enhance data as a strategic asset at PepsiCo.

Productive cross-team collaboration between data engineers and analysts is the goal of all data teams; however, fulfilling that mission can be challenging given the diverse set of skills that each group brings. In this talk we present an example of how one team tackled this topic by creating a flexible, dynamic and extensible framework using Airflow and cloud services that allowed engineers and analysts to jointly create data-centric micro-services to serve up projections and other robust analysis for use in the organization. The framework, which utilized dynamic DAG generation configured using yaml files, Kubernetes jobs and dbt transformations, abstracted away many of the details associated with workflow orchestration, allowing analysts to focus on their Python or R code and data processing logic while enabling data engineers to monitor the pipelines and ensure their scalability.
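
A minimal sketch of the dynamic-DAG-generation piece, assuming a hypothetical yaml schema (dag_id, schedule, and a list of named shell steps); the framework described above additionally wires in Kubernetes jobs and dbt:

    import glob
    import yaml
    import pendulum
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    for path in glob.glob("/opt/airflow/configs/*.yaml"):  # hypothetical config location
        with open(path) as f:
            cfg = yaml.safe_load(f)

        with DAG(
            dag_id=cfg["dag_id"],
            schedule=cfg.get("schedule", "@daily"),
            start_date=pendulum.datetime(2023, 1, 1),
            catchup=False,
        ) as dag:
            # Analysts declare steps (e.g. an R script or dbt command) in yaml;
            # the framework turns each one into a task, chained in order.
            prev = None
            for step in cfg["steps"]:
                t = BashOperator(task_id=step["name"], bash_command=step["command"])
                if prev:
                    prev >> t
                prev = t

        # Register the generated DAG at module level so Airflow discovers it.
        globals()[cfg["dag_id"]] = dag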

Kiwi.com started using Airflow in June 2016 as an orchestrator for a handful of people in the company. Demand for the tool grew until the monolithic instance was used by 30+ teams with 500+ active DAGs, successfully finishing 3.5 million tasks per month. That monolithic Airflow environment served us at first, but our needs quickly changed as we wanted to support a data mesh architecture within kiwi.com. By leveraging Astronomer on GCP, we were able to move from a monolithic Airflow environment to many smaller instances of Airflow. This talk will go into how to handle things like DAG dependencies, observability, and stakeholder management. Furthermore, we’ll talk about security, particularly how GCP’s workload identity helped us achieve a passwordless Airflow experience.

Have you ever added a DAG file and had no clue what happened to it? You’re not alone! With default settings, Airflow can wait up to 5 minutes before processing new DAG files. In this talk, I’ll discuss the implementation of an event-based DAG parser that immediately processes changes in the DAGs folder. As a result, changes are reflected immediately in the Airflow UI. This talk will cover:
- A demonstration of the event-based DAG parser and the fast Airflow UI experience
- How the current DAG parser implementation and configuration works
- How an event-based DAG parser is implemented
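
The core idea can be sketched with the third-party watchdog library; this illustrates event-based file watching in general, not the speaker's actual implementation:

    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class DagFileHandler(FileSystemEventHandler):
        def on_any_event(self, event):
            if event.src_path.endswith(".py"):
                # A real parser would enqueue this file for immediate parsing
                # and serialization to the metadata database.
                print(f"re-parse {event.src_path}")

    observer = Observer()
    observer.schedule(DagFileHandler(), "/opt/airflow/dags", recursive=True)
    observer.start()  # filesystem events now trigger parsing instead of a polling loop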

As Airflow users, we often use DagRun.conf attributes to control the content and flow of a DAG run. Previously, the Airflow UI only allowed triggering a DAG run by entering raw JSON. This was technically feasible but not user friendly: a user needed to model, check and understand the JSON and enter parameters manually, with no option to validate before triggering. Much like Jenkins or GitHub/Azure pipelines, we want a UI option to trigger a run by specifying parameters in a form. With Airflow 2.6.0, DAG.params are now used to render an entry form, and with a few options a user-friendly trigger UI can be implemented. This session shows how the new feature works and provides examples of how to use it for your own purposes.
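
For example, a DAG whose typed params render as a validated entry form in Airflow 2.6's "Trigger DAG w/ config" UI (the field names here are arbitrary):

    import pendulum
    from airflow import DAG
    from airflow.models.param import Param
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="trigger_form_demo",
        schedule=None,  # manually triggered
        start_date=pendulum.datetime(2023, 1, 1),
        params={
            # Each Param becomes a form field, validated before the run starts.
            "environment": Param("dev", type="string", enum=["dev", "staging", "prod"]),
            "batch_size": Param(100, type="integer", minimum=1, maximum=1000),
        },
    ):
        BashOperator(
            task_id="echo_params",
            bash_command="echo {{ params.environment }} {{ params.batch_size }}",
        )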

As a data engineer, I’ve used Airflow extensively over the last 5 years: across 3 jobs, several different roles; for side projects, for critical infrastructure; for manually triggered jobs, for automated workflows; for IT (Ookla/Speedtest.net), for science (Allen Institute for Cell Science), for the commons (Openverse), for liberation (Orca Collective). Authoring a DAG has changed dramatically since 2018, thanks to improvements in both Airflow and the Python language. In this session, we’ll take a trip back in time to see how DAGs looked several years ago, and what the same DAGs might look like now. We’ll appreciate the many improvements that have been made towards simplifying workflow construction. I’ll also discuss the significant advancements that have been made around deploying Airflow. Lastly, I’ll give a brief overview of different use cases and ways I’ve seen Airflow leveraged.
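
As a taste of that evolution, here is roughly the same two-task pipeline in the old operator style versus the TaskFlow API introduced in Airflow 2.0 (simplified for illustration):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Then: explicit PythonOperators, data passed via manual XCom pulls.
    def extract():
        return [1, 2, 3]

    def load(ti):
        print(sum(ti.xcom_pull(task_ids="extract")))

    with DAG("etl_2018_style", start_date=datetime(2023, 1, 1), schedule=None):
        PythonOperator(task_id="extract", python_callable=extract) >> \
            PythonOperator(task_id="load", python_callable=load)

    # Now: TaskFlow API - plain functions, implicit XCom, dependencies
    # inferred from the call graph.
    from airflow.decorators import dag, task

    @dag(start_date=datetime(2023, 1, 1), schedule=None)
    def etl_taskflow():
        @task
        def extract_data():
            return [1, 2, 3]

        @task
        def load_data(data):
            print(sum(data))

        load_data(extract_data())

    etl_taskflow()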

How to submit an issue for the community to fix. To ensure a quality product, the Airflow community relies on bug reports from Airflow users. Often, bug reports are incomplete or fail to include steps to reproduce the observed bug. This workshop will present an example of the bug-to-issue process: how to rule out non-Airflow issues and, once an Airflow issue is suspected, how to submit an issue for the community to see. It will also cover how the community picks up an issue and eventually fixes it in a future release. Issue reporting is key to improving Airflow, and the community will benefit from an easily digestible guide on how best to do so.

Are you tired of spending hours on Airflow migrations and wondering how to make them more accessible? Would you like to be able to test your code on different Airflow versions? Or are you struggling to set up a reliable local development environment? These are some of the top pain points for data engineers working with Airflow. But fear not: Wix Data Engineering has some best practices to share that will make your life easier. What the audience will learn:
- How Wix Data Engineering makes Airflow migrations easier and less painful
- How to ensure data engineers’ code is forward-compatible with the latest Airflow version
- How to test code on different Airflow versions
- How to maintain a stable local environment for data engineers while speeding up their dev velocity
- More must-know best practices from the framework team
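
One widely used building block for this (a generic pattern, not necessarily Wix's framework) is a DAG integrity test that can be run unchanged inside environments pinned to different Airflow versions:

    from airflow.models import DagBag

    def test_dags_import_cleanly():
        # Re-run this same test under each Airflow version you target;
        # import errors surface forward-compatibility breaks early.
        dag_bag = DagBag(dag_folder="dags/", include_examples=False)
        assert dag_bag.import_errors == {}, dag_bag.import_errors
        assert len(dag_bag.dags) > 0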

We are continuing to modernize the Airflow UI to make it easier to manage all aspects of your DAGs. See a demo of the latest updates and improve your workflows with new tips and tricks, then get a preview of what else is coming soon. We’ll follow up with a Q&A where people can raise their own use cases and explore new ideas for improving the user experience.

New to Airflow or haven’t followed any of the recent DAG authoring enhancements? This talk is for you! We will go through various DAG authoring features like Setup/Teardown tasks (~2.7), Datasets (2.4), Dynamic Tasks (2.3) and Async tasks (2.2). You won’t be an expert after this short talk; however, you’ll have a head start when you write your next DAG, no hacks required.
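
As a small taste, dynamic task mapping (2.3) lets the number of task instances be decided at runtime; a minimal sketch:

    import pendulum
    from airflow.decorators import dag, task

    @dag(start_date=pendulum.datetime(2023, 1, 1), schedule=None)
    def dynamic_mapping_demo():
        @task
        def list_files():
            return ["a.csv", "b.csv", "c.csv"]  # hypothetical inputs

        @task
        def process(file):
            print(f"processing {file}")

        # One mapped task instance per file, expanded at runtime.
        process.expand(file=list_files())

    dynamic_mapping_demo()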

Data contracts have been much discussed in the community of late, with a lot of curiosity around how to approach this concept in practice. We believe data contracts need a harmonizing layer to manage data quality in a uniform manner across a fragmented stack. We are calling this harmonizing layer the Control Plane for Data - powered by the common thread across these systems: metadata. For teams already orchestrating pipelines with Airflow, data contracts can be an effective way to process data that meets preset quality standards. With a control plane as a connecting layer, producers can build data contracts that consumers can rely on, ensuring DAGs only run when a contract is valid. Producers can govern how workflows should behave, and consumers receive the tooling they need to opt into only high-quality data. Learn how to use data contracts and DataHub to make your Airflow pipelines more reliable - as well as other use cases that can help build a simpler, more flexible data stack.
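
One way to express "DAGs only run when a contract is valid" in plain Airflow terms (an illustrative pattern; the talk's DataHub integration is richer than this):

    import pendulum
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.operators.python import ShortCircuitOperator

    def contract_is_valid() -> bool:
        # Stand-in for a real lookup against the metadata layer (e.g. DataHub):
        # return True only while the producer's contract currently validates.
        return True

    with DAG("contract_gated_pipeline", start_date=pendulum.datetime(2023, 1, 1),
             schedule="@daily", catchup=False):
        gate = ShortCircuitOperator(task_id="check_contract",
                                    python_callable=contract_is_valid)
        process = EmptyOperator(task_id="process_orders")
        gate >> process  # downstream is skipped whenever the check returns False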

New users starting with Airflow frequently encounter several challenges, ranging from the complexities of containers and virtual environments to Python dependency hell. Moreover, their familiarity with tools such as Docker, docker-compose, and Helm might be limited, and those tools can even be overkill. Seasoned Airflow users encounter their own problems, including configuration conflicts with ongoing Airflow projects, intricacies stemming from Docker and docker-compose configurations, and a lack of visibility into all their projects. With airflowctl, users can install and set up Airflow using a single command. Existing users can use it to manage multiple Airflow projects with different Airflow versions on the same machine, which allows creating and debugging DAGs in an IDE seamlessly. Agenda:
- Why airflowctl? Goal
- Current functionality & demo
- Vision / roadmap
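
Illustrative usage; the command shapes below are assumptions based on the project's README, and flags may differ between versions (check airflowctl --help):

    airflowctl init my_project --airflow-version 2.7.0   # scaffold a project (flag assumed)
    airflowctl start my_project                          # run Airflow for that project in its own virtualenv
    airflowctl list                                      # see every project managed on this machine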

Airflow is a popular choice for organizations looking to integrate open-source dbt within their existing data infrastructure. This talk will explore two primary methods of running dbt in Airflow: job-level and model-level. We’ll discuss the tradeoffs associated with each approach, highlighting the simplicity and efficiency of job-level orchestration, contrasted with the enhanced observability and control provided by model-level orchestration. We’ll also explain how the balance has shifted in recent years, with improvements to dbt core making model-level more efficient and innovative Airflow extensions like Cosmos making it easier to implement. Finally, we’ll provide benchmarks to help you determine which paradigm is the best fit for your needs.
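
In miniature, the two paradigms look like this; the job-level task is plain Airflow, while the Cosmos import paths are abbreviated assumptions (see the Cosmos docs for the full API):

    import pendulum
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG("dbt_job_level", start_date=pendulum.datetime(2023, 1, 1),
             schedule="@daily", catchup=False):
        # Job-level: one opaque task runs the entire dbt project; simple, but a
        # failure retries the whole job and the UI shows a single node.
        BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --project-dir /opt/dbt/jaffle_shop",  # hypothetical path
        )

    # Model-level: Cosmos expands the project into one task per model, giving
    # per-model retries and observability (import paths are assumptions):
    # from cosmos import DbtDag, ProjectConfig, ProfileConfig
    # dag = DbtDag(project_config=ProjectConfig("/opt/dbt/jaffle_shop"), ...)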

Apache Airflow has over 650 Python dependencies. In case you did not know already, dependencies in Python are a difficult subject, and Airflow has its own, custom ways of managing them. Airflow has a rather complex system to manage dependencies in its CI system, but this talk is not about that. This talk is directed at users of Airflow who want to keep their dependencies updated, describing the ways they can do it. This presentation will explain how to effectively manage and handle custom dependencies in Airflow. Jarek will guide you through practical solutions and best practices to make your Airflow experience with dependencies - yes, you guessed it - a breeze.
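
For reference, the mechanism the Airflow project itself recommends is the per-release constraint files, which pin every transitive dependency to a tested set (substitute your own Airflow and Python versions):

    pip install "apache-airflow==2.7.1" \
      --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.1/constraints-3.10.txt"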

Introduced in Airflow 2.4, Datasets are a foundational feature for authoring modular data pipelines. As DAGs grow to cover a larger number of data sources and multiple data transformation steps, they typically become less predictable in their timeliness of execution and less efficient. This talk focuses on leveraging Datasets to enable predictable and more efficient DAGs by borrowing patterns from microservice architectures. Just as large monolithic applications were decomposed into microservices to deliver more efficient scalability and faster development cycles, micropipelines have the same potential to radically transform data pipeline efficiency and development velocity. Using a simple financial analysis pipeline example, with data aggregation done in Snowflake and prediction analysis in Spark, this talk outlines how to retain the timeliness of data pipelines while expanding datasets.
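
A sketch of that decomposition: the aggregation DAG updates a Dataset, and the prediction DAG is scheduled on the Dataset rather than on a clock (the URI and task bodies are hypothetical placeholders):

    import pendulum
    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    aggregates = Dataset("snowflake://analytics/daily_aggregates")  # hypothetical URI

    @dag(schedule="@daily", start_date=pendulum.datetime(2023, 1, 1), catchup=False)
    def aggregate_in_snowflake():
        @task(outlets=[aggregates])  # marks the Dataset as updated on success
        def aggregate():
            ...  # run the Snowflake aggregation here
        aggregate()

    @dag(schedule=[aggregates], start_date=pendulum.datetime(2023, 1, 1), catchup=False)
    def predict_in_spark():
        @task
        def predict():
            ...  # submit the Spark prediction job here
        predict()

    aggregate_in_snowflake()
    predict_in_spark()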