Apache Airflow is the backbone of countless data pipelines, but optimizing performance and resource utilization can be a challenge. This talk introduces a novel performance testing framework designed to measure, monitor, and improve the efficiency of Airflow deployments. I’ll delve into the framework’s modular architecture, showcasing how it can be tailored to various Airflow setups (Docker, Kubernetes, cloud providers). By measuring key metrics across schedulers, workers, triggers, and databases, this framework provides actionable insights to identify bottlenecks and compare performance across different versions or configurations. Attendees will learn: the motivation behind developing a standardized performance testing approach; key design considerations and challenges in measuring performance across diverse Airflow environments; how to leverage the framework to construct test suites for different use cases (e.g., version comparison); practical tips for interpreting performance test results and making informed decisions about resource allocation; and how this framework contributes to greater transparency in Airflow release notes, empowering users with performance data.
talk-data.com
Speaker
Bartosz Jankiewicz
3 talks
Talks & appearances
3 activities · Newest first
Reliability is a complex and important topic. I will cover both the definition of reliability and best practices for achieving it. I will begin by reviewing the Apache Airflow components that impact reliability, then examine each of them, showing the single points of failure, mitigations, and tradeoffs. The journey starts with the scheduling process. I will focus on the aspects of Scheduler configuration that improve reliability. The Scheduler doesn’t run in a vacuum, so I’ll also share my observations on the reliability of the infrastructure it runs on. We recommend that tasks be idempotent, but that is not always possible. I will share the challenges of running users’ code in the distributed architecture of Cloud Composer. I will refer to the volatility of some cloud resources and to mitigation methods in various scenarios. Deferrability plays an important part in reliability, but there are also other elements we shouldn’t ignore.
This workshop is sold out. A hands-on workshop showing how easy it is to deploy Airflow in a public cloud. The workshop consists of 3 parts: setting up an Airflow environment and CI/CD for DAG deployment; authoring a DAG; and troubleshooting Airflow DAG/task execution failures. The workshop will be based on Cloud Composer (https://cloud.google.com/composer). It is mostly targeted at Airflow newbies and users who would like to learn more about Cloud Composer and how to develop DAGs using Google Cloud Platform services such as BigQuery, Vertex AI, and Dataflow.