“More data lineage” has been the second most popular feature request in the Airflow Survey 2023. However, despite the integration of OpenLineage into Airflow 2.7 through AIP-53, the most popular operator in Airflow, the PythonOperator, isn’t covered by lineage support. With the addition of the TaskFlow API, Airflow Datasets, Airflow ObjectStore, and many other small changes, writing DAGs without using other operators is easier than ever. That’s why lineage collection in Airflow is moving beyond specific operators to cover Hooks and Object Storage. In this session, you’ll learn how the newly added AIP-62 will allow you to author DAGs the way you love, while keeping your data pipelines well covered by lineage.
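For illustration, a minimal sketch (not from the talk) of the authoring style this refers to: a TaskFlow DAG whose I/O goes through Airflow's ObjectStoragePath, the kind of hook- and storage-level activity that AIP-62 aims to capture as lineage. The bucket paths and connection ID are placeholders.

```python
# Minimal sketch, not the talk's code: a TaskFlow DAG doing I/O through
# Airflow's ObjectStoragePath. Bucket paths and the connection ID are placeholders.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.io.path import ObjectStoragePath

RAW = ObjectStoragePath("s3://example-bucket/raw/", conn_id="aws_default")
CLEAN = ObjectStoragePath("s3://example-bucket/clean/", conn_id="aws_default")

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def object_storage_lineage_sketch():
    @task
    def copy_report() -> str:
        src = RAW / "report.csv"
        dst = CLEAN / "report.csv"
        dst.write_bytes(src.read_bytes())  # storage-level I/O that lineage can observe
        return str(dst)

    copy_report()

object_storage_lineage_sketch()
```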
Airflow, an open-source platform for orchestrating complex data workflows, is widely adopted for its flexibility and scalability. However, as workflows grow in complexity and scale, optimizing Airflow performance becomes crucial for efficient execution and resource utilization. This session delves into the importance of optimizing Airflow performance and provides strategies, techniques, and best practices to enhance workflow execution speed, reduce resource consumption, and improve system efficiency. Attendees will gain insights into identifying performance bottlenecks, fine-tuning workflow configurations, leveraging advanced features, and implementing optimization strategies to maximize pipeline throughput. Whether you’re a seasoned Airflow user or just getting started, this session equips you with the knowledge and tools needed to tune your Airflow deployments for performance and scalability. We’ll also explore topics such as DAG writing best practices, monitoring and updating Airflow configurations, and database performance optimization, covering unused indexes, missing indexes, and minimizing table and index bloat.
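As one small example of the DAG-writing best practices in scope here (an illustrative sketch, not material from the session): keeping expensive work out of the module top level, since the scheduler re-parses DAG files frequently.

```python
# Illustrative sketch of one DAG-writing best practice: keep the DAG file's top
# level cheap, and defer heavy imports/work to task runtime.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def cheap_to_parse():
    @task
    def transform() -> int:
        import pandas as pd  # heavy import happens when the task runs, not at parse time
        return int(pd.DataFrame({"x": [1, 2, 3]})["x"].sum())

    transform()

cheap_to_parse()
```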
Airflow is widely used within Robinhood. In addition to traditional offline analytics use cases (to schedule ingestion and analytics workloads that populate our data lake), we also use Airflow in our backend services to orchestrate various workflows that are highly critical for the business, e.g., compliance and regulatory reporting, user-facing reports, and more. As part of this, we have evolved what we believe is a unique deployment architecture for Airflow. We have central schedulers that are responsible for workloads from multiple different teams, but the workflow tasks themselves run on workers owned by the respective teams, tightly coupled with their backend services and codebases. Furthermore, we have augmented Airflow with a number of customizations: an Airflow worker template for Kubernetes, enhanced observability, enhanced SLA detection, and a collection of operators, sensors, and plugins that tailor Airflow to our exact needs. This session is going to walk through how we grew our architecture and adapted Airflow to fit Robinhood’s variety of needs and use cases.
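The stock Airflow building block behind this "central scheduler, team-owned workers" pattern (a sketch with assumed queue and DAG names, not Robinhood's internal tooling) is routing tasks to queues that only a given team's workers consume:

```python
# Sketch of the stock mechanism: pin tasks to a queue that only one team's Celery
# workers subscribe to (e.g. started with `airflow celery worker --queues team-compliance`).
# DAG id, command, and queue name are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("team_routed_dag", start_date=datetime(2024, 1, 1), schedule=None, catchup=False):
    BashOperator(
        task_id="compliance_report",
        bash_command="python generate_report.py",
        queue="team-compliance",  # picked up only by workers listening on this queue
    )
```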
During this talk, we will give an overview of different orchestration approaches (Kubeflow, Ray, Airflow, etc.) for running ML workloads on Kubernetes, focusing specifically on how to use the Kubernetes Batch API and Kubernetes Operators to run complex ML workloads.
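To give a flavor of the Batch API part (a minimal sketch with placeholder image, namespace, and command; not the speakers' code):

```python
# Minimal sketch: submitting an ML training step as a Kubernetes Job through the
# Batch API, using the official Python client. Image, namespace, and command are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-model"),
    spec=client.V1JobSpec(
        backoff_limit=2,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/trainer:latest",
                        command=["python", "train.py"],
                    )
                ],
            )
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="ml-workloads", body=job)
```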
DAG authors, while constructing DAGs, generally use native libraries provided by Airflow in conjunction with Python libraries available from public PyPI repositories. But sometimes, DAG authors need to construct DAGs using libraries that are either in-house or not available from public PyPI repositories. This poses a serious challenge for users who want to run their custom code with Airflow DAGs, particularly when Airflow is deployed in a cloud-native fashion. Traditionally, these packages are baked into Airflow Docker images. That doesn’t work after deployment and is impractical if your library is still under development. We propose a solution that creates a dedicated Airflow global Python environment, dynamically generates the requirements, establishes a version-compatible pyenv adhering to Airflow’s policies, and manages custom pip repository authentication seamlessly. Importantly, the service executes these steps in a fail-safe manner, without compromising core components. Join us as we discuss the solution to this common problem, touch upon its design, and see it in action. We also candidly discuss some challenges and shortcomings of the proposed solution.
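For context, a simpler built-in workaround for in-house packages (not the global-environment service described above) is a per-task virtualenv pointed at a private index; this sketch assumes the requirements list is written to a pip requirements file, which accepts index options as lines. The index URL and package name are placeholders.

```python
# Sketch: per-task virtualenv resolving an in-house package from a private index.
# Index URL, package name, and DAG id are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def use_internal_lib():
    import internal_lib  # hypothetical in-house package, installed from the private index
    internal_lib.run()

with DAG("private_index_example", start_date=datetime(2024, 1, 1), schedule=None, catchup=False):
    PythonVirtualenvOperator(
        task_id="run_with_internal_lib",
        python_callable=use_internal_lib,
        requirements=[
            "--index-url https://pypi.internal.example.com/simple",
            "internal-lib==1.2.3",
        ],
        system_site_packages=False,
    )
```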
Profiling Airflow tasks can be difficult, especially in remote environments. In this talk, I will demonstrate how we can leverage the capabilities of Airflow’s plugin mechanism to selectively run Airflow tasks within the context of a profiler and, with the help of operator links and custom views, make the results available to the user. The content of this talk can provide inspiration for how Airflow may, in the future, allow the gathering of custom task metrics and make those metrics easily accessible.
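A minimal sketch of the underlying idea, without the plugin, operator-link, or custom-view wiring the talk covers (the stats path and function names are illustrative):

```python
# Sketch: run a task callable under cProfile and dump the stats to a file that a
# custom view could later expose. Stats path and task body are illustrative.
import cProfile
import functools

def profiled(task_callable, stats_path="/tmp/airflow_task.prof"):
    @functools.wraps(task_callable)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        try:
            return profiler.runcall(task_callable, *args, **kwargs)
        finally:
            profiler.dump_stats(stats_path)  # inspect later with pstats or snakeviz
    return wrapper

@profiled
def heavy_task():
    return sum(i * i for i in range(1_000_000))
```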
Feeling trapped in a maze of duplicate Airflow DAG code? We were too! That’s why we embarked on a journey to build a centralized library, eliminating redundancy and unlocking delightful efficiency. Join us as we share:
- The struggles of managing repetitive code across DAGs
- Our approach to a centralized library, revealing design and implementation strategies
- The amazing results: reduced development time, clean code, effortless maintenance, and a framework that creates efficient and self-documenting DAGs
Let’s break free from complexity and duplication, and build a brighter Airflow future together!
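An illustrative sketch of what such a centralized library can look like (placeholder names, not the library from the talk): a shared factory that stamps out consistent DAGs from a few parameters.

```python
# Sketch of the centralized-library idea: a shared factory producing consistent,
# self-documenting DAGs. DAG ids, sources, and commands are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

def build_ingest_dag(source: str, schedule: str) -> DAG:
    with DAG(
        dag_id=f"ingest_{source}",
        start_date=datetime(2024, 1, 1),
        schedule=schedule,
        catchup=False,
        doc_md=f"Auto-generated ingest DAG for `{source}`.",
    ) as dag:
        BashOperator(task_id="extract", bash_command=f"python extract.py --source {source}")
    return dag

# Teams register new pipelines with one line instead of copy-pasting a DAG file.
for source in ("orders", "payments"):
    globals()[f"ingest_{source}"] = build_ingest_dag(source, "@daily")
```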
Imagine a world where writing Airflow tasks in languages like Go, R, Julia, or maybe even Rust is not just a dream but a native capability. Say goodbye to BashOperators; welcome to the future of Airflow task execution. Here’s what you can expect to learn from this session:
- Multilingual Tasks: Explore how we empower DAG authors to write tasks in any language while retaining seamless access to Airflow Variables and Connections.
- Simplified Development and Testing: Discover how a standardized interface for task execution promises to streamline development efforts and elevate code maintainability.
- Enhanced Scalability and Remote Workers: Learn how enabling tasks to run on remote workers opens up possibilities for seamless deployment on diverse platforms, including Windows and remote Spark or Ray clusters.
Experience the convenience of effortless deployments as we unlock new avenues for Airflow usage. Join us as we embark on an exploratory journey to shape the future of Airflow task execution. Your insights and contributions are invaluable as we refine this vision together. Let’s chart a course towards a more versatile, efficient, and accessible Airflow ecosystem.
This use case shows how we deal with data of different varieties from different sources. Each source sends data with different layouts, timings, structures, location patterns, and sizes. The goal is to process the files within SLA and send them out. This is a complex, multi-step processing pipeline that involves multiple Spark jobs, API-based integrations with microservices, resolving unique IDs, deduplication, and filtering. Note that this is an event-driven system, but not a streaming data system. The files are of gigabyte scale, and each day the data being processed is of terabyte scale. We will be talking about how to make DAG creation and business logic building a “low-code/no-code process” so that non-technical analysts can write business logic and light developers can deploy DAGs without much manual effort. Every aspect is driven by either source-specific or source-agnostic configuration. Airflow was chosen to enable easy DAG building, scaling, monitoring, troubleshooting, and rerunning.
In today’s data-driven era, ensuring data reliability and enhancing our testing and development capabilities are paramount. Local unit testing has its merits but falls short when dealing with the volume of big data. One major challenge is running Spark jobs pre-deployment to ensure they produce expected results and handle production-level data volumes. In this talk, we will discuss how Autodesk leveraged Astronomer to improve pipeline development. We’ll explore how it addresses challenges with sensitive and large data sets that cannot be transferred to local machines or non-production environments. Additionally, we’ll cover how this approach supports over 10 engineers working simultaneously on different feature branches within the same repo. We will highlight the benefits, such as conflict-free development and testing and the elimination of concerns about data corruption when running DAGs on production Airflow servers. Join me to discover how solutions like Astronomer empower developers to work with increased efficiency and reliability. This talk is perfect for those interested in big data, cloud solutions, and innovative development practices.
In this talk, we’ll discuss how Instacart leverages Apache Airflow to orchestrate a vast network of data pipelines, powering both our core infrastructure and dbt deployments. As a data-driven company, we rely on Airflow to execute large and intricate pipelines securely, compliantly, and at scale. We’ll delve into the following key areas:
a. High-Throughput Cluster Management: We’ll explore how we manage and maintain our Airflow cluster, ensuring the efficient execution of over 2,000 DAGs across diverse use cases.
b. Centralized Airflow Vision: We’ll outline our plans for establishing a company-wide, centralized Airflow cluster, consolidating all Airflow instances at Instacart.
c. Custom Airflow Tooling: We’ll showcase the custom tooling we’ve developed to manage YML-based DAGs, execute DAGs on external ECS workers, leverage Terraform for cluster deployment, and implement robust cluster monitoring at scale.
By sharing our extensive experience with Airflow, we aim to contribute valuable insights to the Airflow community.
AI workloads are becoming increasingly complex, with unique requirements around data management, compute scalability, and model lifecycle management. In this session, we will explore the real-world challenges users face when operating AI at scale. Through real-world examples, we will uncover common pitfalls in areas like data versioning, reproducibility, model deployment, and monitoring. Our practical guide will highlight strategies for building robust and scalable AI platforms leveraging Airflow as the orchestration layer and AWS for its extensive AI/ML capabilities. We will showcase how users have tackled these challenges, streamlined their AI workflows, and unlocked new levels of productivity and innovation.
Airflow’s power comes from its vast ecosystem, but securing this intricate web requires a united front. This talk unveils a groundbreaking collaborative effort between the Python Software Foundation (PSF), the Apache Software Foundation (ASF), the Airflow Project Management Committee (PMC), and the Alpha-Omega Fund, aimed at securing not only Airflow but the whole ecosystem. We’ll explore this new project dedicated to improving security across the Airflow landscape.
As Apache Airflow evolves, a key shift is emerging: the move from task-centric to data-aware orchestration. Traditionally, Airflow has focused on managing tasks efficiently, with limited visibility into the data those tasks manipulate. However, the rise of data-centric workflows demands a new approach—one that puts data at the forefront. This talk will explore how embedding deeper data insights into Airflow can align with modern users’ needs, reducing complexity and enhancing workflow efficiency. We’ll discuss how this evolution can transform Airflow into a more intuitive and powerful tool, better suited to today’s data-driven environments.
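One concrete piece of data-aware orchestration already available in Airflow today is Dataset-driven scheduling; a minimal sketch (dataset URI, DAG ids, and commands are placeholders):

```python
# Sketch: a downstream DAG scheduled by a Dataset rather than a cron expression.
# The URI, DAG ids, and commands are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.bash import BashOperator

orders = Dataset("s3://warehouse/orders.parquet")

with DAG("producer", start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False):
    BashOperator(task_id="load_orders", bash_command="python load_orders.py", outlets=[orders])

with DAG("consumer", start_date=datetime(2024, 1, 1), schedule=[orders], catchup=False):
    BashOperator(task_id="build_report", bash_command="python build_report.py")
```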
Before Airflow 2.9, user management was part of core Airflow, so modifying or customizing it to fit user needs was not easy. Authentication and authorization managers (auth managers) are a new concept introduced in Airflow 2.9. They were introduced as extensible user management (AIP-56), giving Airflow users a flexible way to integrate with their organization’s identity services. Organizations want a single place to manage permissions, and FAB (Flask AppBuilder) made that difficult to achieve. In this talk, after explaining the concept of auth managers and why we built them, we will show you how you can leverage the new auth manager interface to build an authorization service for Airflow based on your existing identity provider. We will see that auth managers can be leveraged to change considerably how users and their permissions are managed in an Airflow environment. Finally, we will dive deep into the AWS auth manager as an alternative auth manager and see some different usages as examples.
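Switching auth managers is a single configuration change; everything else (login, permission checks) is delegated to the chosen implementation. A sketch of the setting follows; the AWS auth manager class path comes from the Amazon provider and should be verified against the provider version you run.

```python
# Sketch: pointing Airflow at a different auth manager via environment-variable
# configuration. The class path below is an assumption to verify for your provider version.
import os

os.environ["AIRFLOW__CORE__AUTH_MANAGER"] = (
    "airflow.providers.amazon.aws.auth_manager.aws_auth_manager.AwsAuthManager"
)
# Equivalent airflow.cfg form:
# [core]
# auth_manager = airflow.providers.amazon.aws.auth_manager.aws_auth_manager.AwsAuthManager
```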
Jupyter Notebooks are widely used by data scientists and engineers to prototype and experiment with data. However, these engineers are often required to work with other data or platform engineers to productionize these experiments, due to the complexity of navigating infrastructure and systems. In this talk, we will dive deep into this PR, https://github.com/apache/airflow/pull/34840, and share how Airflow can be leveraged as a platform to execute notebook pipelines (Python, Scala, or Spark) in dynamic environments like Kubernetes for various heterogeneous use cases. We will demonstrate how data scientists can use a Jupyter extension to easily build and manage such pipelines, which are executed using Airflow, streamlining data science workflow development and supercharging productivity.
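The linked PR takes its own approach, but the basic notebook-as-a-task idea can be sketched with the existing Papermill provider (notebook paths and parameters below are placeholders):

```python
# Sketch using the Papermill provider: a notebook executed as an Airflow task,
# with parameters injected. Paths and parameters are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG("notebook_pipeline", start_date=datetime(2024, 1, 1), schedule=None, catchup=False):
    PapermillOperator(
        task_id="run_notebook",
        input_nb="/opt/notebooks/feature_engineering.ipynb",
        output_nb="/opt/notebooks/out/feature_engineering-{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},
    )
```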
At Bloomberg, it is our team’s responsibility to ensure the timely delivery to our clients worldwide of a vast dataset comprising approximately 5 billion data points on roughly 50 million loans and over 1.4 million securities, disclosed twice a month by three major government-sponsored mortgage entities. Ingesting this data so we can create and derive complex data structures to be consumed by our applications for our clients has been our biggest challenge. In this talk, we will discuss our transition from a manually-managed spreadsheet-based system to an automated centralized orchestration tool, and how Apache Airflow has helped make the process more transparent, predictable, and visible.
As organizations grow, the task of creating and managing Airflow DAGs efficiently becomes a challenge. In this talk, we will delve into innovative approaches to streamlining Airflow DAG creation using YAML. By leveraging YAML configuration, we allow users to dynamically generate Airflow DAGs without requiring Python expertise or deep knowledge of Airflow primitives. We will showcase the significant benefits of this approach, including eliminating duplicate configurations, simplifying DAG management for a large group of workflows, and ultimately enhancing productivity within large organizations. Join us to learn practical strategies to optimize workflow orchestration, reduce development overhead, and facilitate seamless collaboration across teams.
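A minimal sketch of the YAML-driven pattern (an illustrative schema, not a specific tool): each entry in the configuration becomes a DAG at parse time.

```python
# Sketch: generating DAGs from YAML at parse time. The schema, pipeline names,
# and commands are illustrative.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.bash import BashOperator

CONFIG = yaml.safe_load("""
pipelines:
  - name: daily_sales
    schedule: "@daily"
    command: python sales.py
  - name: hourly_clicks
    schedule: "@hourly"
    command: python clicks.py
""")

for spec in CONFIG["pipelines"]:
    with DAG(spec["name"], start_date=datetime(2024, 1, 1),
             schedule=spec["schedule"], catchup=False) as dag:
        BashOperator(task_id="run", bash_command=spec["command"])
    globals()[spec["name"]] = dag  # expose each generated DAG to Airflow's parser
```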
At Stripe, compliance with regulations is of utmost importance, and ensuring the integrity of production data is crucial. To address this challenge, Stripe developed a powerful system called User Scope Mode (USM), which allows users to safely and efficiently test new or existing Airflow pipelines without the risk of corrupting production data. USM takes care of automatically overwriting the necessary configurations for Airflow pipelines, enabling users to test their production-ready pipelines locally with ease. This approach empowers Stripe’s teams to iterate and refine their workflows without the burden of manual setup or the fear of disrupting live operations. In this talk, we’ll dive into the inner workings of USM and explore how it has transformed Stripe’s development and testing processes. Discover how this system seamlessly integrates with Airflow, allowing users to validate their pipelines with confidence and agility, all while maintaining the highest standards of compliance and data integrity.
Since version 2.7 and the advent of AIP-51, Airflow has started to fully support the creation of custom executors. Before we dive into the components of an executor and how they work, we will briefly discuss the Executor Decoupling initiative which allowed this new feature. Once we understand the parts required, we will explore the process of crafting our own executors, using real-world examples, and demonstrations of executors developed within the Amazon Provider Package as a guide. By demystifying the process of executor creation and emphasizing the opportunities for contribution, we aim to empower Airflow users and providers to harness the full potential of custom executors, enriching the Airflow ecosystem as a whole!
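To give a sense of the shape of an executor (a bare-bones sketch of the BaseExecutor interface, not a working backend; the _submit/_is_done/_succeeded helpers are hypothetical placeholders for backend-specific logic):

```python
# Sketch of a custom executor's skeleton: execute_async hands a task to your
# compute layer, sync reports state back to the scheduler on each heartbeat.
from airflow.executors.base_executor import BaseExecutor

class SketchExecutor(BaseExecutor):
    def start(self):
        # Set up clients/connections to the compute backend here.
        self.active = {}

    def execute_async(self, key, command, queue=None, executor_config=None):
        # `command` is the full `airflow tasks run ...` invocation; submit it and
        # keep a handle so sync() can poll it later.
        self.active[key] = self._submit(command)

    def sync(self):
        # Called on every scheduler heartbeat: poll the backend and report task state.
        for key, handle in list(self.active.items()):
            if self._is_done(handle):
                if self._succeeded(handle):
                    self.success(key)
                else:
                    self.fail(key)
                del self.active[key]

    def end(self):
        # Wait for in-flight work to finish before shutting down (omitted in this sketch).
        pass

    # --- Hypothetical backend-specific placeholders ---
    def _submit(self, command):
        raise NotImplementedError("submit the command to your compute backend")

    def _is_done(self, handle):
        raise NotImplementedError

    def _succeeded(self, handle):
        raise NotImplementedError
```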