talk-data.com

Event

Airflow Summit 2024

2024-07-01 · Airflow Summit

Activities tracked: 8

Airflow Summit 2024 program

Filtering by: Kubernetes

Sessions & talks

Showing 1–8 of 8 · Newest first


Building in Resource Awareness and Event Dependency into Airflow

2024-07-01
session

In this talk, we will explore how adding custom dependency checks into Airflow’s scheduling system can elevate Airflow’s performance. We will specifically discuss how we added general upstream event dependency checking, as well as how to make Airflow aware of used and available compute resources so that the system can better decide when and where to run a given task on Kubernetes infrastructure. We’ll cover why the existing dependency checking in Airflow is not sufficient for our use case, why adding custom code to Airflow is needed, and the pros and cons of this approach.
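
As a rough illustration of the kind of custom check described here (not the speakers’ actual implementation), the sketch below shows a custom Airflow sensor that only lets downstream work proceed once an upstream event has fired and enough cluster CPU is free; the event and capacity lookups are hypothetical placeholders.

```python
from airflow.sensors.base import BaseSensorOperator


def upstream_event_fired(event_name: str) -> bool:
    """Placeholder for a real event-bus lookup (hypothetical)."""
    return True


def free_cluster_cpu() -> float:
    """Placeholder for a real Kubernetes capacity query (hypothetical)."""
    return 8.0


class EventAndResourceSensor(BaseSensorOperator):
    """Reschedules itself until an upstream event has fired and enough CPU is free."""

    def __init__(self, event_name: str, min_free_cpu: float, **kwargs):
        super().__init__(**kwargs)
        self.event_name = event_name
        self.min_free_cpu = min_free_cpu

    def poke(self, context) -> bool:
        # Both conditions must hold before downstream work is allowed to run.
        return (
            upstream_event_fired(self.event_name)
            and free_cluster_cpu() >= self.min_free_cpu
        )
```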

Empowering Airflow Users: A framework for performance testing and transparent resource optimization

2024-07-01
session

Apache Airflow is the backbone of countless data pipelines, but optimizing performance and resource utilization can be a challenge. This talk introduces a novel performance testing framework designed to measure, monitor, and improve the efficiency of Airflow deployments. I’ll delve into the framework’s modular architecture, showcasing how it can be tailored to various Airflow setups (Docker, Kubernetes, cloud providers). By measuring key metrics across schedulers, workers, triggerers, and databases, this framework provides actionable insights to identify bottlenecks and compare performance across different versions or configurations. Attendees will learn:
- The motivation behind developing a standardized performance testing approach.
- Key design considerations and challenges in measuring performance across diverse Airflow environments.
- How to leverage the framework to construct test suites for different use cases (e.g., version comparison).
- Practical tips for interpreting performance test results and making informed decisions about resource allocation.
- How this framework contributes to greater transparency in Airflow release notes, empowering users with performance data.
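
For a sense of what “measuring a run” can look like at its simplest (this is not the framework from the talk), the sketch below triggers a benchmark DAG through Airflow’s stable REST API and times the run end to end; the endpoint paths are standard Airflow 2 API routes, while the base URL, credentials, and DAG id are assumptions.

```python
import time
from datetime import datetime

import requests

BASE_URL = "http://localhost:8080/api/v1"  # hypothetical deployment
AUTH = ("admin", "admin")                  # hypothetical basic-auth credentials


def time_dag_run(dag_id: str = "benchmark_dag") -> float:
    """Trigger one run of `dag_id` and return its wall-clock duration in seconds."""
    resp = requests.post(f"{BASE_URL}/dags/{dag_id}/dagRuns", json={"conf": {}}, auth=AUTH)
    resp.raise_for_status()
    run_id = resp.json()["dag_run_id"]

    # Poll until the run reaches a terminal state.
    while True:
        run = requests.get(f"{BASE_URL}/dags/{dag_id}/dagRuns/{run_id}", auth=AUTH).json()
        if run["state"] in ("success", "failed"):
            break
        time.sleep(5)

    start = datetime.fromisoformat(run["start_date"])
    end = datetime.fromisoformat(run["end_date"])
    return (end - start).total_seconds()
```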

Empowering Business Analysts with DAG Authoring IDE Running 8000 Workflows

2024-07-01
session

At Wix, more often than not, business analysts build workflows themselves to avoid data engineers becoming a bottleneck. But how do you enable them to create SQL ETLs that start when dependencies are ready and send emails or refresh Tableau reports when the work is done? One simple answer is to use Airflow. The problem is that not every BA can be expected to know Python and Git well enough to create thousands of DAGs easily. To bridge this gap we built a web-based IDE, called Quix, that allows simple notebook-like development of Trino SQL workflows and converts them to Airflow DAGs when a user hits the “schedule” button. During the talk we will go through the problems of building a reliable and extendable DAG-generating tool, why we preferred Airflow over Apache Oozie, and the tricks (sharding, HA mode, etc.) that allow Airflow to run 8,000 active DAGs on a single cluster in Kubernetes.
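
The sketch below is a hypothetical, simplified version of the “SQL cells to DAG” idea, not Wix’s actual Quix code: each notebook cell becomes one SQLExecuteQueryOperator task wired in sequence, with the cell contents and the Trino connection id invented for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# Cells as a user might author them in a notebook-like IDE (hypothetical content).
SQL_CELLS = [
    ("stage_orders", "CREATE TABLE tmp_orders AS SELECT * FROM raw.orders"),
    ("daily_revenue",
     "INSERT INTO marts.revenue SELECT order_date, sum(amount) FROM tmp_orders GROUP BY order_date"),
]

with DAG(
    dag_id="quix_generated_example",
    start_date=datetime(2024, 7, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    previous = None
    for task_id, sql in SQL_CELLS:
        task = SQLExecuteQueryOperator(
            task_id=task_id,
            conn_id="trino_default",  # assumed Trino connection name
            sql=sql,
        )
        if previous:
            previous >> task  # chain cells in authoring order
        previous = task
```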

How we Run 100 Airflow Environments and Millions of Tasks as a Part Time Job Using Kubernetes

2024-07-01
session

Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. We have more than 100 teams who run a variety of workloads that benefit from orchestration and parallelization. Platform engineers working for companies with Kubernetes ecosystems can use their Kubernetes knowledge and leverage their platform to run Airflow and troubleshoot problems successfully. BAM’s Kubernetes platform provides production-ready Airflow environments that automatically get logging, metrics, alerting, scalability, storage from a range of file systems, authentication, dashboards, secrets management, and specialized compute including GPU, CPU-optimized, memory-optimized, and even Windows nodes. If you can run thousands of pods on your Kubernetes cluster, then you can run thousands of tasks without needing to do anything extra! This talk covers:
- Why Kubernetes and Airflow work so well together.
- How a team of platform engineers can leverage their Kubernetes platform and knowledge to run millions of tasks without Airflow being their primary focus.
- Examples of where this model can start to fall apart at extreme scale.
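
As a hedged illustration of how specialized compute can be requested per task on a shared Kubernetes platform, the sketch below uses KubernetesPodOperator with a GPU resource request and a node selector; the image, namespace, and node-pool label are hypothetical, and parameter names and import paths vary somewhat across cncf.kubernetes provider versions.

```python
from kubernetes.client import models as k8s

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Would normally be declared inside a DAG; shown standalone for brevity.
train_on_gpu = KubernetesPodOperator(
    task_id="train_on_gpu",
    name="train-on-gpu",
    image="registry.example.com/trainer:latest",  # hypothetical image
    namespace="team-quant",                       # hypothetical namespace
    node_selector={"node-pool": "gpu"},           # assumed node-pool label
    container_resources=k8s.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        limits={"nvidia.com/gpu": "1"},
    ),
    get_logs=True,
)
```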

How we use Airflow at Booking to Orchestrate Big Data Workflows

2024-07-01
session

The talk will cover how we use Airflow at the heart of our Workflow Management Platform (WFM) at Booking.com, enabling our internal users to orchestrate big data workflows on Booking Data Exchange (BDX). High-level overview of the talk:
- Adapting the open source Airflow Helm chart to spin up Airflow installations in Booking Kubernetes Service (BKS)
- Coming up with a workflow definition format (YAML)
- Conversion of workflow.yaml to workflow.py DAGs
- Usage of deferrable operators to provide standard step templates to users
- Workspaces (collections of workflows), used to ensure role-based access to DAG permissions for users
- Using Okta for authentication
- Alerting, monitoring, and logging
- Plans to shift to Astronomer
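
The talk does not publish Booking’s workflow.yaml schema, so the sketch below invents a tiny one purely to illustrate the general YAML-to-DAG conversion step; the step format, commands, and field names are assumptions.

```python
from datetime import datetime

import yaml

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical workflow definition a user might submit.
WORKFLOW_YAML = """
name: example_workflow
schedule: "@daily"
steps:
  - id: extract
    command: "echo extract"
  - id: load
    command: "echo load"
    depends_on: [extract]
"""

spec = yaml.safe_load(WORKFLOW_YAML)

with DAG(
    dag_id=spec["name"],
    start_date=datetime(2024, 7, 1),
    schedule=spec["schedule"],
    catchup=False,
) as dag:
    # One task per step, then wire up the declared dependencies.
    tasks = {
        step["id"]: BashOperator(task_id=step["id"], bash_command=step["command"])
        for step in spec["steps"]
    }
    for step in spec["steps"]:
        for upstream in step.get("depends_on", []):
            tasks[upstream] >> tasks[step["id"]]
```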

Optimizing Critical Operations: Enhancing Robinhood's Workflow Journey with Airflow

2024-07-01
session

Airflow is widely used within Robinhood. In addition to traditional offline analytics use cases (to schedule ingestion and analytics workloads that populate our data lake), we also use Airflow in our backend services to orchestrate various workflows that are highly critical for the business, e.g. compliance and regulatory reporting, user-facing reports, and more. As part of this, we have evolved what we believe is a unique deployment architecture for Airflow. We have central schedulers that are responsible for workloads from multiple different teams, but the workflow tasks themselves run on workers owned by the respective teams, tightly coupled with their backend services and codebase. Furthermore, Robinhood augmented Airflow with a number of customizations: an Airflow worker template for Kubernetes, enhanced observability, enhanced SLA detection, and a collection of operators, sensors, and plugins to tailor Airflow to our exact needs. This session walks through how we grew our architecture and adapted Airflow to fit Robinhood’s variety of needs and use cases.
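
One plain-Airflow way to get the “central scheduler, team-owned workers” routing described above is Celery queue assignment, sketched below; the queue name and script path are hypothetical, and Robinhood’s actual setup is not detailed in the abstract.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="compliance_report",
    start_date=datetime(2024, 7, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # The central scheduler schedules this task, but only Celery workers started
    # with `airflow celery worker --queues compliance-team` will pick it up.
    generate_report = BashOperator(
        task_id="generate_report",
        bash_command="python /opt/compliance/generate_report.py",  # hypothetical script
        queue="compliance-team",
    )
```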

Orchestration of ML workloads via Airflow & GKE Batch

2024-07-01
session

During this talk we are going to give an overview of different orchestration approaches (Kubeflow, Ray, Airflow, etc.) for running ML workloads on Kubernetes, and we will focus specifically on how to use the Kubernetes Batch API and Kubernetes operators to run complex ML workloads.
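
As a minimal sketch of driving the Kubernetes Batch API from an Airflow task (one possible wiring, not the talk’s reference implementation), the example below submits a batch Job with the official kubernetes Python client; the image, namespace, and arguments are hypothetical.

```python
from airflow.decorators import task
from kubernetes import client, config


@task
def submit_training_job():
    # Assumes the Airflow worker runs inside the cluster; use load_kube_config() otherwise.
    config.load_incluster_config()

    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="train-model"),
        spec=client.V1JobSpec(
            backoff_limit=2,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="trainer",
                            image="registry.example.com/ml-trainer:latest",  # hypothetical
                            args=["--epochs", "10"],
                        )
                    ],
                )
            ),
        ),
    )
    # Hand the Job off to the Kubernetes Batch API.
    client.BatchV1Api().create_namespaced_job(namespace="ml-workloads", body=job)
```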

Streamline data science workflow development using Jupyter notebooks and Airflow

2024-07-01
session

Jupyter notebooks are widely used by data scientists and engineers to prototype and experiment with data. However, these engineers are often required to work with other data or platform engineers to productionize these experiments due to the complexity of navigating infrastructure and systems. In this talk, we will deep dive into this PR, https://github.com/apache/airflow/pull/34840, and share how Airflow can be leveraged as a platform to execute notebook pipelines (Python, Scala, or Spark) in dynamic environments like Kubernetes for various heterogeneous use cases. We will demonstrate how data scientists can use a Jupyter extension to easily build and manage such pipelines, which are executed using Airflow, streamlining data science workflow development and supercharging productivity.
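
Independently of the specifics of the linked PR, one established way to execute parameterized notebooks from Airflow is the Papermill provider; the sketch below is a minimal example of that pattern, with hypothetical notebook paths and parameters.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="notebook_pipeline",
    start_date=datetime(2024, 7, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Executes the notebook with injected parameters and writes a dated output copy.
    run_notebook = PapermillOperator(
        task_id="run_notebook",
        input_nb="/notebooks/feature_engineering.ipynb",               # hypothetical path
        output_nb="/notebooks/out/feature_engineering-{{ ds }}.ipynb",  # hypothetical path
        parameters={"run_date": "{{ ds }}"},
    )
```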