Airflow

Airflow and Control-M: Where Data Pipelines Meet Business Applications in Production

2024-07-01 · Airflow Summit 2024

session

by Basil Faruqui

Cloud Computing Cyber Security

This talk is presented by BMC Software With Airflow’s mainstream acceptance in the enterprise, the operational challenges of running with applications in production have emerged. At last year’s Airflow Summit in Toronto, three providers of Apache Airflow met to discuss “The Future of Airflow: What Users Want”. Among the user requirements in the session were: An improved security model allowing “Alice” and “Bob” to run their single DAGs without each requiring a separate Airflow cluster, while still adhering to their organization’s compliance requirements. An “Orchestrator of Orchestrators” relationship in which Airflow oversees the myriad orchestrators embedded in many tools and provided by cloud vendors. That panel discussion described what Airflow users now understand to be mandatory for their workloads in enterprise production, and defined the exact operational requirements our customers have successfully tackled for decades. Join us in this session to learn how Control-M’s Airflow integration helps data engineers do what they need to do with Airflow and gives IT Ops the key to deliver enterprise business application results in production.

Airflow and multi-cluster Slurm working together

2024-07-01 · Airflow Summit 2024

session

by Eloi Codina Torras

Meteosim provides environmental services, mainly based on weather and air quality intelligence, and helps customers make operational and tactical decisions and understand their companies’ environmental impact. We introduced Airflow a couple of years ago to replace a huge Crontab file and we currently have around 7000 DAG Runs per day. In this presentation we will introduce the hardest challenge we had to overcome: adapting Airflow to run on multiple Slurm-managed HPC clusters by using deferrable operators. Slurm is an open-source cluster manager, used especially in science-based companies or organizations and many supercomputers worldwide. By using Slurm our simulations run on bare metal nodes, eliminating overhead and speeding up the intensive calculations. Moreover, we will present our use case: how we use Airflow to provide our services and how we streamlined the DAG creation process, so our Product Engineers need to write a few lines of code and all DAGs are standardized and stored in a database.

Airflow-as-an-Engine: Lessons from Open-Source Applications Built On Top of Airflow

2024-07-01 · Airflow Summit 2024

session

by Ian Moritz

Cosmos dbt Meltano Data Streaming

Airflow is often used for running data pipelines, which themselves connect with other services through the provider system. However, it is also increasingly used as an engine under-the-hood for other projects building on top of the DAG primitive. For example, Cosmos is a framework for automatically transforming dbt DAGs into Airflow DAGs, so that users can supplement the developer experience of dbt with the power of Airflow. This session dives into how a select group of these frameworks (Cosmos, Meltano, Chronon) use Airflow as an engine for orchestrating complex workflows their systems depend on. In particular, we will discuss ways that we’ve increased Airflow performance to meet application-specific demands (high-task-count Cosmos DAGs, streaming jobs in Chronon), new Airflow features that will evolve how these frameworks use Airflow under the hood (DAG versioning, dataset integrations), and paths we see these projects taking over the next few years as Airflow grows. Airflow is not just a DAG platform, it’s an application platform!

Airflow as a workflow for Self Service Based Ingestion

2024-07-01 · Airflow Summit 2024

session

by Ramesh Babu

Analytics

Our Idea to platformize Ingestion pipelines is driven via Airflow in the background and streamline the entire ingestion process for Self Service. With customer experience on top of it and making data ingestion fool proof as part of Analytics data team, Airflow is just complementing for our vision.

Airflow at Burns & McDonnell | Orchestration from zero to 100

2024-07-01 · Airflow Summit 2024

session

by Bonnie Why (Burns & McDonnell)

Data Lakehouse

As the largest employee-owned engineering and construction firm in the United States, Burns & McDonnell has a massive amount of data. Not only that, it’s hard to pinpoint which source system has the data we need. Our solution to this challenge is to build a unified information platform — a single source of truth where all of our data is searchable, trustworthy, and accessible to our employee-owners and the projects that need it. Everyone’s data is important and everyone’s use case is a priority, so how can we get this done quickly? In this session, I will tell you all about how we went from having zero knowledge in Airflow to ingesting many unique and disconnected data sources into our data lakehouse in less than a day. Come hear the story about how our data team at Burns & McDonnell is using Airflow as an orchestrator to create a scalable, trustworthy data platform that will empower our system to evolve with the ever-changing technology landscape.

Airflow at Ford: A Job Router Training Advance Driver Assistance Systems

2024-07-01 · Airflow Summit 2024

session

by Doug Rogan , Vasantha Kosuri Marshall

AI/ML Astronomer Cloud Computing DataOps ELK GCP

Ford Motor Company operates extensively across various nations. The Data Operations (DataOps) team for Advanced Driver Assistance Systems (ADAS) at Ford is tasked with the processing of terabyte-scale daily data from lidar, radar, and video. To manage this, the DataOps team is challenged with orchestrating diverse, compute-intensive pipelines across both on-premises infrastructure and the GCP and deal with sensitive of customer data across both environments The team is also responsible for facilitating the execution of on-demand, compute-intensive algorithms at scale through. To achieve these objectives, the team employs Astronomer/Airflow at the core of its strategic approach. This involves various deployments of Astronomer/Airflow that integrate seamlessly and securely (via Apigee) to initiate batch data processing and ML jobs on the cloud, as well as compute-intensive computer vision tasks on-premises, with essential alerting provided through the ELK stack. This presentation will delve into the architecture and strategic planning surrounding the hybrid batch router, highlighting its pivotal role in promoting rapid innovation and scalability in the development of ADAS features.

Airflow at NCR Voyix: Streamlining ML workflows development with Airflow

2024-07-01 · Airflow Summit 2024

session

by Shahar Epstein

AI/ML Analytics CI/CD MLOps Python

NCR Voyix Retail Analytics AI team offers ML products for retailers while embracing Airflow as its MLOps Platform. As the team is small and there have been twice as many data scientists as engineers, we encountered challenges in making Airflow accessible to the scientists: As they come from diverse programming backgrounds, we needed an architecture enabling them to develop production-ready ML workflows without prior knowledge of Airflow. Due to dynamic product demands, we had to implement a mechanism to interchange Airflow operators effortlessly. As workflows serve multiple customers, they should be easily configurable and simultaneously deployable. We came up with the following architecture to deal with the above: Enabling our data scientists to formulate ML workflows as structured Python files. Seamlessly converting the workflows into Airflow DAGs while aggregating their steps to be executed on different Airflow operators. Deploying DAGs via CI/CD’s UI to the DAGs folder for all customers while considering definitions for each in their configuration files. In this session, we will cover Airflow’s evolution in our team and review the concepts of our architecture.

Airflow Datasets and Pub/Sub for Dynamic DAG Triggering

2024-07-01 · Airflow Summit 2024

session

by Andrea Bombino , Nawfel Bacha

Cloud Computing Data Engineering dbt GCP Pub/Sub

Looking for a way to streamline your data workflows and master the art of orchestration? As we navigate the complexities of modern data engineering, Airflow’s dynamic workflow and complex data pipeline dependencies are starting to become more and more common nowadays. In order to empower data engineers to exploit Airflow as the main orchestrator, Airflow Datasets can be easily integrated in your data journey. This session will showcase the Dynamic Workflow orchestration in Airflow and how to manage multi-DAGs dependencies with Multi-Dataset listening. We’ll take you through a real-time data pipeline with Pub/Sub messaging integration and dbt in Google Cloud environment, to ensure data transformations are triggered only upon new data ingestion, moving away from rigid time-based scheduling or the use of sensors and other legacy ways to trigger a DAG.

Airflow - Path to Industry Orchestration Standard

2024-07-01 · Airflow Summit 2024

session

by Rafal Biegacz , Filip Knapik

AI/ML Cloud Computing Data Engineering

In the realm of data engineering, machine learning pipelines and using cloud and web services there is a huge demand for orchestration technologies. Apache Airflow belongs to the most popular orchestration technologies or even is the most popular one. In this presentation we are going to focus these aspects of Airflow that make it so popular and whether it became the orchestration industry standard.

Airflow, Spark, and LLMs: Turbocharging MLOps at ASAPP

2024-07-01 · Airflow Summit 2024

session

by Udit Saxena

AI/ML Cloud Computing GenAI LLM MLOps Spark

This talk will explore ASAPP’s use of Apache Airflow to streamline and optimize our machine learning operations (MLOps). Key highlights include: Integrating with our custom Spark solution for achieving speedup, efficiency, and cost gains for generative AI transcription, summarization and intent categorization pipelines Different design patterns of integrating with efficient LLM servers - like TGI/vllm/tensor-RT for Summarization pipelines with/without Spark. An overview of batched LLM inference using Airflow as opposed to real time inference outside of it [Tentative] Possible extension of this scaffolding to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) for fine-tuning LLMs, using Airflow as the orchestrator. Additionally, the talk will cover ASAPP’s MLOps journey with Airflow over the past few years, including an overview of our cloud infrastructure, various data backends, and sources. The primary focus will be on the machine learning workflows at ASAPP, rather than the data workflows, providing a detailed look at how Airflow enhances our MLOps processes.

Airflow UI Roadmap

2024-07-01 · Airflow Summit 2024

session

by Brent Bovenzi (Astronomer)

API React

Soon we will finally switch to a 100% React UI with a full separation between the API and UI as well. While we are doing such a big change, let’s also take the opportunity to imagine whole new interfaces vs just simply modernizing the existing views. How can we use design to help you better understand what is going on with your DAG? Come listen to some of our proposed ideas and bring your own big ideas as the second half will be an open discussion.

Airflow Unleashed: Making Hundreds of Deployments A Day at Coinbase

2024-07-01 · Airflow Summit 2024

session

by Jianlong Zhong

ETL/ELT

At Coinbase, Airflow is the backbone of ELT, supported by a vibrant community of over 500 developers. This vast engagement results in a continuous stream of enhancements, with hundreds of commits tested and released daily. However, this scale of development presents its own set of challenges, especially in deployment velocity. Traditional deployment methodologies proved inadequate, significantly impeding the productivity of our developers. Recognizing the critical need for a solution that matches our pace of innovation, we developed AirAgent: a bespoke, fully autonomous deployer designed specifically for Airflow. Capable of deploying updates hundreds of times a day on both staging and production environments, AirAgent has transformed our development lifecycle, enabling immediate iteration and drastically improving developer velocity. This talk aims to unveil the inner workings of AirAgent, highlighting its design principles, deployment strategies, and the challenges we overcame in its implementation. By sharing our journey, we hope to offer insights and strategies that can benefit others in the Airflow community, encouraging a shift towards a high-frequency deployment workflow.

A New DAG Paradigm: Less Airflow more DAGs

2024-07-01 · Airflow Summit 2024

session

by Maggie Stark , Marion Azoulai

Astronomer

Astronomer’s data team recently underwent a major shift in how we work with Airflow. We’ll deep dive into the challenges which prompted that change, how we addressed them and where we are now. This re-architecture included: Switching to dataset scheduling and micro-pipelines to minimize failures and increase reliability. Implementing a Control DAG for complex dependency management and full end-to-end pipeline visibility. Standardized Task Groups for quick onboarding and scalability. With Airflow managing itself, we can once again focus on the data rather than the operational overhead. As proof we’ll share our favorite statistics from the terabyte of data we process daily revealing insights into how the world’s data teams use Airflow.

Architecting Blockchain ETL Orchestration: Circle's Airflow Usecase

2024-07-01 · Airflow Summit 2024

session

by Nathaniel Rose (Circle)

Blockchain CI/CD ETL/ELT

This talk focuses on exploring the implementation of Apache Airflow for Blockchain ETL orchestration, indexing, and the adoption of GitOps at Circle. IT will cover CICD tips, architectural choices for managing Blockchain data at scale, engineering practices to enable data scientists and some learnings from production.

Automated Testing and Deployment of DAGs

2024-07-01 · Airflow Summit 2024

session

by Austin Bennett

CI/CD GitHub

DAG integrity is critical. So are coding conventions, consistency in standards for the group. In this talk, we will share the various lessons learned for testing/verifying our DAGs as part of our GitHub workflows [ for testing as part of the pull request process, and for automated deployment - eventually to production - once merged ]. We will dig into how we have unlocked additional efficiencies, catch errors before they get deployed, and generally how we are better off for having both Airflow & plenty of checks in our CI, before we merge/deploy.

Behaviour Driven Development in Airflow

2024-07-01 · Airflow Summit 2024

session

by Ole Christian Langfjæran (Unacast)

GitHub

Behaviour Driven Development can, in the simplest of terms, be described as Test Driven Development, only readable. It is of course more than that, but that is not the aim of this talk. This talk aims to show: How to write tests before you write a single line of Airflow code Create reusable and readable steps for setting up tests, in a given-when-then manner. Test rendering and execution of your DAG’s tasks Real world examples from a monorepo containing multiple Airflow projects Written only with pytest, and some code I stole from smart people in github.com/apache/airflow/tests

Boost Airflow Monitoring and Alerting with Automation Analytics & Intelligence by Broadcom

2024-07-01 · Airflow Summit 2024

session

by Jennifer Chisik

Analytics Cloud Computing

This talk is presented by Broadcom. Airflow’s “workflow as code” approach has many benefits, including enabling dynamic pipeline generation and flexibility and extensibility in a seamless development environment. However, what challenges do you face as you expand your Airflow footprint in your organization? What if you could enhance Airflow’s monitoring capabilities, forecast DAG and task executions, obtain predictive alerting, visualize trends, and get more robust logging? Broadcom’s Automation Analytics & Intelligence (AAI) offers advanced analytics for workload automation for cloud and on-premises automation. It connects easily with Airflow to offer improved visibility into dependencies between tasks in Airflow DAGs along with the workload’s critical path, dynamic SLA management, and more. Join our presentation to hear more about how AAI can help you improve service delivery. We will also lead a workshop that will allow you to dive deeper into how easy it is to install our Airflow Connector and get started visualizing your Airflow DAGs to optimize your workload and identify issues before they impact your business.

Bronco: Managing Terraform at Scale with Airflow

2024-07-01 · Airflow Summit 2024

session

by Jack Cusick

Cloud Computing Terraform

Airflow is not just purpose-built for data applications. It is a job scheduler on steroids. This is exactly what a cloud platform team needs: a configurable and scalable automation tool that can handle thousands of administrative tasks. Come learn how one enterprise platform team used Airflow to support cloud infrastructure at unprecedented scale.

Building in Resource Awareness and Event Dependency into Airflow

2024-07-01 · Airflow Summit 2024

session

by Roberto Santamaria , Anandhi Murali

Kubernetes

In this talk, we will explore how adding custom dependency checks into Airflow’s scheduling system can elevate Airflow’s performance. We will specifically discuss how we added general upstream events dependency checking as well as how to make Airflow aware of used/available compute resources so that the system can better decide when and where to run a given task on Kubernetes infrastructure. We’ll cover why the existing dependency checking in Airflow is not sufficient in our use case, and why adding custom code to Airflow is needed. We’ll cover the pros and cons with this approach.

Building on Cosmos: Making dbt on Airflow Easy

2024-07-01 · Airflow Summit 2024

session

by Lewis Macdonald , Ethan Stone

Cosmos dbt Redshift Snowflake SQL

Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. As dbt took hold at BAM, we had multiple teams building dbt projects against Snowflake, Redshift, and SQL Server. The common question was: How can we quickly and easily productionise our projects? Airflow is the orchestrator of choice at BAM, but our dbt users ranged from Airflow power users to people who’d never heard of Airflow before. We built a single solution on top of Cosmos that allowed us to: Decouple the dbt project from the Airflow repository Have each dbt node run as a separate Airflow task Allow users to run dbt with little to no Airflow knowledge Enable users to have fine-grained control over how dbt is run and to combine it with other Airflow tasks Provide observability, monitoring, and alerting.

talk-data.com

Activity Trend

Top Events

Top Speakers

Airflow and Control-M: Where Data Pipelines Meet Business Applications in Production

Airflow and multi-cluster Slurm working together

Airflow-as-an-Engine: Lessons from Open-Source Applications Built On Top of Airflow

Airflow as a workflow for Self Service Based Ingestion

Airflow at Burns & McDonnell | Orchestration from zero to 100

Airflow at Ford: A Job Router Training Advance Driver Assistance Systems

Airflow at NCR Voyix: Streamlining ML workflows development with Airflow

Airflow Datasets and Pub/Sub for Dynamic DAG Triggering

Airflow - Path to Industry Orchestration Standard

Airflow, Spark, and LLMs: Turbocharging MLOps at ASAPP

Airflow UI Roadmap

Airflow Unleashed: Making Hundreds of Deployments A Day at Coinbase

A New DAG Paradigm: Less Airflow more DAGs

Architecting Blockchain ETL Orchestration: Circle's Airflow Usecase

Automated Testing and Deployment of DAGs

Behaviour Driven Development in Airflow

Boost Airflow Monitoring and Alerting with Automation Analytics & Intelligence by Broadcom

Bronco: Managing Terraform at Scale with Airflow

Building in Resource Awareness and Event Dependency into Airflow

Building on Cosmos: Making dbt on Airflow Easy