talk-data.com

Event

Airflow Summit 2024

2024-07-01 Airflow Summit

Activities tracked

98

Airflow Summit 2024 program

Sessions & talks

Showing 1–25 of 98 · Newest first


10 years of Airflow: history, insights, and looking forward

2024-07-01
session

10 years after its creation, Airflow is stronger than ever: in last year’s Airflow survey, 81% of users said Airflow is important or very important to their business, 87% said their Airflow usage has grown over time, and 92% said they would recommend Airflow. In this panel discussion, we’ll celebrate a decade of Airflow and delve into how it became the highly recommended industry standard it is today, including history, pivotal moments, and the role of the community. Our panel of seasoned experts will also talk about where Airflow is going next, including future use cases like generative AI and the highly anticipated Airflow 3.0. Don’t miss this insightful exploration into one of the most influential tools in the data landscape.

Activating operational metadata with Airflow, Atlan and OpenLineage

2024-07-01
session

OpenLineage is an open standard for lineage data collection, integrated into the Airflow codebase and facilitating lineage collection across providers like Google, Amazon, and more. Atlan Data Catalog is a third-generation active metadata platform: a single source of trust unifying cataloging, data discovery, lineage, and governance. We will demonstrate what OpenLineage is and how, with minimal and intuitive setup across Airflow and Atlan, it presents a unified workflow view and efficient cross-platform lineage collection, including column-level lineage, across technologies (Python, Spark, dbt, SQL, etc.) and clouds (AWS, Azure, GCP, etc.), all orchestrated by Airflow. This integration unlocks further use cases in automated metadata management by making operational pipelines dataset-aware for self-service exploration. We will also demonstrate real-world challenges and resolutions for lineage consumers in improving audit and compliance accuracy through column-level lineage traceability across the data estate. The talk will also briefly overview the most recent OpenLineage developments and planned future enhancements.

Adaptive Memory Scaling for Robust Airflow Pipelines

2024-07-01
session

At Vibrant Planet, we’re on a mission to make the world’s communities and ecosystems more resilient in the face of climate change. Our cloud-based platform is designed for collaborative scenario planning to tackle wildfires, climate threats, and ecosystem restoration on a massive scale. In this talk we will dive into how we are using Airflow, focusing on how we’re making Airflow pipelines smarter and more resilient when processing large satellite imagery and other geospatial data. Self-healing pipelines: we will discuss our self-healing pipelines, which identify likely out-of-memory events and incrementally allocate more memory for task-instance retries, ensuring robust and uninterrupted workflow execution. Initial memory recommendations: we’ll discuss how we set intelligent initial memory allocations for each task instance, enhancing resource efficiency from the outset.
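The retry escalation described above can be sketched in a few lines; the function name, scaling factor, and cap here are hypothetical, not Vibrant Planet’s actual implementation.

```python
def memory_for_attempt(base_mib: int, try_number: int,
                       factor: float = 1.5, cap_mib: int = 32768) -> int:
    """Memory request (in MiB) for a given attempt of a task instance.

    try_number is 1 for the first attempt; each retry after a likely
    out-of-memory failure requests `factor` times more memory, capped
    so a pathological task cannot consume the whole cluster.
    """
    scaled = base_mib * factor ** (try_number - 1)
    return min(int(scaled), cap_mib)
```

In Airflow, a value like this could feed a per-retry executor resource override, for example a Kubernetes pod override computed from the task instance’s try number.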

A deep dive into Airflow configuration options for scalability

2024-07-01
session

Apache Airflow has a lot of configuration options. A change in some of these options can affect the performance of Airflow. If you are wondering why your Airflow instance is not running the number of tasks you expected it to run, after this talk, you will have a better understanding of the configuration options available for improving the number of tasks your Airflow instance can run. We will talk about the DAG parsing configuration options, options for scheduler scalability, etc., and the pros and cons of these options.
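As a reminder of how these options are set in practice: any airflow.cfg option can also be supplied through an AIRFLOW__{SECTION}__{KEY} environment variable. The sketch below builds such overrides for a few commonly tuned scalability options; the values are only examples, not recommendations.

```python
def airflow_env_var(section: str, key: str, value: str) -> tuple[str, str]:
    """Build the environment-variable override for an airflow.cfg option.

    Airflow reads AIRFLOW__<SECTION>__<KEY> (upper-case) in preference
    to the value stored in airflow.cfg.
    """
    return f"AIRFLOW__{section.upper()}__{key.upper()}", value

# A few options that commonly come up when tuning task throughput:
overrides = dict([
    airflow_env_var("core", "parallelism", "64"),              # concurrently running task instances
    airflow_env_var("core", "max_active_tasks_per_dag", "32"), # per-DAG task concurrency
    airflow_env_var("scheduler", "parsing_processes", "4"),    # parallel DAG file parsers
])
```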

A Game of Constant Learning & Adjustment: Orchestrating ML Pipelines at the Philadelphia Phillies

2024-07-01
session

When developing Machine Learning (ML) models, the biggest challenges are often infrastructural. How do we deploy our model and expose an inference API? How can we retrain? Can we continuously evaluate performance and monitor model drift? In this talk, we will present how we are tackling these problems at the Philadelphia Phillies by developing a suite of tools that enable our software engineering and analytics teams to train, test, evaluate, and deploy ML models - that can be entirely orchestrated in Airflow. This framework abstracts away the infrastructural complexities that productionizing ML Pipelines presents and allows our analysts to focus on developing robust baseball research for baseball operations stakeholders across player evaluation, acquisition, and development. We’ll also look at how we use Airflow, MLflow, MLServer, cloud services, and GitHub Actions to architect a platform that supports our framework for all points of the ML Lifecycle.

AIP-63: DAG Versioning - Where are we?

2024-07-01
session

Join us as we check in on the current status of AIP-63: DAG Versioning. This session will explore the motivations behind AIP-63, the challenges faced by Airflow users in understanding and managing DAG history, and how it aims to address them. From tracking TaskInstance history to improving DAG representation in the UI, we’ll examine what we’ve already done and what’s next. We’ll also touch upon the potential future steps outlined in AIP-66 regarding the execution of specific DAG versions.

AI Reality Checkpoint: The Good, the Bad, and the Overhyped

2024-07-01
session

In the past 18 months, artificial intelligence has not just entered our workspaces – it has taken over. As we stand at the crossroads of innovation and automation, it’s time for a candid reflection on how AI has reshaped our professional lives, and to talk about where it’s been a game changer, where it’s falling short, and what’s about to shift dramatically in the short term. Since the release of ChatGPT in December 2022, I’ve developed a “first-reflex” to augment and accelerate nearly every task with AI. As a founder and CEO, this spans a wide array of responsibilities from fundraising, internal communications, legal, operations, product marketing, finance, and beyond. In this keynote, I’ll cover diverse use cases across all areas of business, offering a comprehensive view of AI’s impact. Join me as I sort through this new reality and try to forecast the future of AI in our work. It’s time for a radical checkpoint. Everything’s changing fast. In some areas, AI has been a slam dunk; in others, it’s been frustrating as hell. And once a few key challenges are tackled, we’re on the cusp of a tsunami of transformation. Three major milestones are right around the corner: top-human-level reasoning, solid memory accumulation and recall, and proper executive skills. How is this going to affect all of us?

Airflow and Control-M: Where Data Pipelines Meet Business Applications in Production

2024-07-01
session

This talk is presented by BMC Software. With Airflow’s mainstream acceptance in the enterprise, the operational challenges of running alongside business applications in production have emerged. At last year’s Airflow Summit in Toronto, three providers of Apache Airflow met to discuss “The Future of Airflow: What Users Want”. Among the user requirements raised in that session were: an improved security model allowing “Alice” and “Bob” to run their individual DAGs without each requiring a separate Airflow cluster, while still adhering to their organization’s compliance requirements; and an “orchestrator of orchestrators” relationship in which Airflow oversees the myriad orchestrators embedded in many tools and provided by cloud vendors. That panel discussion described what Airflow users now understand to be mandatory for their workloads in enterprise production, and defined the exact operational requirements our customers have successfully tackled for decades. Join us in this session to learn how Control-M’s Airflow integration helps data engineers do what they need to do with Airflow and gives IT Ops the key to delivering enterprise business application results in production.

Airflow and multi-cluster Slurm working together

2024-07-01
session

Meteosim provides environmental services, mainly based on weather and air-quality intelligence, helping customers make operational and tactical decisions and understand their companies’ environmental impact. We introduced Airflow a couple of years ago to replace a huge crontab file, and we currently have around 7000 DAG runs per day. In this presentation we will introduce the hardest challenge we had to overcome: adapting Airflow to run on multiple Slurm-managed HPC clusters by using deferrable operators. Slurm is an open-source cluster manager used especially in science-based companies and organizations and in many supercomputers worldwide. By using Slurm, our simulations run on bare-metal nodes, eliminating overhead and speeding up the intensive calculations. Moreover, we will present our use case: how we use Airflow to provide our services and how we streamlined the DAG creation process, so our Product Engineers need to write only a few lines of code and all DAGs are standardized and stored in a database.
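A deferrable operator for Slurm ultimately comes down to a trigger that periodically polls the job’s state and decides whether to keep deferring. The classification step might look like the sketch below; this is an illustration of the pattern, not Meteosim’s code, though the state names are standard Slurm job states as reported by sacct.

```python
# Terminal Slurm job states, as reported by e.g. `sacct -j <id> -o State -n`
_SUCCESS = {"COMPLETED"}
_FAILURE = {"FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}

def classify_slurm_state(state: str) -> str:
    """Map a raw Slurm state string to what a trigger would report.

    Returns 'success', 'failure', or 'running' (keep deferring).
    Slurm may suffix states, e.g. 'CANCELLED by 1001', so only the
    first token is inspected.
    """
    token = state.strip().split()[0].rstrip("+")
    if token in _SUCCESS:
        return "success"
    if token in _FAILURE:
        return "failure"
    return "running"
```

An async trigger would run a check like this on an interval and fire an event only when the returned value is no longer "running", freeing the worker slot in the meantime.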

Airflow-as-an-Engine: Lessons from Open-Source Applications Built On Top of Airflow

2024-07-01
session

Airflow is often used for running data pipelines, which themselves connect with other services through the provider system. However, it is also increasingly used as an engine under-the-hood for other projects building on top of the DAG primitive. For example, Cosmos is a framework for automatically transforming dbt DAGs into Airflow DAGs, so that users can supplement the developer experience of dbt with the power of Airflow. This session dives into how a select group of these frameworks (Cosmos, Meltano, Chronon) use Airflow as an engine for orchestrating complex workflows their systems depend on. In particular, we will discuss ways that we’ve increased Airflow performance to meet application-specific demands (high-task-count Cosmos DAGs, streaming jobs in Chronon), new Airflow features that will evolve how these frameworks use Airflow under the hood (DAG versioning, dataset integrations), and paths we see these projects taking over the next few years as Airflow grows. Airflow is not just a DAG platform, it’s an application platform!
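To make the “DAG primitive as an engine” idea concrete, here is a minimal sketch of the kind of translation a framework like Cosmos performs: dbt’s manifest records each node’s parents, and the framework turns those into upstream/downstream task edges. This illustrates the mapping only; it is not Cosmos’s actual implementation.

```python
def edges_from_manifest(parent_map: dict[str, list[str]]) -> set[tuple[str, str]]:
    """Turn a dbt-manifest-style child->parents map into task edges.

    Each (upstream, downstream) pair corresponds to setting the parent
    model's task upstream of its child's task in the generated DAG.
    """
    return {(parent, child)
            for child, parents in parent_map.items()
            for parent in parents}
```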

Airflow as a workflow for Self Service Based Ingestion

2024-07-01
session

Our idea to platformize ingestion pipelines is driven by Airflow behind the scenes, streamlining the entire ingestion process for self-service. With customer experience layered on top, and with the goal of making data ingestion foolproof for the analytics data team, Airflow perfectly complements our vision.

Airflow at Burns & McDonnell | Orchestration from zero to 100

2024-07-01
session
Bonnie Why (Burns & McDonnell)

As the largest employee-owned engineering and construction firm in the United States, Burns & McDonnell has a massive amount of data. Not only that, it’s hard to pinpoint which source system has the data we need. Our solution to this challenge is to build a unified information platform — a single source of truth where all of our data is searchable, trustworthy, and accessible to our employee-owners and the projects that need it. Everyone’s data is important and everyone’s use case is a priority, so how can we get this done quickly? In this session, I will tell you all about how we went from having zero knowledge in Airflow to ingesting many unique and disconnected data sources into our data lakehouse in less than a day. Come hear the story about how our data team at Burns & McDonnell is using Airflow as an orchestrator to create a scalable, trustworthy data platform that will empower our system to evolve with the ever-changing technology landscape.

Airflow at Ford: A Job Router for Training Advanced Driver Assistance Systems

2024-07-01
session

Ford Motor Company operates extensively across various nations. The Data Operations (DataOps) team for Advanced Driver Assistance Systems (ADAS) at Ford is tasked with processing terabyte-scale daily data from lidar, radar, and video. To manage this, the DataOps team is challenged with orchestrating diverse, compute-intensive pipelines across both on-premises infrastructure and GCP, and with handling sensitive customer data across both environments. The team is also responsible for facilitating the execution of on-demand, compute-intensive algorithms at scale. To achieve these objectives, the team employs Astronomer/Airflow at the core of its strategic approach. This involves various deployments of Astronomer/Airflow that integrate seamlessly and securely (via Apigee) to initiate batch data processing and ML jobs on the cloud, as well as compute-intensive computer vision tasks on-premises, with essential alerting provided through the ELK stack. This presentation will delve into the architecture and strategic planning surrounding the hybrid batch router, highlighting its pivotal role in promoting rapid innovation and scalability in the development of ADAS features.

Airflow at NCR Voyix: Streamlining ML workflows development with Airflow

2024-07-01
session

The NCR Voyix Retail Analytics AI team offers ML products for retailers while embracing Airflow as its MLOps platform. As the team is small, with twice as many data scientists as engineers, we encountered several challenges in making Airflow accessible to the scientists: because they come from diverse programming backgrounds, we needed an architecture enabling them to develop production-ready ML workflows without prior knowledge of Airflow; due to dynamic product demands, we had to implement a mechanism to interchange Airflow operators effortlessly; and as workflows serve multiple customers, they must be easily configurable and simultaneously deployable. We arrived at the following architecture: our data scientists formulate ML workflows as structured Python files; the workflows are seamlessly converted into Airflow DAGs, with their steps aggregated for execution on different Airflow operators; and DAGs are deployed via the CI/CD UI to the DAGs folder for all customers, respecting each customer’s configuration files. In this session, we will cover Airflow’s evolution in our team and review the concepts of our architecture.
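The “structured Python files” approach can be illustrated with a tiny sketch: scientists declare steps and their dependencies as plain data, and the platform derives a valid execution order before emitting Airflow tasks. The spec shape below is hypothetical, not NCR Voyix’s actual format.

```python
from graphlib import TopologicalSorter

def execution_order(steps: dict[str, list[str]]) -> list[str]:
    """Order workflow steps so every dependency runs first.

    `steps` maps a step name to the names it depends on -- the kind of
    structured spec a platform layer could convert into an Airflow DAG,
    attaching whichever operator each step is configured to use.
    """
    return list(TopologicalSorter(steps).static_order())
```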

Airflow Datasets and Pub/Sub for Dynamic DAG Triggering

2024-07-01
session

Looking for a way to streamline your data workflows and master the art of orchestration? As we navigate the complexities of modern data engineering, dynamic workflows and complex data pipeline dependencies are becoming more and more common. To empower data engineers to use Airflow as the main orchestrator, Airflow Datasets can be easily integrated into your data journey. This session will showcase dynamic workflow orchestration in Airflow and how to manage multi-DAG dependencies with multi-dataset listening. We’ll take you through a real-time data pipeline with Pub/Sub messaging integration and dbt in a Google Cloud environment, ensuring data transformations are triggered only upon new data ingestion, moving away from rigid time-based scheduling or the use of sensors and other legacy ways to trigger a DAG.
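Multi-dataset listening reduces to a simple rule worth keeping in mind: a DAG scheduled on several Datasets runs only once every one of them has received an update since the DAG’s last run. A minimal sketch of that rule (plain Python, not Airflow’s internal scheduler code):

```python
def should_trigger(required: set[str], updated_since_last_run: set[str]) -> bool:
    """Core rule of multi-dataset scheduling.

    The DAG fires only when every required dataset URI has been
    updated since the DAG's previous run; a partial set of updates
    keeps it waiting.
    """
    return required <= updated_since_last_run
```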

Airflow - Path to Industry Orchestration Standard

2024-07-01
session

In the realms of data engineering, machine learning pipelines, and cloud and web services, there is huge demand for orchestration technologies. Apache Airflow is among the most popular orchestration technologies, and perhaps the most popular one. In this presentation we are going to focus on the aspects of Airflow that make it so popular and ask whether it has become the industry standard for orchestration.

Airflow, Spark, and LLMs: Turbocharging MLOps at ASAPP

2024-07-01
session

This talk will explore ASAPP’s use of Apache Airflow to streamline and optimize our machine learning operations (MLOps). Key highlights include: integrating with our custom Spark solution to achieve speedup, efficiency, and cost gains for generative AI transcription, summarization, and intent-categorization pipelines; different design patterns for integrating with efficient LLM servers such as TGI, vLLM, and TensorRT for summarization pipelines, with or without Spark; an overview of batched LLM inference using Airflow, as opposed to real-time inference outside of it; and [tentative] a possible extension of this scaffolding to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) for fine-tuning LLMs, using Airflow as the orchestrator. Additionally, the talk will cover ASAPP’s MLOps journey with Airflow over the past few years, including an overview of our cloud infrastructure, various data backends, and sources. The primary focus will be on the machine learning workflows at ASAPP, rather than the data workflows, providing a detailed look at how Airflow enhances our MLOps processes.
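Batched (offline) LLM inference, as opposed to real-time serving, mostly comes down to fanning fixed-size chunks of records out to an inference server from an orchestrated task. A minimal, hypothetical sketch of the chunking step:

```python
from typing import Iterator

def batches(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield fixed-size batches of records for offline inference.

    In a batched pipeline, an orchestrator task would send each batch
    to a TGI/vLLM-style server, rather than issuing one request per
    record in real time.
    """
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```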

Airflow UI Roadmap

2024-07-01
session

Soon we will finally switch to a 100% React UI, with a full separation between the API and the UI as well. While we are making such a big change, let’s also take the opportunity to imagine whole new interfaces rather than simply modernizing the existing views. How can we use design to help you better understand what is going on with your DAG? Come listen to some of our proposed ideas and bring your own big ideas, as the second half will be an open discussion.

Airflow Unleashed: Making Hundreds of Deployments A Day at Coinbase

2024-07-01
session

At Coinbase, Airflow is the backbone of ELT, supported by a vibrant community of over 500 developers. This vast engagement results in a continuous stream of enhancements, with hundreds of commits tested and released daily. However, this scale of development presents its own set of challenges, especially in deployment velocity. Traditional deployment methodologies proved inadequate, significantly impeding the productivity of our developers. Recognizing the critical need for a solution that matches our pace of innovation, we developed AirAgent: a bespoke, fully autonomous deployer designed specifically for Airflow. Capable of deploying updates hundreds of times a day on both staging and production environments, AirAgent has transformed our development lifecycle, enabling immediate iteration and drastically improving developer velocity. This talk aims to unveil the inner workings of AirAgent, highlighting its design principles, deployment strategies, and the challenges we overcame in its implementation. By sharing our journey, we hope to offer insights and strategies that can benefit others in the Airflow community, encouraging a shift towards a high-frequency deployment workflow.

A New DAG Paradigm: Less Airflow more DAGs

2024-07-01
session

Astronomer’s data team recently underwent a major shift in how we work with Airflow. We’ll dive deep into the challenges that prompted that change, how we addressed them, and where we are now. This re-architecture included: switching to dataset scheduling and micro-pipelines to minimize failures and increase reliability; implementing a Control DAG for complex dependency management and full end-to-end pipeline visibility; and standardized Task Groups for quick onboarding and scalability. With Airflow managing itself, we can once again focus on the data rather than the operational overhead. As proof, we’ll share our favorite statistics from the terabyte of data we process daily, revealing insights into how the world’s data teams use Airflow.

Architecting Blockchain ETL Orchestration: Circle's Airflow Use Case

2024-07-01
session

This talk focuses on the implementation of Apache Airflow for blockchain ETL orchestration and indexing, and the adoption of GitOps at Circle. It will cover CI/CD tips, architectural choices for managing blockchain data at scale, engineering practices to enable data scientists, and some learnings from production.

Automated Testing and Deployment of DAGs

2024-07-01
session

DAG integrity is critical, as are coding conventions and consistency in standards across the group. In this talk, we will share the lessons learned from testing and verifying our DAGs as part of our GitHub workflows: for testing as part of the pull-request process, and for automated deployment (eventually to production) once merged. We will dig into how we have unlocked additional efficiencies, how we catch errors before they get deployed, and how we are generally better off for having both Airflow and plenty of checks in our CI before we merge and deploy.
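The core of the pull-request integrity check is simple: if a DAG file cannot even be imported, the scheduler could not parse it either. Below is a generic sketch of that pattern; real suites typically also assert on Airflow’s DagBag.import_errors rather than raw imports alone.

```python
import importlib.util
import pathlib

def import_errors(dags_folder: str) -> dict[str, str]:
    """Import every .py file in a DAGs folder, collecting failures.

    Returns a mapping of file path -> error message; an empty dict
    means every file imported cleanly and the CI check should pass.
    """
    errors: dict[str, str] = {}
    for path in pathlib.Path(dags_folder).glob("*.py"):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception as exc:  # surface any parse/import failure
            errors[str(path)] = repr(exc)
    return errors
```

Wired into a pytest run on every pull request, `assert not import_errors("dags/")` blocks broken DAGs from ever reaching a deployment.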

Behaviour Driven Development in Airflow

2024-07-01
session

Behaviour Driven Development can, in the simplest of terms, be described as Test Driven Development, only readable. It is of course more than that, but that is not the aim of this talk. This talk aims to show: how to write tests before you write a single line of Airflow code; how to create reusable and readable steps for setting up tests, in a given-when-then manner; how to test rendering and execution of your DAG’s tasks; and real-world examples from a monorepo containing multiple Airflow projects, written only with pytest and some code I stole from smart people in github.com/apache/airflow/tests.
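In the spirit of the talk, given-when-then steps can be plain, reusable Python functions that read like sentences. The sketch below uses str.format as a stand-in for Airflow’s Jinja templating; the step names and paths are made up for illustration.

```python
def given_rendered_template(template: str, context: dict) -> str:
    """'Given' step: render a task's templated field, here with
    str.format standing in for Airflow's Jinja rendering."""
    return template.format(**context)

def then_path_is_partitioned_by_date(path: str, ds: str) -> bool:
    """'Then' step: check that the rendered path is date-partitioned."""
    return f"/dt={ds}/" in path

# When: a DAG author templates an output path with the logical date
rendered = given_rendered_template("s3://lake/events/dt={ds}/part.parquet",
                                   {"ds": "2024-07-01"})
assert then_path_is_partitioned_by_date(rendered, "2024-07-01")
```

Each step stays small and composable, so the same given/when/then vocabulary can be reused across every Airflow project in a monorepo.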

Boost Airflow Monitoring and Alerting with Automation Analytics & Intelligence by Broadcom

2024-07-01
session

This talk is presented by Broadcom. Airflow’s “workflow as code” approach has many benefits, including dynamic pipeline generation and flexibility and extensibility in a seamless development environment. However, what challenges do you face as you expand your Airflow footprint in your organization? What if you could enhance Airflow’s monitoring capabilities, forecast DAG and task executions, obtain predictive alerting, visualize trends, and get more robust logging? Broadcom’s Automation Analytics & Intelligence (AAI) offers advanced analytics for workload automation across cloud and on-premises environments. It connects easily with Airflow to offer improved visibility into dependencies between tasks in Airflow DAGs along with the workload’s critical path, dynamic SLA management, and more. Join our presentation to hear more about how AAI can help you improve service delivery. We will also lead a workshop that will allow you to dive deeper into how easy it is to install our Airflow Connector and get started visualizing your Airflow DAGs to optimize your workload and identify issues before they impact your business.