Apache Airflow has emerged as the de facto standard for data orchestration. Over the last couple of years, Airflow has also seen increasing adoption for ML and AI use cases. It has been almost four years since the release of Airflow 2, and as a community we have agreed that it’s time for a major foundational release in the form of Airflow 3. This talk will introduce the vision behind Airflow 3, including the emerging technology trends in the industry and how Airflow will evolve in response. Specifically, this will include an overview of the architectural changes in Airflow to support emerging use cases and distributed data infrastructure models. This talk will also introduce the major features and the desired outcomes of the release. Because Airflow 3 will be a foundational release, this talk will also cover new concepts that arrive with Airflow 3 but may be fully realized only in follow-on 3.x releases. The goal of this talk is to raise awareness about Airflow 3 and to get feedback from the Airflow community while the release is still in the development phase.
Apache Airflow relies on a silent symphony behind the scenes: its CI/CD (Continuous Integration/Continuous Delivery) and development tooling. This presentation explores the critical role these tools play in keeping Airflow efficient and innovative. We’ll delve into how robust CI/CD ensures bug fixes and improvements are seamlessly integrated, while well-maintained development tools empower developers to contribute effectively. Airflow’s power comes from a well-oiled machine – its CI/CD and development tools. This presentation dives into the world of these often-overlooked heroes. We’ll explore how seamless CI/CD pipelines catch and fix issues early, while robust development tools empower efficient coding and collaboration. Discover how you can use and contribute to a thriving Airflow ecosystem by ensuring these crucial tools stay in top shape.
Nowadays, conversational AI is no longer exclusive to large enterprises. It has become more accessible and affordable, opening up new possibilities and business opportunities. In this session, discover how you can leverage Generative AI as your AI pair programmer to suggest DAG code and recommend entire functions in real time, directly from your editor. Visualize how to harness the power of ML, trained on billions of lines of code, to transform natural language prompts into coding suggestions. Seamlessly cycle through lines of code, complete function suggestions, and choose to accept, reject, or edit them. Witness firsthand how Generative AI provides recommendations based on the project’s context and style conventions. The objective is to equip you with techniques that allow you to spend less time on boilerplate and repetitive code patterns, and more time on what truly matters: building exceptional orchestration software.
In the last few years, Large Language Models (LLMs) have risen to prominence as outstanding tools capable of transforming businesses. However, bringing such solutions and models into business-as-usual operations is not an easy task. In this session, we delve into the operationalization of generative AI applications using MLOps principles, leading to the introduction of foundation model operations (FMOps) or LLM operations using Apache Airflow. We further zoom into aspects of expected people and process mindsets, new techniques for model selection and evaluation, data privacy, and model deployment. Additionally, learn how you can use the prescriptive features of Apache Airflow to aid your operational journey. Whether you are building using out-of-the-box models (open-source or proprietary), creating new foundation models from scratch, or fine-tuning an existing model, with the structured approaches described you can effectively integrate LLMs into your operations, enhancing efficiency and productivity without causing disruptions in the cloud or on-premises.
Ford Motor Company is undergoing a significant transformation, embracing AI and Machine Learning to power its smart mobility strategy, enhance customer experiences, and drive innovation in the automotive industry. Mach1ML, Ford’s multi-million dollar ML platform, plays a crucial role in this journey by empowering data scientists and engineers to efficiently build, deploy, and manage ML models at scale. This presentation will delve into how Mach1ML leverages Apache Airflow as its orchestration layer to tackle the challenges of complex ML workflows that include disparate systems, manual processes, security concerns, and deployment complexities. We will explore the benefits of using Airflow, such as increased efficiency, improved reliability, enhanced scalability, and faster time-to-value. Additionally, we will showcase how Mach1ML utilizes Airflow capabilities to generate reusable templates and streamline environment promotions to further empower Ford’s AI practitioners and accelerate the delivery of cutting-edge AI-powered solutions supporting the next generation of vehicles.
While Airflow is widely known for orchestrating and managing workflows, particularly in the context of data engineering, data science, ML (Machine Learning), and ETL (Extract, Transform, Load) processes, its flexibility and extensibility make it a highly versatile tool suitable for a variety of use cases beyond these domains. In fact, Cloudflare has publicly shared in the past an example of how Airflow was leveraged to build a system that automates datacenter expansions. In this talk, I will share a few more of our use cases beyond traditional data engineering, demonstrating Airflow’s sophisticated capabilities for orchestrating a wide variety of complex workflows, and discussing how Airflow played a crucial role in building some of the highly successful autonomous systems at Cloudflare, from handling automated bare metal server diagnostics and recovery at scale, to Zero Touch Provisioning that is helping us accelerate the rollout of inference-optimized GPUs in 150+ cities in multiple countries globally.
Cost management is a continuous challenge for our data teams at Astronomer. Understanding the expenses associated with running our workflows is not always straightforward, and identifying which process ran a query causing unexpected usage on a given day can be time-consuming. In this talk, we will showcase an Airflow Plugin and specific DAGs developed and used internally at Astronomer to track and optimize the costs of running DAGs. Our internal tool monitors Snowflake query costs, provides insights, and sends alerts for abnormal usage. With it, Astronomer identified and refactored its most costly DAGs, resulting in an almost 25% reduction in Snowflake spending. We will demonstrate how to track Snowflake-related DAG costs and discuss how the tool can be adapted to any database supporting query tagging like BigQuery, Oracle, and more. This talk will cover the implementation details and show how Airflow users can effectively adopt this tool to monitor and manage their DAG costs.
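The query-tagging idea in the abstract above can be sketched in a few lines. This is a minimal illustration and not the Astronomer plugin itself: the helper names `build_query_tag` and `tag_session_sql` and the choice of tag fields are assumptions made for the example. The only Snowflake features relied on are the `QUERY_TAG` session parameter and its documented 2000-character limit.

```python
import json

# Snowflake documents a 2000-character limit for the QUERY_TAG parameter.
MAX_QUERY_TAG_LENGTH = 2000


def build_query_tag(dag_id: str, task_id: str, run_id: str) -> str:
    """Serialize Airflow identifiers into a compact JSON tag (hypothetical helper)."""
    tag = json.dumps(
        {"dag_id": dag_id, "task_id": task_id, "run_id": run_id},
        separators=(",", ":"),
        sort_keys=True,
    )
    if len(tag) > MAX_QUERY_TAG_LENGTH:
        raise ValueError("query tag exceeds the 2000-character limit")
    return tag


def tag_session_sql(tag: str) -> str:
    """Return an ALTER SESSION statement that applies the tag (hypothetical helper)."""
    # Escape backslashes first, then single quotes, so the tag survives
    # as a SQL string literal.
    escaped = tag.replace("\\", "\\\\").replace("'", "''")
    return f"ALTER SESSION SET QUERY_TAG = '{escaped}'"


if __name__ == "__main__":
    tag = build_query_tag("daily_sales", "load", "manual__2024-01-01")
    print(tag_session_sql(tag))
```

If every task runs such a statement before its queries, the tag should appear alongside each query in Snowflake's query history, where it can be parsed and grouped by `dag_id` to attribute cost per DAG.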
Many organizations struggle to create a well-orchestrated AI infrastructure, using separate and disconnected platforms for data processing, model training, and inference, which slows down development and increases costs. There’s a clear need for a unified system that can handle all aspects of AI development and deployment, regardless of the size of data or models. Join our breakout session to see how our comprehensive solution simplifies the development and deployment of large language models in production. Learn how to streamline your AI operations by implementing an end-to-end ML lifecycle on your custom data, including automated LLM fine-tuning, LLM evaluation, LLM serving, and LoRA deployments.
Cloud availability zones and regions are not immune to outages. These zones regularly go down, and regions become unavailable due to natural disasters or human-caused incidents. Thus, if an availability zone or region goes down, so do your Airflow workflows and applications… unless your Airflow workflows function across multiple geographic locations. This hands-on session introduces you to the design patterns of multi-region Airflow workflows in the cloud, which can tolerate zone and region-level incidents. We will start with a traditional single-region configuration and then switch to a multi-region setting. By the end, we’ll have a working prototype of a multi-region Airflow pipeline that recovers from region-level outages within a few seconds, with no data loss or disruption to the application layer.
Airflow executes all tasks on the workers, including deferrable operators that must run on the workers before deferring to the triggerer. However, running some tasks directly from the triggerer can be beneficial in certain situations. This presentation will explain how deferrable operators function and examine ways to modify the Airflow implementation to enable tasks to run directly from the triggerer.
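The worker-then-triggerer handoff described above can be illustrated with a toy simulation. This is not Airflow code: `Deferral`, `WaitThenReportOperator`, `run_trigger`, and `run_task` are hypothetical names invented for the sketch. In Airflow itself, an operator calls `self.defer(trigger=..., method_name=...)`, which raises an exception that the task runner catches. The simulation only mimics that control flow.

```python
import asyncio


class Deferral(Exception):
    """Hypothetical signal: the task asks to be suspended until a trigger fires."""

    def __init__(self, trigger_delay, resume_method):
        super().__init__(resume_method)
        self.trigger_delay = trigger_delay
        self.resume_method = resume_method


class WaitThenReportOperator:
    """Hypothetical operator with the two-phase life of a deferrable task."""

    def execute(self):
        # Phase 1 (worker): do the quick setup, then give up the worker slot.
        raise Deferral(trigger_delay=0.01, resume_method="execute_complete")

    def execute_complete(self, event):
        # Phase 3 (worker again): finish using the event sent by the trigger.
        return f"resumed with {event}"


async def run_trigger(delay):
    # Phase 2 (triggerer): wait asynchronously without holding a worker slot.
    await asyncio.sleep(delay)
    return {"status": "done"}


def run_task(operator, log):
    """Drive the operator through all three phases, recording where each runs."""
    try:
        log.append("worker: execute")
        return operator.execute()
    except Deferral as deferral:
        log.append("triggerer: waiting")
        event = asyncio.run(run_trigger(deferral.trigger_delay))
        log.append("worker: " + deferral.resume_method)
        return getattr(operator, deferral.resume_method)(event)


if __name__ == "__main__":
    steps = []
    print(run_task(WaitThenReportOperator(), steps))
    print(steps)
```

The first log entry is the step the talk proposes to skip: the task occupies a worker only to raise the deferral, which is why starting directly in the triggerer can be attractive.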
There are 3 certainties in life: death, taxes, and data pipelines failing. Pipelines may fail for a number of reasons: you may run out of memory, your credentials may expire, an upstream data source may not be reliable, etc. But there are patterns we can learn from! Join us as we walk through an analysis we’ve done on a massive dataset of Airflow failure logs. We’ll show how we used natural language processing and dimensionality reduction methods to explore the latent space of Airflow task failures in order to cluster, visualize, and understand failures. We’ll conclude the talk by walking through mitigation methods for common task failure reasons, and walk through how we can use Airflow to build an MLOps platform to turn this one-time analysis into a reliable, recurring activity.
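To make the idea of grouping failure messages concrete, here is a deliberately simple stand-in. It is not the speakers' method: instead of NLP embeddings and dimensionality reduction, it uses bag-of-words vectors, cosine similarity, and greedy clustering, all from the standard library. The function names, the sample messages, and the 0.5 threshold are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(message):
    """Lowercase the message and mask digits so numeric IDs do not split clusters."""
    masked = re.sub(r"\d+", "<num>", message.lower())
    return re.findall(r"<num>|[a-z_]+", masked)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(messages, threshold=0.5):
    """Greedily assign each message to the first cluster it resembles."""
    clusters = []  # pairs of (representative vector, member messages)
    for message in messages:
        vector = Counter(tokenize(message))
        for representative, members in clusters:
            if cosine(vector, representative) >= threshold:
                members.append(message)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append((vector, [message]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    logs = [
        "MemoryError: out of memory on worker 3",
        "Credentials expired for connection snowflake_default",
        "MemoryError: out of memory on worker 12",
        "Credentials expired for connection bigquery_default",
    ]
    for group in cluster(logs):
        print(group)
```

On real logs, a learned text representation followed by dimensionality reduction, as the abstract describes, would capture similarity that exact token overlap misses.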
Dive into the winning playbook of the 2023 World Series Champions Texas Rangers, and discover how they leverage Apache Airflow to streamline their data pipelines. In this session, we’ll explore how real-world data pipelines enable agile decision-making and drive competitive advantage in the high-stakes world of professional baseball, all by using Airflow as an orchestration platform. Whether you’re a seasoned data engineer or just starting out, this session promises actionable strategies to elevate your data orchestration game to championship levels.