talk-data.com

Event

Airflow Summit 2024

2024-07-01 · Airflow Summit

Activities tracked

6

Airflow Summit 2024 program

Filtering by: GitHub

Sessions & talks

Showing 1–6 of 6 · Newest first


A Game of Constant Learning & Adjustment: Orchestrating ML Pipelines at the Philadelphia Phillies

2024-07-01
session

When developing Machine Learning (ML) models, the biggest challenges are often infrastructural. How do we deploy our model and expose an inference API? How can we retrain? Can we continuously evaluate performance and monitor model drift? In this talk, we will present how we are tackling these problems at the Philadelphia Phillies by developing a suite of tools that enables our software engineering and analytics teams to train, test, evaluate, and deploy ML models, all orchestrated entirely in Airflow. This framework abstracts away the infrastructural complexity of productionizing ML pipelines and lets our analysts focus on developing robust baseball research for baseball operations stakeholders across player evaluation, acquisition, and development. We'll also look at how we use Airflow, MLflow, MLServer, cloud services, and GitHub Actions to architect a platform that supports our framework at every point of the ML lifecycle.
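The retrain/evaluate/promote loop this abstract describes can be sketched without any of the real stack. This is a hypothetical illustration of the gating pattern only: the trivial mean "model", the holdout numbers, and the promotion threshold are all invented, and the actual framework uses Airflow, MLflow, and MLServer rather than plain functions.

```python
# Hypothetical sketch of a train -> evaluate -> promote gate, with the
# "model" reduced to a trivial mean predictor; every name and number
# here is illustrative, not the Phillies' actual framework.

def train(data):
    # "training" fits a constant mean predictor
    mean = sum(data) / len(data)
    return {"predict": lambda x: mean}

def evaluate(model, holdout):
    # mean absolute error on held-out points
    errors = [abs(model["predict"](x) - x) for x in holdout]
    return sum(errors) / len(errors)

def deploy_if_better(model, score, current_best):
    # promote only when the candidate beats the deployed score,
    # the same gate that continuous evaluation / drift monitoring needs
    return model if score < current_best else None

model = train([1, 2, 3, 4])
score = evaluate(model, [2, 3])
promoted = deploy_if_better(model, score, current_best=1.0)
```

In an orchestrated setup, each of these functions would be a separate task so that a failed evaluation blocks deployment.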

Automated Testing and Deployment of DAGs

2024-07-01
session

DAG integrity is critical, and so are coding conventions and consistent standards across the team. In this talk, we will share lessons learned from testing and verifying our DAGs as part of our GitHub workflows: for testing during the pull request process, and for automated deployment (eventually to production) once merged. We will dig into how we have unlocked additional efficiencies, how we catch errors before they get deployed, and how we are generally better off for having both Airflow and plenty of checks in our CI before we merge and deploy.
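A pre-merge convention check of the kind described here can be as simple as statically inspecting each DAG file. The sketch below is illustrative, not the speakers' actual workflow: the forbidden-import rule is a made-up example of a team convention, chosen so the check runs without Airflow installed.

```python
# Hypothetical CI-style check: verify a DAG file parses, and enforce an
# illustrative team convention (no top-level "import requests", which
# would execute network-capable code at DAG parse time).
import ast

FORBIDDEN_TOP_LEVEL = {"requests"}  # illustrative rule, not a real standard

def check_dag_source(source: str) -> list[str]:
    """Return a list of convention violations found in one DAG file."""
    problems = []
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc}"]
    for node in tree.body:  # only top-level statements matter here
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_TOP_LEVEL:
                    problems.append(f"top-level import of {alias.name}")
    return problems

good = "from datetime import datetime\n"
bad = "import requests\nimport os\n"
```

In a real pipeline this would run under pytest in the pull-request workflow, alongside an Airflow DagBag import check once Airflow is available in the CI image.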

Behaviour Driven Development in Airflow

2024-07-01
session

Behaviour Driven Development can, in the simplest of terms, be described as Test Driven Development, only readable. It is of course more than that, but that is beyond the aim of this talk. This talk aims to show:

- How to write tests before you write a single line of Airflow code
- How to create reusable and readable steps for setting up tests, in a given-when-then manner
- How to test rendering and execution of your DAG's tasks
- Real-world examples from a monorepo containing multiple Airflow projects

All written only with pytest, and some code I stole from smart people in github.com/apache/airflow/tests.
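The given-when-then style mentioned above can be sketched in plain pytest. This is a minimal stand-in, not the speaker's code: the "DAG" is a dict, and the step helpers are invented names, since the real steps render and execute actual Airflow tasks.

```python
# Minimal given-when-then sketch in plain pytest style; the "DAG" is a
# stand-in dict, and all helper names are invented for illustration.

def given_a_dag_with_tasks(*task_ids):
    return {"tasks": list(task_ids), "done": []}

def when_the_dag_runs(dag):
    dag["done"] = list(dag["tasks"])  # pretend every task executed
    return dag

def then_all_tasks_succeeded(dag):
    assert dag["done"] == dag["tasks"]

def test_extract_then_load():
    # the test body reads like the behaviour it specifies
    dag = given_a_dag_with_tasks("extract", "load")
    when_the_dag_runs(dag)
    then_all_tasks_succeeded(dag)
```

The point of the pattern is that the test function reads as a specification; the machinery stays hidden inside reusable step functions.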

DAGify - Enterprise Scheduler Migration Accelerator for Airflow

2024-07-01
session

DAGify is a highly extensible, template-driven enterprise scheduler migration accelerator that helps organizations speed up their migration to Apache Airflow. While DAGify does not claim to migrate 100% of existing scheduler functionality, it aims to heavily reduce the manual effort it takes for developers to convert their enterprise scheduler formats into Python-native Airflow DAGs. DAGify is an open-source tool under the Apache 2.0 license, available on GitHub (https://github.com/GoogleCloudPlatform/dagify). In this session we will introduce DAGify and its use cases, and demo its functionality by converting Control-M XML files to Airflow DAGs. Additionally, we will highlight DAGify's "no-code" extensibility by creating custom conversion templates that map Control-M functionality to Airflow operators.
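The template-driven conversion idea can be sketched in a few lines. This is not DAGify's implementation: the XML fragment is a heavily simplified, made-up Control-M-style job, and the single template is an invented mapping; real exports and DAGify's templates are far richer.

```python
# Illustrative sketch of template-driven scheduler conversion: map a
# simplified, made-up Control-M-style XML job to Airflow operator code
# via a per-job-type template. Not DAGify's actual code.
import xml.etree.ElementTree as ET

TEMPLATES = {
    # job type -> Airflow operator code template (invented mapping)
    "Command": '{name} = BashOperator(task_id="{name}", bash_command="{cmd}")',
}

def convert(xml_text: str) -> list[str]:
    """Emit one line of generated DAG code per recognized job."""
    root = ET.fromstring(xml_text)
    lines = []
    for job in root.iter("JOB"):
        tmpl = TEMPLATES[job.get("TASKTYPE")]
        lines.append(tmpl.format(name=job.get("JOBNAME"), cmd=job.get("CMDLINE")))
    return lines

sample = '<FOLDER><JOB JOBNAME="daily_load" TASKTYPE="Command" CMDLINE="run.sh"/></FOLDER>'
```

The "no-code" extensibility mentioned in the abstract corresponds to adding entries like the `TEMPLATES` one above as external template files rather than editing converter code.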

Integrating dbt with Airflow: Overcoming performance hurdles

2024-07-01
session

The integration between dbt and Airflow is a popular topic in the community, discussed at previous editions of Airflow Summit, at Coalesce, and in the #airflow-dbt Slack channel. Astronomer Cosmos (https://github.com/astronomer/astronomer-cosmos/) stands out as one of the libraries striving to enhance this integration, with over 300k downloads per month. During its development, we've encountered various performance challenges around scheduling and task execution. While we've managed to address some, others remain to be resolved. This talk describes how Cosmos works, the improvements made over the last 1.5 years, and the roadmap ahead. It also aims to collect feedback from the community on how we can further improve the experience of running dbt in Airflow.
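At the heart of dbt-Airflow integrations like Cosmos is reading dbt's dependency graph and turning each model into a task with matching edges. The sketch below shows that core idea on a hand-written manifest fragment; it follows the shape of dbt's manifest.json (`nodes` keyed by unique id, each with `depends_on.nodes`), but Cosmos's actual parsing and rendering are far more involved.

```python
# Sketch of the core idea behind dbt-Airflow integrations: derive task
# dependency edges from dbt's manifest. The manifest fragment is
# hand-written and simplified for illustration.

manifest = {
    "nodes": {
        "model.shop.stg_orders": {"depends_on": {"nodes": []}},
        "model.shop.orders": {
            "depends_on": {"nodes": ["model.shop.stg_orders"]}
        },
    }
}

def edges_from_manifest(manifest: dict) -> list[tuple[str, str]]:
    """Return (upstream, downstream) pairs between dbt models."""
    pairs = []
    for node_id, node in manifest["nodes"].items():
        for parent in node["depends_on"]["nodes"]:
            if parent.startswith("model."):  # ignore sources, seeds, etc.
                pairs.append((parent, node_id))
    return pairs
```

Each edge then becomes an Airflow task dependency, which is also where the scheduling-performance questions in this talk arise: large dbt projects produce very large task graphs.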

Streamline data science workflow development using Jupyter notebooks and Airflow

2024-07-01
session

Jupyter Notebooks are widely used by data scientists and engineers to prototype and experiment with data. However, these engineers are often required to work with other data or platform engineers to productionize these experiments, due to the complexity of navigating infrastructure and systems. In this talk, we will deep-dive into this PR (https://github.com/apache/airflow/pull/34840) and share how Airflow can be leveraged as a platform to execute notebook pipelines (Python, Scala, or Spark) in dynamic environments like Kubernetes for various heterogeneous use cases. We will demonstrate how data scientists can use a Jupyter extension to easily build and manage such pipelines, which are executed using Airflow, streamlining data science workflow development and supercharging productivity.
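One plausible way a Jupyter extension can map a notebook onto pipeline stages is via cell metadata: notebooks are JSON, and tags can mark which cells belong to which task. The sketch below is an assumption about the general approach, not the linked PR's design; the `task:` tag convention and the notebook fragment are invented for this example.

```python
# Illustrative sketch: group tagged notebook code cells into named
# pipeline tasks. The "task:" tag convention is invented here, not
# taken from the Airflow PR discussed in the talk.
import json

notebook_json = json.dumps({
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["task:extract"]},
         "source": ["df = load()"]},
        {"cell_type": "code", "metadata": {"tags": ["task:train"]},
         "source": ["model = fit(df)"]},
        {"cell_type": "markdown", "metadata": {}, "source": ["# notes"]},
    ]
})

def tasks_from_notebook(nb_text: str) -> dict[str, str]:
    """Map task name -> source code for each tagged code cell."""
    tasks = {}
    for cell in json.loads(nb_text)["cells"]:
        if cell["cell_type"] != "code":
            continue
        for tag in cell["metadata"].get("tags", []):
            if tag.startswith("task:"):
                tasks[tag.removeprefix("task:")] = "".join(cell["source"])
    return tasks
```

Each extracted task could then be submitted as a separate containerized step on Kubernetes, which is the execution environment the abstract describes.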