talk-data.com

Event

Airflow Summit 2021

2021-07-01 · Airflow Summit

Activities tracked

8

Airflow Summit 2021 program

Filtering by: AI/ML

Sessions & talks

Showing 1–8 of 8 · Newest first


Airflow Extensions for Streamlined ETL Backfilling

2021-07-01
session

Using Airflow as our scheduling framework, we ETL data generated by tens of millions of transactions every day to build the backbone for our reports, dashboards, and training data for our machine learning models. There are over 500 (and growing) such ingested and aggregated tables, owned by multiple teams, with intricate dependencies between one another. Given this level of complexity, it can become extremely cumbersome to coordinate backfills for any given table while also taking into account all its downstream dependencies, aggregation intervals, and data availability. This talk will focus on how we customized and extended Airflow at Adyen to streamline our backfilling operations, which allows us to prevent mistakes and enables our product teams to keep launching and iterating fast.
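
The Adyen extensions themselves are not shown in the abstract, but the baseline mechanics they build on are standard Airflow: a DAG that processes one logical date per run can be replayed for any past interval. A minimal sketch under that assumption, with a hypothetical daily table-build job (DAG id, command, and dates are placeholders):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal sketch of a backfill-friendly daily DAG (plain Airflow, not
# Adyen's custom extensions). catchup=True lets Airflow create runs for
# every past interval since start_date, which is what a backfill replays.
with DAG(
    dag_id="daily_table_build",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=True,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    build_table = BashOperator(
        task_id="build_table",
        # {{ ds }} is the run's logical date, so each backfilled run
        # rebuilds exactly one day's partition.
        bash_command="spark-submit build_table.py --date {{ ds }}",  # hypothetical job
    )
```

A specific window can then be replayed from the CLI, e.g. `airflow dags backfill -s 2021-01-01 -e 2021-01-31 daily_table_build` in Airflow 2.x; the extensions described in the talk layer dependency-aware coordination on top of this.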

Apache Airflow and Ray: Orchestrating ML at Scale

2021-07-01
session

As the Apache Airflow project grows, we seek both ways to incorporate rising technologies and novel ways to expose them to our users. Ray is one of the fastest-growing distributed computation systems on the market today. In this talk, we will introduce the Ray decorator and Ray backend. These features, built with the help of the Ray maintainers at Anyscale, will allow data scientists to natively integrate their distributed pandas, XGBoost, and TensorFlow jobs into their Airflow pipelines with a single decorator. By merging Airflow's orchestration with Ray's distributed computation, this combination opens up a whole host of new possibilities for Airflow users designing their pipelines.
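
The decorator and backend presented in the talk live in a dedicated Ray provider; as a rough illustration of the idea only, here is a hedged sketch that pairs Airflow's TaskFlow API with a plain `@ray.remote` function (the cluster address, DAG id, and task bodies are placeholder assumptions, not the provider's API):

```python
from datetime import datetime

import ray
from airflow.decorators import dag, task


@ray.remote
def train_model(data):
    # Stand-in for a distributed XGBoost/TensorFlow training job.
    return {"rows_seen": len(data)}


@dag(schedule_interval=None, start_date=datetime(2021, 7, 1), catchup=False)
def ray_example():
    @task
    def extract():
        return list(range(1_000))

    @task
    def train_on_ray(data):
        # Connect to an existing Ray cluster; "auto" assumes one is running.
        ray.init(address="auto", ignore_reinit_error=True)
        return ray.get(train_model.remote(data))

    train_on_ray(extract())


dag_instance = ray_example()
```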

Apache Airflow at Wise

2021-07-01
session

Wise (previously TransferWise) is a London-based fintech company. We build a better way of sending money internationally. At Wise we make great use of Airflow: more than 100 data scientists, analysts and engineers use Airflow every day to generate reports, prepare data, (re)train machine learning models and monitor services. My name is Alexandra, and I’m a Machine Learning Engineer at Wise. Our team is responsible for building and maintaining Wise’s Airflow instances. In this presentation I would like to talk about three main things: our current setup, our challenges, and our future plans with Airflow. We are currently transitioning from a single centralised Airflow instance to many segregated instances to increase reliability and limit access. We’ve learned a lot throughout this journey and are looking to share these learnings with a wider audience.

Building the AirflowEventStream

2021-07-01
session
Jelle Munk (Adyen)

Or: how to keep our traditional Java application up to date on everything big data. At Adyen we process tens of millions of transactions a day, a number that keeps rising. This means that generating reports, training machine learning models, or running any other operation that requires a bird’s-eye view over weeks or months of data requires the use of big data technologies. We recently migrated to Airflow for scheduling all batch operations on our on-premise big data cluster. Some of these operations require input from our merchants or our support team. Merchants can, for instance, subscribe to reports, choose their preferred time zone, and even specify which columns they want included. Once generated, these reports then need to become available in our customer portal. So how do we keep track in our Customer Area of which reports have been generated in Airflow? How do we launch ad-hoc backfills when one of our merchants subscribes to a new report? How do we integrate all of this into our existing monitoring pipeline? This talk will focus on how we have successfully integrated our big data platform with our existing Java web applications and how Airflow (with some simple add-ons) played a crucial role in achieving this.
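
The event-stream plumbing described here is custom to Adyen and the consuming application is Java, but one standard integration point such a setup relies on is launching DAG runs programmatically. A hedged Python sketch against Airflow 2's stable REST API (host, credentials, and DAG id are placeholders; the actual authentication backend depends on the deployment):

```python
import requests

# Hedged sketch: triggering an ad-hoc DAG run through Airflow's stable
# REST API (Airflow 2.x). Host, credentials, and dag_id are placeholders.
AIRFLOW_API = "http://airflow.internal:8080/api/v1"  # placeholder host


def trigger_report_backfill(dag_id: str, merchant: str, report_date: str) -> str:
    """Ask Airflow to generate one report run for a newly subscribed merchant."""
    resp = requests.post(
        f"{AIRFLOW_API}/dags/{dag_id}/dagRuns",
        json={"conf": {"merchant": merchant, "report_date": report_date}},
        auth=("svc_user", "svc_password"),  # placeholder credentials
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["dag_run_id"]
```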

Clearing Airflow obstructions

2021-07-01
session

Apache Airflow aims to speed up the development of workflows, but developers are always ready to add bugs here and there. This talk illustrates a few pitfalls faced while developing workflows at the BBC to build machine learning models. The objective is to share some lessons learned and, hopefully, save others time. Some of the topics covered, with code examples:

- Tasks unsuitable to be run from within Airflow executors
- Plugin misusage
- Inconsistency while using an operator
- (Mis)configuration
- What to avoid during a workflow deployment
- Consequences of non-idempotent tasks
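
The talk's own code examples are not reproduced in this listing; as a hedged illustration of the last pitfall only, the sketch below contrasts a non-idempotent load with an idempotent rewrite (table names and the `run_sql` helper are hypothetical):

```python
from airflow.decorators import task


def run_sql(statement: str) -> None:
    """Hypothetical stand-in for a real database call (e.g. via a DB hook)."""
    print(statement)


@task
def load_metrics_naive(ds=None):
    # Non-idempotent: every retry, clear, or backfill appends the same
    # rows again and silently duplicates data. (Airflow injects `ds`,
    # the logical date, into TaskFlow tasks that declare it.)
    run_sql(f"INSERT INTO metrics SELECT * FROM staging WHERE day = '{ds}'")


@task
def load_metrics_idempotent(ds=None):
    # Idempotent: delete-then-insert on the run's own partition, so any
    # number of re-runs converges to the same final state.
    run_sql(f"DELETE FROM metrics WHERE day = '{ds}'")
    run_sql(f"INSERT INTO metrics SELECT * FROM staging WHERE day = '{ds}'")
```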

Creating Data Pipelines with Elyra, a visual DAG composer and Apache Airflow

2021-07-01
session

This presentation will detail how Elyra creates Jupyter notebook, Python, and R script-based pipelines without having to leave your web browser. The goal of Elyra is to help construct data pipelines by surfacing concepts and patterns common in pipeline construction into a familiar, easy-to-navigate interface for data scientists and engineers so they can create pipelines on their own. In Elyra’s Pipeline Editor UI, portions of Apache Airflow’s domain language are surfaced to the user and made either transparent or understandable through tooltips and helpful notes in the proper context during pipeline construction. With these features, users can rapidly prototype data workflows in Elyra without needing to know or write any pipeline code. Lastly, we will look at the features we have planned on our roadmap for Airflow, including more robust Kubernetes integration and support for runtime-specific components/operators. Project home: https://github.com/elyra-ai/elyra

Guaranteeing pipeline SLAs and data quality standards with Databand

2021-07-01
session
Josh Benamram (Databand.ai), Vinoo Ganesh (Veraset)

We’ve all heard the phrase “data is the new oil.” But imagine a world where this analogy were more literal, where problems in the flow of data (delays, low quality, high volatility) could bring down whole economies. When data is the new oil, with people and businesses similarly reliant on it, how do you avoid the fires, spills, and crises? As data products become central to companies’ bottom lines, data engineering teams need to set higher standards for the availability, completeness, and fidelity of their data. In this session we’ll demonstrate how Databand helps organizations guarantee the health of their Airflow pipelines. Databand is a data pipeline observability system that monitors SLAs and data quality issues and proactively alerts users to problems in order to avoid data downtime. The session will be led by Josh Benamram, CEO and co-founder of Databand.ai. Josh will be joined by Vinoo Ganesh, an experienced software engineer, system architect, and current CTO of Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective. Join to see how Databand.ai can help you create stable, reliable pipelines that your business can depend on!
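
Databand's own alerting pipeline is not shown here; for context, one hook observability tools like this build on is vanilla Airflow's per-task SLA mechanism. A minimal, hedged sketch (DAG id, command, and the callback body are placeholders):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Placeholder alert hook; a real setup would page or post to a channel.
    print(f"SLA missed in {dag.dag_id}: {slas}")


with DAG(
    dag_id="merchant_reports",  # hypothetical DAG
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    sla_miss_callback=notify_sla_miss,
) as dag:
    build_report = BashOperator(
        task_id="build_report",
        bash_command="python build_report.py",  # placeholder command
        # Flag the task if it has not finished within 2 hours of its
        # scheduled run.
        sla=timedelta(hours=2),
    )
```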

Productionizing ML Pipelines with Airflow, Kedro, and Great Expectations

2021-07-01
session

Machine learning models can add value and insight to many projects, but they can be challenging to put into production due to problems like lack of reproducibility, difficulty maintaining integrations, and sneaky data quality issues. Kedro, a framework for creating reproducible, maintainable, and modular data science code, and Great Expectations, a framework for data validation, are two great open-source Python tools that can address some of these problems. Both integrate seamlessly with Airflow for flexible and powerful ML pipeline orchestration. In this talk we’ll discuss how you can leverage existing Airflow provider packages to integrate these tools and create sustainable, production-ready ML models.
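
The exact provider wiring differs between versions and is not reproduced here; as a rough sketch of the Kedro side only, a pipeline is declared as plain Python nodes, which the kedro-airflow plugin can then translate into an Airflow DAG (function bodies and dataset names below are hypothetical):

```python
import pandas as pd
from kedro.pipeline import Pipeline, node

# Hedged sketch of a small Kedro pipeline; dataset names are placeholders
# and would normally be defined in the project's Data Catalog.


def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop obviously bad rows before training."""
    return raw.dropna(subset=["amount", "merchant_id"])


def train_model(clean: pd.DataFrame) -> dict:
    """Stand-in for a real training step."""
    return {"n_rows": len(clean), "mean_amount": float(clean["amount"].mean())}


def create_pipeline() -> Pipeline:
    return Pipeline(
        [
            node(clean_transactions, inputs="raw_transactions", outputs="clean_transactions"),
            node(train_model, inputs="clean_transactions", outputs="model_metrics"),
        ]
    )
```

Once converted to an Airflow DAG, validation steps (for example via the Great Expectations provider) can be slotted in between such nodes to catch data quality issues before training.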