talk-data.com

Event

Airflow Summit 2022

2022-07-01 · Airflow Summit

Activities tracked

4

Airflow Summit 2022 program

Filtering by: Spark

Sessions & talks

Showing 1–4 of 4 · Newest first


Airflow at high scale for Autonomous Driving

2022-07-01
session

This talk highlights a large-scale use of Airflow to orchestrate workflows for an Autonomous Driving project based in Germany. To support our customer's aim of producing their first Level-3 Autonomous Driving vehicle in Germany, we are utilising Airflow as a state-of-the-art tool to orchestrate workloads running on a large-scale HPC platform. In this talk, we will describe our Airflow setup deployed on OpenShift, which is capable of running thousands of tasks in parallel and contains various custom improvements optimised for our use case. We will discuss in detail how we integrated multiple components with Airflow, such as a PostgreSQL database, a highly available RabbitMQ message broker, and a fully integrated IAM solution. In particular, we will describe the bottlenecks we encountered on our journey towards scaling up Airflow, and how we mitigated them. We will also detail the custom improvements we implemented in both the Airflow code base and the deployment setup, such as an enhanced Airflow logging framework, improvements to the Spark submit operator, and a feature to re-deploy Airflow without business downtime. The talk will conclude with an outlook on future improvements we anticipate, as well as features we believe will be beneficial to the community.
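For context on the operator mentioned in the abstract, here is a minimal sketch of the stock SparkSubmitOperator from the Apache Spark provider; the DAG id, application path, and Spark settings are hypothetical placeholders, and the speakers' custom improvements are not reflected here.

```python
# Baseline usage of the Spark submit operator from the Apache Spark provider.
# DAG id, application path, and Spark conf below are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2022, 7, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        application="/opt/jobs/process_frames.py",  # hypothetical Spark application
        conn_id="spark_default",                    # standard Spark connection id
        conf={"spark.executor.instances": "4"},
    )
```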

Airflow & Zeppelin: Better together

2022-07-01
session

Airflow is close to being the de-facto standard job orchestration tool for the production stage. But moving a job from the development stage in other tools to the production stage in Airflow is usually a big pain for many users. A major reason is the inconsistency between the development and production environments. Apache Zeppelin is a web-based notebook that integrates seamlessly with many popular big data engines, such as Spark, Flink, Hive, and Presto, which makes it very well suited to the development stage. In this talk, I will cover the seamless integration between Airflow and Zeppelin, so that you can develop your big data jobs efficiently in Zeppelin and move them to Airflow easily, without worrying too much about issues caused by environment inconsistency.
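As one illustration of how such an integration can be wired up, the sketch below triggers a Zeppelin note from an Airflow task via Zeppelin's notebook REST API; the server URL and note id are hypothetical, and the integration presented in the talk may use a dedicated operator or the Zeppelin SDK instead.

```python
# One possible wiring: run all paragraphs of a Zeppelin note from an Airflow
# task using Zeppelin's notebook REST API. URL and note id are hypothetical.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

ZEPPELIN_URL = "http://zeppelin:8080"  # assumed Zeppelin server address
NOTE_ID = "2H3XYZABC"                  # hypothetical note id


def run_zeppelin_note():
    # POST /api/notebook/job/{noteId} asks Zeppelin to run every paragraph
    # in the note; raise if the server reports an error.
    resp = requests.post(f"{ZEPPELIN_URL}/api/notebook/job/{NOTE_ID}", timeout=60)
    resp.raise_for_status()


with DAG(
    dag_id="zeppelin_note_example",
    start_date=datetime(2022, 7, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_note = PythonOperator(task_id="run_note", python_callable=run_zeppelin_note)
```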

Data Lineage with Apache Airflow and Apache Spark

2022-07-01
session

Data within today’s organizations has become increasingly distributed and heterogeneous. It can’t be contained within a single brain, a single team, or a single platform…but it still needs to be comprehensible, especially when something unexpected happens. Data lineage can help by tracing the relationships between datasets and providing a cohesive graph that places them in context. OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow and Apache Spark. In this session, Michael Collado from Datakin will show how to trace data lineage and capture useful operational metadata in Apache Spark and Airflow pipelines, and discuss how OpenLineage fits into the context of data pipeline operations and provides insight into the larger data ecosystem.
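For readers unfamiliar with the Spark side of the integration, the sketch below enables the OpenLineage Spark listener on a SparkSession, roughly as documented by the OpenLineage project around that time; the package version, lineage backend URL, and namespace are placeholders, and property names may differ across OpenLineage releases.

```python
# Enabling the OpenLineage Spark listener on a SparkSession. The package
# version, lineage backend URL, and namespace below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("openlineage_example")
    # Pull in the OpenLineage Spark integration (version is a placeholder).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.9.0")
    # Register the listener that emits lineage events for Spark jobs.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://marquez:5000")  # assumed lineage backend
    .config("spark.openlineage.namespace", "spark_jobs")      # hypothetical namespace
    .getOrCreate()
)
```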

TFX on Airflow with delegation of processing to third party services

2022-07-01
session

Tensorflow Extended (TFX) can run machine learning pipelines on Airflow, but by default all the steps run in the same workers where the Airflow DAG is running. This can lead to excessive resource usage and breaks the assumption that Airflow is a scheduler: it also becomes the data processing platform. In this session, we will see how to use TFX with third-party services on top of Google Cloud Platform. The data processing steps can run in Dataflow, Spark, Flink, and other runners (parallelizing data processing and scaling up to petabytes), and the training steps can run in Vertex or other external services. After this workshop, you will have learnt how to externalize any heavyweight TFX computation outside Airflow, while keeping Airflow as the orchestrator for your machine learning pipelines.
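A minimal sketch of the delegation pattern described above, loosely following the TFX Airflow examples: the Beam-based steps are pointed at Dataflow via beam_pipeline_args while Airflow remains the orchestrator. The project, region, bucket paths, and empty component list are placeholders, and exact APIs vary across TFX versions.

```python
# Sketch of a TFX pipeline whose Beam-based steps run on Dataflow while the
# resulting DAG is scheduled by Airflow. Project, region, bucket paths, and
# the empty component list are placeholders; APIs vary across TFX versions.
import datetime

from tfx.orchestration import pipeline
from tfx.orchestration.airflow.airflow_dag_runner import (
    AirflowDagRunner,
    AirflowPipelineConfig,
)

beam_args = [
    "--runner=DataflowRunner",           # delegate data processing to Dataflow
    "--project=my-gcp-project",          # hypothetical GCP project
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/tmp",
]

airflow_config = {
    "schedule_interval": None,
    "start_date": datetime.datetime(2022, 7, 1),
}

tfx_pipeline = pipeline.Pipeline(
    pipeline_name="tfx_on_airflow_example",
    pipeline_root="gs://my-bucket/pipeline_root",
    components=[],  # ExampleGen, Transform, Trainer, ... would go here
    beam_pipeline_args=beam_args,
)

# Expose the pipeline as an Airflow DAG; heavy Beam work runs on Dataflow.
DAG = AirflowDagRunner(AirflowPipelineConfig(airflow_config)).run(tfx_pipeline)
```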