Airflow is the de-facto standard job orchestration tool in production. But moving a job from the development stage in other tools to the production stage in Airflow is a common pain point, largely because of inconsistencies between the development and production environments. Apache Zeppelin is a web-based notebook that integrates seamlessly with many popular big data engines, such as Spark, Flink, Hive, and Presto, which makes it well suited to the development stage. In this talk, I will cover the seamless integration between Airflow and Zeppelin, so that you can develop your big data job efficiently in Zeppelin and move it to Airflow easily, without worrying too much about issues caused by environment inconsistency.
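To make the hand-off concrete, here is a minimal sketch of one way to drive a Zeppelin note from an Airflow task via Zeppelin's notebook REST API; it is not the integration presented in the talk. The server URL and note ID are placeholders, and the exact shape of the job-status response can vary between Zeppelin versions.

```python
# Sketch: trigger a Zeppelin note from Airflow and wait for it to finish.
import time
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

ZEPPELIN_URL = "http://zeppelin:8080"  # assumed Zeppelin server address
NOTE_ID = "2ABCDEFGH"                  # hypothetical note ID


def run_zeppelin_note(**_):
    # Kick off all paragraphs of the note.
    resp = requests.post(f"{ZEPPELIN_URL}/api/notebook/job/{NOTE_ID}")
    resp.raise_for_status()
    # Poll paragraph statuses until every paragraph has finished.
    while True:
        body = requests.get(f"{ZEPPELIN_URL}/api/notebook/job/{NOTE_ID}").json()["body"]
        # Some Zeppelin versions return the paragraph list directly,
        # others wrap it in an object.
        paragraphs = body if isinstance(body, list) else body.get("paragraphs", [])
        if any(p.get("status") == "ERROR" for p in paragraphs):
            raise RuntimeError("Zeppelin note failed")
        if paragraphs and all(p.get("status") == "FINISHED" for p in paragraphs):
            return
        time.sleep(10)


with DAG(
    dag_id="zeppelin_note_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="run_note", python_callable=run_zeppelin_note)
```

With a setup like this, the notebook stays the single source of truth: the same code you iterated on in Zeppelin is what production Airflow runs, which is exactly the environment-consistency problem the talk addresses.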
TensorFlow Extended (TFX) can run machine learning pipelines on Airflow, but by default every step runs in the same workers where the Airflow DAG itself runs. This can consume excessive resources and breaks the assumption that Airflow is just a scheduler: it becomes the data processing platform as well. In this session, we will see how to use TFX with third-party services on top of Google Cloud Platform. The data processing steps can run in Dataflow, Spark, Flink, and other runners (parallelizing the processing and scaling up to petabytes of data), and the training steps can run in Vertex AI or other external services. After this workshop, you will know how to move any heavyweight TFX computation outside Airflow while keeping Airflow as the orchestrator for your machine learning pipelines.
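As a rough illustration of the pattern (not the presenters' exact code), the sketch below builds a small TFX pipeline whose Beam-based steps run on Dataflow while Airflow remains the orchestrator, using TFX's AirflowDagRunner. Project, region, and bucket names are placeholders, and TFX APIs differ slightly between versions.

```python
# Sketch: a TFX pipeline orchestrated by Airflow with data processing on Dataflow.
import datetime

from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration import pipeline
from tfx.orchestration.airflow.airflow_dag_runner import (
    AirflowDagRunner,
    AirflowPipelineConfig,
)

# Beam options that send ExampleGen/StatisticsGen work to Dataflow instead of
# the Airflow workers; a Spark or Flink runner can be substituted here.
_beam_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",       # placeholder GCP project
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
]

example_gen = CsvExampleGen(input_base="gs://my-bucket/data")
statistics_gen = StatisticsGen(examples=example_gen.outputs["examples"])

_pipeline = pipeline.Pipeline(
    pipeline_name="tfx_on_dataflow",
    pipeline_root="gs://my-bucket/pipeline_root",
    components=[example_gen, statistics_gen],
    beam_pipeline_args=_beam_args,
)

# Expose the pipeline as an Airflow DAG: Airflow schedules the steps, while
# the heavy data processing actually executes in Dataflow.
DAG = AirflowDagRunner(
    AirflowPipelineConfig({
        "schedule_interval": None,
        "start_date": datetime.datetime(2024, 1, 1),
    })
).run(_pipeline)
```

Training steps can be externalized the same way, for example by configuring the Trainer to use TFX's Google Cloud AI Platform extension so the training job runs on Vertex AI rather than on the Airflow workers.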