At Yahoo, we built a secure, scalable, and cost-efficient batch processing platform using Amazon MWAA to orchestrate Apache Flink jobs on EKS, managed by the Flink Kubernetes Operator. This setup enables dynamic job orchestration while meeting strict enterprise compliance standards. In this session, we’ll share how Airflow DAGs:
- Dynamically launch, monitor, and clean up isolated Flink clusters per batch job, improving resource efficiency.
- Securely fetch EKS kubeconfig, submit FlinkDeployment CRDs using FlinkKubernetesOperator, and poll job status using Airflow sensors.
- Integrate IAM for access control and meet Yahoo’s security requirements, including mutual TLS (mTLS) with Athenz.
- Optimize for cost and resilience through automated cleanup of jobs and the operator, and handle job failures and retries.
Join us for practical strategies and lessons from Yahoo’s production-scale Flink workflows in a Kubernetes environment.
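The submit-and-poll flow described in this abstract can be sketched roughly as follows. This is a minimal illustration, not Yahoo's implementation: the helper names (`build_flink_deployment`, `classify_status`, `wait_for_job`), the image, the jar path, the service account, and the resource sizes are all hypothetical. In a real DAG the manifest would typically be handed to the Airflow Flink provider's operator and the status check would be done by a sensor; here both are reduced to plain Python so the logic can be read on its own.

```python
def build_flink_deployment(job_name, namespace, image, jar_uri, parallelism=2):
    """Build a FlinkDeployment manifest for one isolated, per-job cluster.

    All concrete values (service account, memory, Flink version) are
    placeholders chosen for illustration.
    """
    return {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {"name": job_name, "namespace": namespace},
        "spec": {
            "image": image,
            "flinkVersion": "v1_17",
            # Run the job in batch mode so it finishes instead of streaming forever.
            "flinkConfiguration": {"execution.runtime-mode": "BATCH"},
            "serviceAccount": "flink",  # assumed service account name
            "jobManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "taskManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "job": {
                "jarURI": jar_uri,
                "parallelism": parallelism,
                "upgradeMode": "stateless",
            },
        },
    }


def classify_status(resource):
    """Map the job state reported on a FlinkDeployment to a coarse outcome."""
    state = resource.get("status", {}).get("jobStatus", {}).get("state")
    if state == "FINISHED":
        return "success"
    if state in ("FAILED", "CANCELED"):
        return "failed"
    # No status yet, or any intermediate state, counts as still running.
    return "running"


def wait_for_job(fetch_resource, max_pokes=10):
    """Poll until the job reaches a terminal state or the poke budget runs out.

    `fetch_resource` is any callable returning the current resource as a dict.
    A real sensor would wait between pokes; that is omitted here.
    """
    for _ in range(max_pokes):
        outcome = classify_status(fetch_resource())
        if outcome != "running":
            return outcome
    return "timeout"
```

Passing the fetch step in as a callable keeps the polling logic independent of how the cluster is reached, which makes the failure and timeout paths easy to exercise before wiring in cleanup of the per-job cluster.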
Topic: Flink
Filtering by: Airflow Summit 2025
OpenLineage has simplified collecting lineage metadata across the data ecosystem by standardizing its representation in an extensible model. This has enabled a whole ecosystem of tools that improve data pipeline reliability and ease troubleshooting in production environments. In this talk, we’ll briefly introduce the OpenLineage model and explore how this metadata is collected from Airflow, Spark, dbt, and Flink. We’ll demonstrate how to extract valuable insights and outline practical benefits and common challenges when building ingestion, processing, and storage for OpenLineage data. We will also briefly show how OpenLineage events can be used to observe data pipelines exhaustively and the benefits that brings.
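To make the OpenLineage model mentioned in this abstract more concrete, here is a minimal sketch of a run event and of one simple insight that can be derived from a stream of such events (dataset-to-dataset lineage edges). The field layout follows the OpenLineage run event structure (event type, run, job, inputs, outputs); the helper names, the `example` namespace, the producer URL, and the spec version in `schemaURL` are assumptions for illustration and not taken from the talk.

```python
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_name, inputs, outputs,
                   namespace="example", run_id=None):
    """Build a minimal OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        # Placeholder producer; the spec version below is illustrative.
        "producer": "https://example.com/hypothetical-producer",
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


def lineage_edges(events):
    """Return sorted (input, job, output) edges from completed runs only."""
    edges = set()
    for event in events:
        if event["eventType"] != "COMPLETE":
            continue  # ignore runs that only started or that failed
        job = event["job"]["name"]
        for src in event["inputs"]:
            for dst in event["outputs"]:
                edges.add((src["name"], job, dst["name"]))
    return sorted(edges)
```

Because every producer (Airflow, Spark, dbt, Flink) emits the same shape, a consumer like `lineage_edges` does not need to know which system a given event came from.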