talk-data.com talk-data.com

Topic

Flink

Apache Flink

stream_processing batch_processing big_data

2

tagged

Activity Trend

7 peak/qtr
2020-Q1 2026-Q1

Activities

Showing filtered results

Filtering by: Airflow Summit 2025 ×

At Yahoo, we built a secure, scalable, and cost-efficient batch processing platform using Amazon MWAA to orchestrate Apache Flink jobs on EKS, managed by the Flink Kubernetes Operator. This setup enables dynamic job orchestration while meeting strict enterprise compliance standards. In this session, we’ll share how Airflow DAGs: Dynamically launch, monitor, and clean up isolated Flink clusters per batch job, improving resource efficiency. Securely fetch EKS kubeconfig, submit FlinkDeployment CRDs using FlinkKubernetesOperator, and poll job status using Airflow sensors. Integrate IAM for access control and meet Yahoo’s security requirements, including mutual TLS (mTLS) with Athenz. Optimize for cost and resilience through automated cleanup of jobs and the operator, and handle job failures and retries. Join us for practical strategies and lessons from Yahoo’s production-scale Flink workflows in a Kubernetes environment.

OpenLineage has simplified collecting lineage metadata across the data ecosystem by standardizing its representation in an extensible model. It enabled a whole ecosystem improving data pipeline reliability and ease of troubleshooting in production environments. In this talk, we’ll briefly introduce the OpenLineage model and explore how this metadata is collected from Airflow, Spark, dbt, and Flink. We’ll demonstrate how to extract valuable insights and outline practical benefits and common challenges when building ingestion, processing and storage for OpenLineage data. We will also briefly show how OpenLineage events can be used to observe data pipelines exhastively and the benefits that brings.