Kubernetes

Benchmarking the Performance of Dynamically Generated DAGs

2025-07-01 · Airflow Summit 2025

session

by Rahul Vats , Tatiana Al-Chueyr Martins (Astronomer)

Airflow Astronomer Cosmos GitHub

As teams scale their Airflow workflows, a common question is: “My DAG has 5,000 tasks—how long will it take to run in Airflow?” Beyond execution time, users often face challenges with dynamically generated DAGs, such as: Delayed visualization in the Airflow UI after deployment. High resource consumption, leading to Kubernetes pod evictions and out-of-memory errors. While estimating the resource utilization in a distributed data platform is complex, benchmarking can provide crucial insights. In this talk, we’ll share our approach to benchmarking dynamically generated DAGs with Astronomer Cosmos ( https://github.com/astronomer/astronomer-cosmos) , covering: Designing representative and extensible baseline tests. Setting up an isolated, distributed infrastructure for benchmarking. Running reproducible performance tests. Measuring DAG run times and task throughput. Evaluating CPU & memory consumption to optimize deployments. By the end of this session, you will have practical benchmarks and strategies for making informed decisions about evaluating the performance of DAGs in Airflow.

Building an MLOps Platform for 300+ ML/DS Specialists on Top of Airflow

2025-07-01 · Airflow Summit 2025

session

by Aleksandr Shirokov , Roman Khomenko , Tarasov Alexey

AI/ML Airflow Data Science GitLab Hadoop MLOps S3 Spark

As your organization scales to 20+ data science teams and 300+ DS/ML/DE engineers, you face a critical challenge: how to build a secure, reliable, and scalable orchestration layer that supports both fast experimentation and stable production workflows. We chose Airflow — and didn’t regret it! But to make it truly work at our scale, we had to rethink its architecture from the ground up. In this talk, we’ll share how we turned Airflow into a powerful MLOps platform through its core capability: running pipelines across multiple K8s GPU clusters from a single UI (!) using per-cluster worker pools. To support ease of use, we developed MLTool — our own library for fast and standardized DAG development, integrated Vault for secure secret management across teams, enabled real-time logging with S3 persistence and built a custom SparkSubmitOperator for Kerberos-authenticated Spark/Hadoop jobs in Kubernetes. We also streamlined the developer experience — users can generate a GitLab repo and deploy a versioned pipeline to prod in under 10 minutes! We’re proud of what we’ve built — and our users are too. Now we want to share it with the world!

Designing Scalable Retrieval-Augmented Generation (RAG) Pipelines at SAP with Apache Airflow

2025-07-01 · Airflow Summit 2025

session

by Sagar Sharma

AI/ML Airflow API GenAI LLM RAG SAP

At SAP Business AI, we’ve transformed Retrieval-Augmented Generation (RAG) pipelines into enterprise-grade powerhouses using Apache Airflow. Our Generative AI Foundations Team developed a cutting-edge system that effectively grounds Large Language Models (LLMs) with rich SAP enterprise data. Powering Joule for Consultants, our innovative AI copilot, this pipeline manages the seamless ingestion, sophisticated metadata enrichment, and efficient lifecycle management of over a million structured and unstructured documents. By leveraging Airflow’s Dynamic DAGs, TaskFlow API, XCom, and Kubernetes Event-Driven Autoscaling (KEDA), we achieved unprecedented scalability and flexibility. Join our session to discover actionable insights, innovative scaling strategies, and a forward-looking vision for Pipeline-as-a-Service, empowering seamless integration of customer-generated content into scalable AI workflows

Do you trust Airflow with your money? (We do!)

2025-07-01 · Airflow Summit 2025

session

by Sabrina Liu

Airflow

Airflow is wonderfully, frustratingly complex - and so is global finance! Stripe has very specific needs all over the planet, and we have customized Airflow to adapt to the variety and rigor that we need to grow the GDP of the internet. In this talk, you’ll learn: How we support independent DAG change management for over 500 different teams running over 150k tasks. How we’ve customized Airflow’s Kubernetes integration to comply with Stripe’s unique compliance requirements. How we’ve built on Airflow to support no-code data pipelines.

From Complexity to Simplicity with TaskHarbor: Trendyol's Path to a Unified Orchestration Platform

2025-07-01 · Airflow Summit 2025

session

by Salih Goktug Kose , Burak Özdemir

Airflow Cloud Computing GCP Terraform YAML

At Trendyol, Turkey’s leading e-commerce company, Apache Airflow powers our task orchestration, handling DAGs with 500+ tasks, complex interdependencies, and diverse environments. Managing on-prem Airflow instances posed challenges in scalability, maintenance, and deployment. To address these, we built TaskHarbor, a fully managed orchestration platform with a hybrid architecture—combining Airflow on GKE with on-prem resources for optimal performance and efficiency. This talk covers how we: Enabled seamless DAG synchronization across environments using GCS Fuse. Optimized workload distribution via GCP’s HTTPS & TCP Load Balancers. Automated infrastructure provisioning (GKE, CloudSQL, Kubernetes) using Terraform. Simplified Airflow deployments by replacing Helm YAML files with a custom templating tool, reducing configurations to 10-15 lines. Built a fully automated deployment pipeline, ensuring zero developer intervention. We enhanced efficiency, reliability, and automation in hybrid orchestration by embracing a scalable, maintainable, and cloud-native strategy. Attendees will obtain practical insights into architecting Airflow at scale and optimizing deployments.

Lessons learned for scaling up Airflow 3 in Public Cloud

2025-07-01 · Airflow Summit 2025

session

by Augusto Hidalgo , Przemek Więch

Airflow Cloud Computing GCP Cloud Composer postgresql

Apache Airflow 3 is a new state-of-the-art version of Airflow. For many users who plan to adopt Airflow 3 it’s important to understand how Airflow 3 behaves from performance perspective compared to Airflow 2. This presentation is going to present performance results for various Airflow 3 configurations and provides potential Airflow 3 adopters good understanding of its performance. The reference Airflow 3 configuration will be using Kubernetes cluster as a compute layer, PostgreSQL as Airflow Database and would be performed on Google Cloud Platform. Performance tests will be performed using community version of performance tests framework and there might be references to Cloud Composer (managed service for Apache Airflow). The tests will be done in production-grade configurations that might be good references for Airflow community users. Users will be provided with comparison of Airflow 3 and Airflow 2 from performance standpoint Users also will learn how to optimize Airflow scheduler performance by understanding DAG file processing, task scheduling and configuring Scheduler to run tens of thousands of DAGs/tasks in Airflow 3

Navigating Secure and Cost-Efficient Flink Batch on Kubernetes with Airflow

2025-07-01 · Airflow Summit 2025

session

by Prakash Nandha Mukunthan , Purshotam Shah

Airflow Flink Cyber Security

At Yahoo, we built a secure, scalable, and cost-efficient batch processing platform using Amazon MWAA to orchestrate Apache Flink jobs on EKS, managed by the Flink Kubernetes Operator. This setup enables dynamic job orchestration while meeting strict enterprise compliance standards. In this session, we’ll share how Airflow DAGs: Dynamically launch, monitor, and clean up isolated Flink clusters per batch job, improving resource efficiency. Securely fetch EKS kubeconfig, submit FlinkDeployment CRDs using FlinkKubernetesOperator, and poll job status using Airflow sensors. Integrate IAM for access control and meet Yahoo’s security requirements, including mutual TLS (mTLS) with Athenz. Optimize for cost and resilience through automated cleanup of jobs and the operator, and handle job failures and retries. Join us for practical strategies and lessons from Yahoo’s production-scale Flink workflows in a Kubernetes environment.

Scaling Airflow at OpenAI

2025-07-01 · Airflow Summit 2025

session

by Howie Wang , Ping Zhang

Airflow LLM

This talk shares how we scaled and hardened OpenAI’s Airflow deployment to orchestrate thousands of workflows on Kubernetes. We’ll cover key architecture choices, scaling strategies, and reliability improvements - along with practical lessons learned.

Task failures troubleshooting based on Airflow & Kubernetes signals

2025-07-01 · Airflow Summit 2025

session

by Khadija Al Ahyane

Airflow

Per Airflow community survey, Kubernetes is the most popular compute platform used to run Airflow and when run on Kubernetes, Airflow gains, out of the box, lots of benefits like monitoring, reliability, ease of deployment, scalability and autoscaling. On the other hand, running Airflow on Kubernetes means running a sophisticated distributed system on another distributed system which makes troubleshooting of Airflow tasks and DAGs failures harder. This session tackles that bottleneck head-on, introducing a practical approach to building an automated diagnostic pipeline for Airflow on Kubernetes. Imagine offloading tedious investigations to a system that, on task failure, automatically collects and correlates key signals from Kubernetes components (linking Airflow tasks to specific Pods and their events), KubernetesGKE monitoring, and relevant logs—pinpointing root causes and suggesting actionable fixes. Attendees will leave with a clear understanding of common Airflow-on-Kubernetes failure patterns—and more importantly, a blueprint and practical strategies to reduce MTTR and boost team efficiency.

Naci Simsek: Lets Deeploy Flink - Uncovering Hidden Depths of Yarn, Docker, Kubernetes & Beyond

2025-06-27 · DATA MINER Big Data Europe Conference 2020 Watch

video

by Naci Simsek

Flink Docker

The platform behind a generic noninvasive neuromotor interface for human-computer interaction

2025-06-24 · Google NY Site Reliability Engineering (SRE) Tech Talks, 24 Jun 2025

talk

by Stepan Hruda (Meta Reality Labs) , Rudi Chiarito

AI/ML Data Collection

In 2024, the Ctrl-labs team at Meta Reality Labs published a preprint, introducing the science behind a new neural input device worn on the wrist. This talk will cover the custom Kubernetes-based platform underlying both the research/ML workloads and the data collection. We'll talk about the challenges of serving 'only' hundreds of internal scientists and engineers, while also supporting data collection from thousands of participants. We'll cover the evolution of the services and codebase, the reliability tradeoffs, the growing pains and the custom tools that we had to build.

Deploying Unity Catalog OSS on Kubernetes: Simplifying Infrastructure Management

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by Vasilii Bulatov (Nebius)

Cloud Computing Data Governance Cyber Security

In modern data infrastructure, efficient and scalable data governance is essential for ensuring security, compliance, and accessibility. This session explores how to deploy Unity Catalog OSS on Kubernetes, leveraging its cloud-agnostic nature and efficient resource management. Helm makes Unity Catalog deployment simple and easy by providing a simplified installation process, easy configuration and credentials management.The session will cover why Kubernetes is the ideal platform, provide a technical breakdown of Unity Catalog on Kubernetes, and include a live showcase of its seamless deployment process. By the end, participants will confidently configure and deploy Unity Catalog OSS in their preferred Kubernetes environment and integrate it into their existing infrastructure.

Securing Helm charts in public repositories

2025-05-20 · Identifying vulnerabilities in public Kubernetes Helm charts

talk

by Nigel Douglas (Cloudsmith)

chart integrity dependency vulnerabilities devsecops helm rbac supply chain security

30-minute talk on the evolving threat landscape around Helm charts in public repositories. We’ll discuss real-world incidents such as the Codecov supply chain attack and hypothetical attack vectors like 'ChartSploit', highlighting how seemingly benign configurations can be exploited. Topics include anatomy of vulnerable charts, risk areas (RBAC misconfigurations, dependency vulnerabilities), and actionable strategies to secure Kubernetes environments—auditing deployments, verifying chart integrity, enforcing strict access controls, and adopting DevSecOps practices.

How Kubernetes is Built with Kat Cosgrove

2025-05-14 · The Pragmatic Engineer Listen

podcast_episode

by Gergely Orosz , Kat Cosgrove

AI/ML Cloud Computing GenAI Linux Marketing SaaS

Supported by Our Partners •⁠ WorkOS — The modern identity platform for B2B SaaS. •⁠ Modal⁠ — The cloud platform for building AI applications. •⁠ Cortex⁠ — Your Portal to Engineering Excellence. — Kubernetes is the second-largest open-source project in the world. What does it actually do—and why is it so widely adopted? In this episode of The Pragmatic Engineer, I’m joined by Kat Cosgrove, who has led several Kubernetes releases. Kat has been contributing to Kubernetes for several years, and originally got involved with the project through K3s (the lightweight Kubernetes distribution). In our conversation, we discuss how Kubernetes is structured, how it scales, and how the project is managed to avoid contributor burnout. We also go deep into: • An overview of what Kubernetes is used for • A breakdown of Kubernetes architecture: components, pods, and kubelets • Why Google built Borg, and how it evolved into Kubernetes • The benefits of large-scale open source projects—for companies, contributors, and the broader ecosystem • The size and complexity of Kubernetes—and how it’s managed • How the project protects contributors with anti-burnout policies • The size and structure of the release team • What KEPs are and how they shape Kubernetes features • Kat’s views on GenAI, and why Kubernetes blocks using AI, at least for documentation • Where Kat would like to see AI tools improve developer workflows • Getting started as a contributor to Kubernetes—and the career and networking benefits that come with it • And much more! — Timestamps (00:00) Intro (02:02) An overview of Kubernetes and who it’s for (04:27) A quick glimpse at the architecture: Kubernetes components, pods, and cubelets (07:00) Containers vs. virtual machines (10:02) The origins of Kubernetes (12:30) Why Google built Borg, and why they made it an open source project (15:51) The benefits of open source projects (17:25) The size of Kubernetes (20:55) Cluster management solutions, including different Kubernetes services (21:48) Why people contribute to Kubernetes (25:47) The anti-burnout policies Kubernetes has in place (29:07) Why Kubernetes is so popular (33:34) Why documentation is a good place to get started contributing to an open-source project (35:15) The structure of the Kubernetes release team (40:55) How responsibilities shift as engineers grow into senior positions (44:37) Using a KEP to propose a new feature—and what’s next (48:20) Feature flags in Kubernetes (52:04) Why Kat thinks most GenAI tools are scams—and why Kubernetes blocks their use (55:04) The use cases Kat would like to have AI tools for (58:20) When to use Kubernetes (1:01:25) Getting started with Kubernetes (1:04:24) How contributing to an open source project is a good way to build your network (1:05:51) Rapid fire round — The Pragmatic Engineer deepdives relevant for this episode: •⁠ Backstage: an open source developer portal •⁠ How Linux is built with Greg Kroah-Hartman •⁠ Software engineers leading projects •⁠ What TPMs do and what software engineers can learn from them •⁠ Engineering career paths at Big Tech and scaleups — See the transcript and other references from the episode at ⁠⁠https://newsletter.pragmaticengineer.com/podcast⁠⁠ — Production and marketing by ⁠⁠⁠⁠⁠⁠⁠⁠https://penname.co/⁠⁠⁠⁠⁠⁠⁠⁠. For inquiries about sponsoring the podcast, email [email protected].

Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe

Apache Kafka in Action

2025-05-04 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Alexander Kropp , Anatoly Zelenin (DataFlow Academy)

Analytics Cloud Computing Kafka Microsoft Data Streaming data data-engineering streaming-messaging

Apache Kafka, start to finish. Apache Kafka in Action: From basics to production guides you through the concepts and skills you’ll need to deploy and administer Kafka for data pipelines, event-driven applications, and other systems that process data streams from multiple sources. Authors Anatoly Zelenin and Alexander Kropp have spent years using Kafka in real-world production environments. In this guide, they reveal their hard-won expert insights to help you avoid common Kafka pitfalls and challenges. Inside Apache Kafka in Action you’ll discover: Apache Kafka from the ground up Achieving reliability and performance Troubleshooting Kafka systems Operations, governance, and monitoring Kafka use cases, patterns, and anti-patterns Clear, concise, and practical, Apache Kafka in Action is written for IT operators, software engineers, and IT architects working with Kafka every day. Chapter by chapter, it guides you through the skills you need to deliver and maintain reliable and fault-tolerant data-driven applications. About the Technology Apache Kafka is the gold standard streaming data platform for real-time analytics, event sourcing, and stream processing. Acting as a central hub for distributed data, it enables seamless flow between producers and consumers via a publish-subscribe model. Kafka easily handles millions of events per second, and its rock-solid design ensures high fault tolerance and smooth scalability. About the Book Apache Kafka in Action is a practical guide for IT professionals who are integrating Kafka into data-intensive applications and infrastructures. The book covers everything from Kafka fundamentals to advanced operations, with interesting visuals and real-world examples. Readers will learn to set up Kafka clusters, produce and consume messages, handle real-time streaming, and integrate Kafka into enterprise systems. This easy-to-follow book emphasizes building reliable Kafka applications and taking advantage of its distributed architecture for scalability and resilience. What's Inside Master Kafka’s distributed streaming capabilities Implement real-time data solutions Integrate Kafka into enterprise environments Build and manage Kafka applications Achieve fault tolerance and scalability About the Reader For IT operators, software architects and developers. No experience with Kafka required. About the Authors Anatoly Zelenin is a Kafka expert known for workshops across Europe, especially in banking and manufacturing. Alexander Kropp specializes in Kafka and Kubernetes, contributing to cloud platform design and monitoring. Quotes A great introduction. Even experienced users will go back to it again and again. - Jakub Scholz, Red Hat Approachable, practical, well-illustrated, and easy to follow. A must-read. - Olena Kutsenko, Confluent A zero to hero journey to understanding and using Kafka! - Anthony Nandaa, Microsoft Thoughtfully explores a wide range of topics. A wealth of valuable information seamlessly presented and easily accessible. - Olena Babenko, Aiven Oy

Debugging apps on Google Kubernetes Engine

2025-04-11 · Google Cloud Next '25

session

Cloud Computing GCP

Debug Google Kubernetes Engine (GKE) apps like a pro! This hands-on lab covers using Cloud Logging & Monitoring to detect, diagnose, and resolve issues in a microservices application deployed on GKE. Learn practical troubleshooting workflows.

If you register for a Learning Center lab, please ensure that you sign up for a Google Cloud Skills Boost account for both your work domain and personal email address. You will need to authenticate your account as well (be sure to check your spam folder!). This will ensure you can arrive and access your labs quickly onsite. You can follow this link to sign up!

Architectural approaches for RAG infrastructure

2025-04-11 · Google Cloud Next '25

session

by Megan O'Keefe (Google) , Kumar Dhanagopal (Google Cloud)

AI/ML Cloud Computing GCP GenAI RAG

Unlock the power of generative AI with retrieval augmented generation (RAG) on Google Cloud. In this session, we’ll navigate key architectural decisions to deploy and run RAG apps: from model and app hosting to data ingestion and vector store choice. We’ll cover reference architecture options – from an easy-to-deploy approach with Vertex AI RAG Engine, to a fully managed solution on Vertex AI, to a flexible DIY topology with Google Kubernetes Engine and open source tools – and compare trade-offs between operational simplicity and granular control.

Data on Kubernetes: Run stateful apps and AI workloads on GKE

2025-04-11 · Google Cloud Next '25

session

by Ugur Arpaci (Codeway) , Brian Kaufman (Google Cloud) , Volkan Aydingul (Codeway) , Thierry Damiba (Qdrant)

AI/ML Cloud Computing Cloud Storage Data Management

This session explores the evolution of data management on Kubernetes for AI and machine learning (ML) workloads and modern databases, including Google’s leadership in this space. We’ll discuss key challenges and solutions, including persistent storage with solutions like checkpointing and Cloud Storage FUSE, and accelerating data access with caching. Customers Qdrant and Codeway will share how they’ve successfully leveraged these technologies to improve their AI, ML, and database performance on Google Kubernetes Engine (GKE).

Effortless AI/ML: Accessing GPUs and TPUs on GKE made easy

2025-04-11 · Google Cloud Next '25

session

by Fisayo Feyisetan (Google Cloud) , Ed Shrager (Baseten) , Michal Zylinski (Google Cloud)

AI/ML

Running AI workloads on Google Kubernetes Engine (GKE) presents unique challenges, especially for securing the right hardware. Whether you’re dealing with unpredictable demand and varying job durations or simply looking to control costs, this session will equip you with the knowledge and tools to make informed decisions about your GKE AI infrastructure. We’ll explore recent advancements in Dynamic Workload Scheduler, custom compute classes, and Kueue, demonstrating how these technologies can help you effectively access and manage diverse hardware resources.

The need for speed: How our customers are slashing AI model startup latency

2025-04-11 · Google Cloud Next '25

session

by Gari Singh (Google Cloud) , Brandon Royal (Google Cloud)

AI/ML

Grappling with scaling your AI and machine learning (ML) platforms to meet demand and ensuring rapid recovery from failures? This session dives into strategies for optimizing end-to-end startup latency for AI and ML workloads on Google Kubernetes Engine (GKE). We’ll explore how image and pod preloading techniques can significantly reduce startup times, enabling faster scaling and improved reliability. Real-world examples will show how this has led to dramatic improvements in application performance, including a 95% reduction in pod startup time and 1.2x–2x speedup.

talk-data.com

Activity Trend

Top Events

Top Speakers

Benchmarking the Performance of Dynamically Generated DAGs

Building an MLOps Platform for 300+ ML/DS Specialists on Top of Airflow

Designing Scalable Retrieval-Augmented Generation (RAG) Pipelines at SAP with Apache Airflow

Do you trust Airflow with your money? (We do!)

From Complexity to Simplicity with TaskHarbor: Trendyol's Path to a Unified Orchestration Platform

Lessons learned for scaling up Airflow 3 in Public Cloud

Navigating Secure and Cost-Efficient Flink Batch on Kubernetes with Airflow

Scaling Airflow at OpenAI

Task failures troubleshooting based on Airflow & Kubernetes signals

Naci Simsek: Lets Deeploy Flink - Uncovering Hidden Depths of Yarn, Docker, Kubernetes & Beyond

The platform behind a generic noninvasive neuromotor interface for human-computer interaction

Deploying Unity Catalog OSS on Kubernetes: Simplifying Infrastructure Management

Securing Helm charts in public repositories

How Kubernetes is Built with Kat Cosgrove

Apache Kafka in Action

Debugging apps on Google Kubernetes Engine

Architectural approaches for RAG infrastructure

Data on Kubernetes: Run stateful apps and AI workloads on GKE

Effortless AI/ML: Accessing GPUs and TPUs on GKE made easy

The need for speed: How our customers are slashing AI model startup latency