talk-data.com

Topic

Kubernetes

container_orchestration devops microservices


Activities

Filtering by: Airflow Summit 2020

Identify issues in a fraction of the time and streamline root cause analysis for your DAGs. Airflow is the leading orchestration platform for data engineers. But when running Airflow at production scale, many teams have bigger needs for monitoring jobs, creating the right level of alerting, tracking problems in data, and finding the root cause of errors. In this talk we will cover our suggested approach to gaining Airflow observability so that you have the visibility you need to be productive. What is observability? It is the capability to monitor and analyze event logs, along with KPIs and other data, in a way that yields actionable insights. In the data engineering context, observability is crucial for finding problems in jobs and data before those problems impact data consumers downstream. It’s a particularly difficult challenge because of the different platforms data engineers use (Airflow, Spark, Kubernetes, etc.) and the complicated life cycle of data pipeline CI/CD. In the session, we will do a deep dive into the visibility gaps your team might face running production-scale Airflow. We will walk through a typical day in the life of finding errors in DAGs, offer best practices, and discuss open source tools you can use to extend Airflow for observability and robust monitoring. We will use standard Airflow DAG examples to guide the presentation.
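As a rough illustration of turning event logs into KPIs and alerts, the sketch below aggregates task-run records into a failure rate and average duration per task, then flags tasks above a threshold. It is not taken from the talk: the record layout, the `task_kpis` and `alerts` helpers, the sample data, and the 25% threshold are all assumptions made for the example, and it uses plain Python rather than any Airflow API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical task-run records. In practice these might be parsed from
# scheduler logs or a metadata store; the field names are assumptions.
RUNS = [
    {"dag": "sales_etl", "task": "extract", "state": "success", "seconds": 42.0},
    {"dag": "sales_etl", "task": "extract", "state": "failed", "seconds": 310.0},
    {"dag": "sales_etl", "task": "load", "state": "success", "seconds": 15.0},
    {"dag": "sales_etl", "task": "load", "state": "success", "seconds": 17.0},
]


def task_kpis(runs):
    """Group runs by (dag, task) and compute simple KPIs for each group."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run["dag"], run["task"])].append(run)

    kpis = {}
    for key, items in grouped.items():
        failures = sum(1 for r in items if r["state"] == "failed")
        kpis[key] = {
            "runs": len(items),
            "failure_rate": failures / len(items),
            "avg_seconds": mean(r["seconds"] for r in items),
        }
    return kpis


def alerts(kpis, max_failure_rate=0.25):
    """Return the (dag, task) keys whose failure rate exceeds the threshold."""
    return sorted(
        key for key, k in kpis.items() if k["failure_rate"] > max_failure_rate
    )


if __name__ == "__main__":
    for dag, task in alerts(task_kpis(RUNS)):
        print(f"ALERT: {dag}.{task} is failing too often")
```

A real setup would likely add time windows and data-quality checks; the point here is only the shape of the log-to-KPI-to-alert flow.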

At Nielsen Digital we have been moving our ETLs to containerized environments managed by Kubernetes. We have successfully transferred some of our ETLs to this environment in production. In order to do this we used the following technologies: Helm to easily deploy Airflow onto Kubernetes; Airflow’s Kubernetes Executor to take full advantage of Kubernetes features; and Airflow’s Kubernetes Pod Operator to execute our containerized tasks within our DAGs. To automate much of the deployment process we also used Terraform. Lastly, Kubernetes features were used to gain much more fine-grained control of Airflow’s infrastructure. Join me in this talk to take an in-depth look at how we used these technologies, why we used them, and the results of using them so far. I will also briefly go over some features coming in Airflow 2.0 that we are considering using in our workflows.
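To make "fine-grained control" more concrete, here is a minimal sketch of the kind of Kubernetes Pod manifest a containerized task ends up running as, with explicit resource requests, limits, and an optional node selector. It is not the speaker's configuration: the `build_pod_manifest` helper, the image name, the labels, and the default sizes are hypothetical, and the manifest is built as a plain dictionary rather than through Airflow's operators.

```python
import json


def build_pod_manifest(name, image, command, cpu="500m", memory="512Mi",
                       node_selector=None):
    """Build a Kubernetes Pod manifest (core v1 API) for a one-shot task.

    All argument values are illustrative; nothing here is Airflow-specific.
    """
    manifest = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": "airflow-task"}},
        "spec": {
            # A batch task should run once and not be restarted in place.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    "command": list(command),
                    # Requests guide scheduling; limits cap actual usage.
                    "resources": {
                        "requests": {"cpu": cpu, "memory": memory},
                        "limits": {"cpu": cpu, "memory": memory},
                    },
                }
            ],
        },
    }
    if node_selector:
        # Pin the pod to nodes carrying these labels, e.g. a dedicated pool.
        manifest["spec"]["nodeSelector"] = dict(node_selector)
    return manifest


if __name__ == "__main__":
    pod = build_pod_manifest(
        name="etl-extract",
        image="example.registry/etl:latest",  # hypothetical image
        command=["python", "extract.py"],
        memory="2Gi",
        node_selector={"pool": "etl"},
    )
    print(json.dumps(pod, indent=2))
```

Per-task sizing and placement of this kind is one reason to run each task in its own pod instead of on shared workers.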

Financial Times is increasing its digital revenue by allowing business people to make data-driven decisions. Providing an Airflow-based platform where data engineers, data scientists, BI experts and others can run language-agnostic jobs was a huge swing. One of the most successful steps in the platform’s development was building our own execution environment, allowing stakeholders to self-deploy jobs without cross-team dependencies on top of the unlimited scale of Kubernetes. In this talk we share how we have integrated and extended Airflow at Financial Times. The main topics we will cover include: providing team-level security isolation; removing cross-team dependencies; creating an execution environment for independently creating and deploying R, Python, Java, Spark, and other jobs; reducing latency when sharing data between task instances; and integrating all these features on top of Kubernetes.

Learn how Devoted Health went from cron jobs to an Airflow deployment on Kubernetes using a combination of open source and internal tooling. Devoted Health, a Medicare Advantage startup, went from cron jobs to Airflow on Kubernetes in a short period of time. This journey is a common one, but still has a steep learning curve for new Airflow users. This talk will give you a blueprint to follow by covering the tools we use, best practices, and lessons learned. We’ll share Devoted’s approach to managing our deployment, monitoring the platform, and developing, testing, and deploying DAGs. This includes internal tooling we’ve written that allows Data Scientists to work with Airflow without worrying about Airflow itself.

At Nielsen Identity Engine, we use Spark to process tens of TBs of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day. In this talk, we’ll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.

In this talk, we share the lessons learned while building a scheduler-as-a-service leveraging Apache Airflow to achieve improved stability and security for one of the largest gaming companies. The platform integrates with different data sources and meets varied SLAs across workflows owned by multiple game studios. In particular, we present a comprehensive self-serve Airflow architecture with multi-tenancy, auto-DAG generation, SSO integration, and improved ease of deployment. Within Electronic Arts, to provide scheduler-as-a-service and to support hundreds of thousands of execution workflows, each team requires an isolated environment with access to a central data lake containing several petabytes of anonymized player and game metrics. Leveraging Airflow, each team is provided a private code repository and namespace with which they can deploy their DAGs at their own behest. To support agile development cycles, a private testing sandbox and auto-deployment to an isolated multi-tenant Airflow platform have been made available to game studios. In production, a single dockerized Airflow deployment on Kubernetes is utilized to ensure high availability and single-step deployment. Custom SSO integration and RBAC-based operator and sensor whitelisting allow for secure logical isolation. In addition, providing dynamic DAG instantiation capability helps address varied SLAs during game launch seasons that are staggered through a financial year.
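The combination of operator whitelisting and automatic DAG generation described above can be sketched in plain Python. This is an illustrative assumption, not the platform's actual design: the config layout, the `ALLOWED_OPERATORS` set, the `tenant__workflow` naming scheme, and both helper functions are hypothetical, and the output is a dictionary of DAG specifications, not real Airflow DAG objects.

```python
# Hypothetical whitelist of operator class names that tenants may use.
ALLOWED_OPERATORS = frozenset({"BashOperator", "PythonOperator"})


def validate_tasks(tasks, allowed=ALLOWED_OPERATORS):
    """Raise ValueError if any task uses an operator outside the whitelist."""
    rejected = sorted({t["operator"] for t in tasks if t["operator"] not in allowed})
    if rejected:
        raise ValueError(f"operators not allowed: {', '.join(rejected)}")


def generate_dag_specs(tenant_configs):
    """Turn per-tenant workflow configs into namespaced DAG specifications."""
    specs = {}
    for tenant, workflows in tenant_configs.items():
        for workflow in workflows:
            validate_tasks(workflow["tasks"])
            # Prefix the DAG id with the tenant so names cannot collide.
            dag_id = f"{tenant}__{workflow['name']}"
            specs[dag_id] = {
                "owner": tenant,
                "schedule": workflow["schedule"],
                "tasks": [t["task_id"] for t in workflow["tasks"]],
            }
    return specs


if __name__ == "__main__":
    configs = {
        "studio_a": [
            {
                "name": "daily_metrics",
                "schedule": "@daily",
                "tasks": [
                    {"task_id": "pull", "operator": "BashOperator"},
                    {"task_id": "score", "operator": "PythonOperator"},
                ],
            }
        ]
    }
    for dag_id, spec in generate_dag_specs(configs).items():
        print(dag_id, spec)
```

Validating configs before any DAG is instantiated keeps the isolation rule in one place, so tenants cannot bypass it from inside their own repositories.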