Pinterest has been part of the Airflow community for two years and has built many custom solutions to address usability, scalability, and efficiency constraints. This session discusses how Pinterest has expanded on those earlier solutions. We cover how we further reduced system latencies, improved the user development experience through added search features, added support for cross-cluster operations, improved debuggability tooling, and made system-level efficiency improvements such as automatically retrying failed tasks that meet certain criteria.
talk-data.com
Speaker
Ace Haidrey
4 talks
Talks & appearances
Last year, we shared why we selected Airflow as our next-generation workflow system. This year, we dive into the journey of migrating more than 3,000 workflows and 45,000 tasks to Airflow. We discuss the infrastructure additions that support such loads, the partitioning and prioritization of the workflow tiers we define in house, the migration tooling we built to help users onboard, the translation layers between our old DSLs and the new one, our internal Kubernetes executor that leverages Pinterest's Kubernetes fleet, and more. We want to share the technical and usability challenges of carrying out such a large migration over the course of a year, and how we overcame them to successfully migrate 100% of our workflows to our in-house workflow platform, branded Spinner.
The two most common user questions at Pinterest are: 1) Why is my workflow running so long? 2) Why did my workflow fail, and is it my issue or a platform issue? As in any big data organization, the workflow platform is just the orchestrator; the "real" work is done on another layer, managed by another platform. There can be many such layers, and tracking down the root cause of an issue can be mundane and time-consuming. At Pinterest, we set out to add tooling to our Airflow webserver that makes inspection quicker and offers smart tips, such as analysis of increased runtimes, bottleneck identification, root cause analysis (RCA), and an easy way to backfill. We take a closer look at the tooling we provide to reduce the admin load and empower our users.
At Pinterest, our current workflow system, called Pinball, has served our data pipeline orchestration demands well for years. However, with rapidly increasing execution demand, the system started to expose scalability and performance issues. We therefore decided to look for a new solution that would better address these issues and serve our workflow scheduling demand, and we chose Airflow as our next-generation workflow system. In this talk we discuss how we made the decision to onboard to Apache Airflow and, beyond the out-of-the-box features and experience, what improvements we made to better support the business needs at Pinterest.