Machine Learning Powered Auto Remediation in Netflix Data Platform
Speakers: Stephanie Vezich Tamayo (Senior Machine Learning Engineer at Netflix) and Binbing Hou (Senior Software Engineer at Netflix)
This tech talk is part of the Data Engineering Open Forum at Netflix 2024. At Netflix, hundreds of thousands of workflows and millions of jobs run every day on our big data platform, and diagnosing and remediating job failures can impose a considerable operational burden. To handle errors efficiently, Netflix developed a rule-based error classifier called “Pensive.” However, as the system has grown in scale and complexity, Pensive has faced challenges because of its limited support for operational automation, especially for handling memory configuration errors and unclassified errors. To address these challenges, we have developed a new feature called “Auto Remediation,” which integrates the rule-based classifier with an ML service.
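The general idea of pairing a rule-based classifier with an ML service can be illustrated with a small Python sketch. This is not Netflix's implementation: the rule patterns, the class names, the `executor_memory_gb` setting, and the stubbed `stub_ml_service` function are all hypothetical, chosen only to show how memory configuration errors and unclassified errors might be routed to a model while other errors keep their rule-based handling.

```python
import re
from dataclasses import dataclass, field

# Hypothetical rule table mapping log patterns to error classes.
# These are illustrative only, not Pensive's actual rules.
RULES = [
    (re.compile(r"OutOfMemoryError|memory limit exceeded", re.I), "memory_config"),
    (re.compile(r"Table or view not found", re.I), "user_error"),
]


@dataclass
class Remediation:
    error_class: str
    retry: bool
    config_overrides: dict = field(default_factory=dict)


def classify(log_line: str) -> str:
    """Return the class of the first matching rule, else 'unclassified'."""
    for pattern, error_class in RULES:
        if pattern.search(log_line):
            return error_class
    return "unclassified"


def stub_ml_service(error_class: str, log_line: str) -> Remediation:
    """Stand-in for an ML service (assumption). A real service would
    predict whether to retry and which configuration to apply."""
    if error_class == "memory_config":
        # Hypothetical recommendation: retry with more memory.
        return Remediation(error_class, True, {"executor_memory_gb": 8})
    return Remediation(error_class, False)


def remediate(log_line: str, ml_service=stub_ml_service) -> Remediation:
    """Classify with rules first, then defer to the ML service for the
    two cases the rules alone handle poorly."""
    error_class = classify(log_line)
    if error_class in ("memory_config", "unclassified"):
        return ml_service(error_class, log_line)
    # Other classes keep their rule-based outcome: no automatic retry.
    return Remediation(error_class, False)
```

Calling `remediate("java.lang.OutOfMemoryError: Java heap space")` in this sketch routes the failure to the stubbed service, while a "Table or view not found" message is handled by the rules alone.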
If you are interested in attending a future Data Engineering Open Forum, we highly recommend joining our Google Group (https://groups.google.com/g/data-engineering-open-forum) to stay up to date on event announcements.