Want to prevent outages before they happen? Traditional SRE methods focus on component failures, but a whole class of outages stems from unexpected system interactions. We found a solution: our team uses Systems-Theoretic Process Analysis (STPA) to identify and fix system-level vulnerabilities before they cause outages. By applying STPA during the design phase, we've prevented major incidents and saved countless engineering hours. This talk will show you how STPA can transform your approach to reliability. We'll share a real-world example where STPA caught critical design flaws that traditional methods missed, saving us months of costly rework. Don't wait for outages to happen. Learn how STPA can help you build more resilient systems and become a 1000x engineer.
Theo is a Senior Site Reliability Engineer for Google Maps. He is leading a program to improve road closure data safety. Previously, he led a program identifying risky dependencies within Google Maps. In his spare time, he hosts supper clubs.
talk-data.com
Company
Google Maps
Speakers: 2
Activities: 2
Speakers from Google Maps
Talks & appearances
2 activities from Google Maps speakers
Theo Klein
(Senior Site Reliability Engineer)
Data safety is becoming increasingly important. This talk introduces the topic and broadens the discussion beyond traditional losses around data integrity. When you think of SRE, RPC services and service operations immediately come to mind: errors, latency, managing the size and number of tasks, and so on. For most products, there is another important story: that of data flows and data sets. A critical error in data (e.g., a critical highway missing a segment of its route) could have widespread consequences for users. No amount of RPC service-level reliability will protect against that risk. We need to think about safety against data loss.