Talk on leveraging AI in SRE to transform incident response, moving from firefighting to force multiplication while addressing related risks.
talk-data.com
Topic: sre (3 tagged)
Data safety is becoming increasingly important, and this talk introduces the topic to the audience, looking beyond traditional concerns about losses of data integrity. When you think of SRE, RPC services and service operations immediately come to mind: errors, latency, managing the size and number of tasks, and so on. For most products, there is another important story: that of data flows and data sets. A critical error in data (e.g., a critical highway missing a segment in its route) could have widespread consequences for users. No amount of RPC service-level reliability will protect against that risk. We need to think about safety against data loss.
Justin will explain, through real-world use cases, how teams can adopt the emerging practice of metric scorecards to reduce meetings and streamline release readiness assessments using data and automation. The list of criteria required to release a service to production, often referred to as a “production readiness standard,” is a mandatory component of reliable software delivery systems. Aligning on these standards cross-functionally is challenging, especially when standards may need to be bypassed or changed, often at the last minute. Most importantly, systems always drift: software that met these requirements six months ago may no longer meet them today, so can it still be considered ready for production? Teams often resort to time-consuming practices that are brittle and difficult to change. Cortex has pioneered the scorecard as a means of driving engineering initiatives using gamification. By ingesting data from the various systems that engineers would normally check manually, processes are streamlined and readiness checks are transformed into an always-on, continuous verification of readiness.
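To make the scorecard idea concrete, here is a minimal sketch in Python of how readiness criteria might be evaluated against ingested data. It is an illustration only and does not represent Cortex's product or API. The `Rule` class, the `evaluate` function, the rule names, the 80% coverage threshold, and the fields in `snapshot` are all hypothetical, chosen only to show the shape of the check.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Rule:
    """One production-readiness criterion (hypothetical structure)."""
    name: str
    check: Callable[[Dict[str, Any]], bool]


def evaluate(service: Dict[str, Any], rules: List[Rule]) -> Dict[str, Any]:
    """Run every rule against a snapshot of service data and summarize."""
    if not rules:
        raise ValueError("a scorecard needs at least one rule")
    results = {rule.name: bool(rule.check(service)) for rule in rules}
    passed = sum(results.values())
    return {
        "results": results,
        "score": passed / len(rules),   # fraction of rules passed
        "ready": passed == len(rules),  # ready only if every rule passes
    }


# Hypothetical readiness standard; real criteria would come from the team.
RULES = [
    Rule("has_oncall_rotation", lambda s: bool(s.get("oncall"))),
    Rule("test_coverage_at_least_80", lambda s: s.get("coverage", 0) >= 80),
    Rule("runbook_linked", lambda s: bool(s.get("runbook_url"))),
]

# Hypothetical snapshot, standing in for data ingested from other systems.
snapshot = {
    "oncall": "team-payments",
    "coverage": 72,
    "runbook_url": "https://example.com/runbook",
}

report = evaluate(snapshot, RULES)
```

Running `evaluate` on a schedule, rather than once before launch, is what would turn a one-time checklist into the continuous verification described above: a service whose coverage drifts below the threshold would show up as not ready on the next run.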