talk-data.com

Meetup talk 2024-02-29 at 18:30

"Handling incidents collaboratively is like solving a Rubik's Cube"

Description

Understanding the business outcome and overall functionality of a system made up of multiple distributed services, plus the infrastructure components needed to run them at scale, is almost like solving a Rubik's Cube. When an incident occurs, it is not enough to look at a single side of the cube. To solve the puzzle, all sides need to be considered. Similarly, resolving an incident requires collaboration across different teams.

Administering and monitoring a distributed system should not be the sole responsibility of a single engineering team. Observability should be a goal for, and deliver value to, all engineering teams. In practice, however, it is often treated as a mantra for SRE teams only.

Speaking from the perspective of an application engineer, I will outline how application engineers benefit from understanding infrastructure and common incidents, and how SRE teams benefit from understanding common failures in application code. Let’s take a deeper look at what collaboration across different engineering teams means and how it supports the process of solving the Rubik's Cube together.