talk-data.com
Activities & events
| Title & Speakers | Event |
|---|---|
|
Automated Observability and Incident Response in GCP
2025-06-24 · 22:00
Ronaldo Arrudas
– Digital Development Studio Leader
@ Nearsure
Many SRE teams still rely on manual intervention for incident handling; automation can improve response times and reduce toil. We will cover: setting up comprehensive observability with Cloud Logging, Cloud Monitoring, and OpenTelemetry; incident automation strategies, including runbooks, auto-healing, and ChatOps; lessons from AWS CloudWatch and Azure Monitor applied to GCP; and a case study on reducing MTTR (mean time to resolution) through automated detection and remediation. |
|
|
The platform behind a generic noninvasive neuromotor interface for human-computer interaction
2025-06-24 · 22:00
Stepan Hruda
– Software Engineer at Meta, working at Reality Labs on infrastructure for research teams
@ Meta Reality Labs
Rudi Chiarito
– Former Research Engineer at Meta and a former SRE at Google
In 2024, the Ctrl-labs team at Meta Reality Labs published a preprint introducing the science behind a new neural input device worn on the wrist. This talk will cover the custom Kubernetes-based platform underlying both the research/ML workloads and the data collection. We'll talk about the challenges of serving 'only' hundreds of internal scientists and engineers while also supporting data collection from thousands of participants. We'll cover the evolution of the services and codebase, the reliability tradeoffs, the growing pains, and the custom tools that we had to build. |
|
|
Vibe Coding and Site Reliability
2025-06-24 · 22:00
Gideon Lapshun
– Senior Solutions Engineer
@ Rootly
We'll explore how vibe coding impacts SRE teams. Attendees will learn how this shift affects reliability and incident response and the challenges it introduces, such as developers' reduced familiarity with codebases and the loss of subject matter expertise. We'll discuss why 'incident vibing', leveraging automation and AI-driven features to tackle increased incident volume, is crucial. The audience will learn practical strategies for: accelerating incident response using AI-generated incident briefings and automated post-mortem drafts; streamlining root cause analysis and resolution through AI-powered anomaly detection and contextual data ingestion; and mitigating the limitations of AI systems, such as hallucinations and a lack of context. Ultimately, this talk is about turning a risk into a competitive advantage: not only empowering SRE teams to handle the growing challenges of AI-driven development, but also helping them reach the elusive 'six nines' of reliability. |
|