talk-data.com

People (3 results)

Showing 3 results

Activities & events

Title & Speakers Event
Ronaldo Arrudas – Digital Development Studio Leader @ Nearsure

Many SRE teams still rely on manual intervention for incident handling; automation can improve response times and reduce toil. We will cover:
- Setting up comprehensive observability: Cloud Logging, Cloud Monitoring, and OpenTelemetry
- Incident automation strategies: Runbooks, Auto-Healing, and ChatOps
- Lessons from AWS CloudWatch and Azure Monitor applied to GCP
- Case study: Reducing MTTR (Mean Time to Resolution) through automated detection and remediation

AWS CloudWatch Azure Cloud Computing GCP
Stepan Hruda – Software Engineer at Meta, working at Reality Labs on infrastructure for research teams @ Meta Reality Labs, Rudi Chiarito – Former Research Engineer at Meta and a former SRE at Google

In 2024, the Ctrl-labs team at Meta Reality Labs published a preprint introducing the science behind a new neural input device worn on the wrist. This talk will cover the custom Kubernetes-based platform underlying both the research/ML workloads and the data collection. We'll talk about the challenges of serving 'only' hundreds of internal scientists and engineers while also supporting data collection from thousands of participants. We'll cover the evolution of the services and codebase, the reliability tradeoffs, the growing pains, and the custom tools that we had to build.

AI/ML Data Collection Kubernetes
Gideon Lapshun – Senior Solutions Engineer @ Rootly

We'll explore how vibe coding impacts SRE teams. Attendees will learn how this shift affects reliability and incident response, and the challenges it introduces, such as developers' reduced familiarity with their codebases and the loss of subject matter expertise. We'll discuss why 'incident vibing' (leveraging automation and AI-driven features to tackle increased incident volume) is crucial. The audience will learn practical strategies for:
- Accelerating incident response using AI-generated incident briefings and automated post-mortem drafts.
- Streamlining root cause analysis and resolution through AI-powered anomaly detection and contextual data ingestion.
- Mitigating the limitations of AI systems, such as hallucinations and a lack of context.
Ultimately, this talk is about turning a risk into a competitive advantage: not only empowering SRE teams to handle the growing challenges of AI-driven development, but also helping them progress toward the elusive 'six nines' of reliability.

AI/ML