Search – talk-data.com

Title & Speakers	Event
Google NY Site Reliability Engineering (SRE) Tech Talks, 16 Dec 2025 2025-12-16 · 23:00 Google SRE NYC proudly announces our last Google SRE NYC Tech Talk for 2025. This event is co-sponsored by sentry.io. Thank you Sentry for your partnership! Let's bid farewell to 2025 with three amazing interactive short talks on Site Reliability and DevOps topics! As always, the event will include an opportunity to mingle with the speakers and attendees over some light snacks and beverages after the talks. The Meetup will take place on Tuesday, 16th of December 2025 at 6:00 PM at our Chelsea Markets office in NYC. The doors will open at 5:30 PM. Please RSVP only if you're able to attend in person; there will be no live streaming. When RSVP'ing to this event, please enter your full name exactly as it appears on your government-issued ID. You will be required to present your ID at check-in. Agenda: Paul Jaffre - Senior Developer Experience Engineer, sentry.io *One Trace to Rule Them All: Unifying Sentry Errors with OpenTelemetry tracing* SREs face the challenge of operating reliable observability infrastructure while avoiding vendor lock-in from proprietary APM (Application Performance Monitoring) solutions. OpenTelemetry has become the standard for instrumenting applications, allowing teams to collect traces, metrics, and logs. But raw telemetry data isn't enough. SREs need tools to visualize, debug, and respond to production incidents quickly. Sentry now supports OTLP, enabling teams to send OpenTelemetry data directly to Sentry for analysis. This talk covers how Sentry's OTLP support works in practice: connecting frontend and backend traces across services, correlating logs with distributed traces, and using tools to identify slow queries and performance bottlenecks. We'll discuss the practical benefits for SREs, like faster incident resolution, better cross-team debugging, and the flexibility to change observability backends without re-instrumenting code. 
Paul’s background spans engineering, product management, UX design, and open source. He has a soft spot for dev tools and loses sleep over making things easy to understand and use. Paul has a dynamic professional background, from strategy to stability. His time at Krossover Intelligence established a strong foundation by blending Product Management with hands-on development, and he later focused on core reliability at MakerBot, where he implemented automated end-to-end testing and drove performance improvements. He then extended this expertise in stability and scale at Cypress.io, where he served as a Developer Experience Engineer, focusing on improving workflow, contribution, and usability for their widely adopted open-source community. Thiara Ortiz - Cloud Gaming SRE Manager, Netflix *Managing Black Box Systems* SREs often face ambiguity when managing black box systems (LLMs, Games, Poorly Understood Dependencies). We will discuss how Netflix monitors service health as black boxes using multiple measurement techniques to understand system behavior, aligning with the need for robust observability tools. These strategies are crucial for system reliability and user experience. By proactively identifying and resolving issues, we ensure a smoother playback experience and maintain user trust, even as the platform continues to evolve and gain maturity. The principles shared within this talk can be expanded to other applications such as AI reliability in data quality and model deployments. Thiara has worked at some of the largest internet companies in the world, Meta and Netflix. During her time at Meta, Thiara found a passion for distributed systems and bringing new hardware into production. Always curious to explore new solutions to complex problems, Thiara developed Fleet Scanner, internally known as Lemonaid, to perform memory, compute, and storage benchmarks on each Meta server in production. 
This service runs on over 5 million servers and continues to be utilized at Meta. Since Meta, Thiara has been working at Netflix as a Senior CDN Reliability Engineer, and now, Cloud Gaming SRE Manager. When incidents occur and Netflix's systems do not behave as expected, Thiara can be found working and engaging the necessary teams to remediate these issues. Andrew Espira - Platform and Site Reliability Engineer, Founding Engineer, Kustode *ML-Powered Predictive SRE: Using Behavioral Signals to Prevent Cluster Inefficiencies Before They Impact Production* SREs managing ML clusters often discover resource inefficiencies and queue bottlenecks only after they've impacted production services. This talk presents a machine learning approach to predict these issues before they occur, transforming SRE from reactive firefighting to proactive system optimization. We demonstrate how to build predictive models using production cluster traces that identify two critical failure modes: (1) GPU under-utilization relative to requested resources, and (2) abnormal queue wait times that indicate impending service degradation. SRE practitioners will learn how to extract early warning indicators from standard cluster logs, build ML models that provide actionable confidence scores for operational decisions, and take practical steps to integrate predictive analytics into existing SRE toolchains to achieve a 50%+ reduction in resource waste and queue-related incidents. This talk bridges the gap between traditional SRE observability and modern predictive analytics, showing how teams can evolve from reactive monitoring to intelligent, forward-looking reliability engineering. Andrew has over 8 years of experience architecting and maintaining large-scale distributed systems. He is the Founding Engineer of Kustode (kustode.com), where he develops cutting-edge reliability and observability solutions for modern infrastructure in the insurance and health care solutions space. 
Currently pursuing graduate studies in Data Science at Saint Peter's University, he specializes in the intersection of reliability engineering and artificial intelligence. His research focuses on applying machine learning to operational challenges, with publications in peer-reviewed venues including ScienceDirect. He's passionate about making complex systems more predictable and maintainable through data-driven approaches. When not optimizing cluster performance or building the next generation of observability tools, Andrew enjoys contributing to open-source projects and mentoring early-career engineers in the SRE community. Our Tech Talks series are for professional development and networking: no recruiters, sales or press please! Google is committed to providing a harassment-free and inclusive conference experience for everyone, and all participants must follow our Event Community Guidelines. The event will be photographed and video recorded. Event space is limited! A reservation is required to attend. Reserve your spot today and share the event details with your SRE/DevOps friends 🙂	Google NY Site Reliability Engineering (SRE) Tech Talks, 16 Dec 2025
Event Google SRE NY Tech Talk 2024-05-22
The Hammer Changes the Hand 2024-05-22 · 22:00 Sal Furino – Customer Reliability Engineer (CRE) @ Bloomberg Imagine you’re observing a worker swinging a hammer. As they swing the hammer, they make small adjustments to better hit and drive the nail or rivet into the surface. These adjustments are made unconsciously. The hammer has become an extension of their arm. It’s important to consider that the arm doesn’t just change the hammer; it gives it new meaning beyond that of simply some wood and steel. But the hammer also changes the arm! Weeks, months, years of swinging that hammer changes the worker themselves. The tools we use change us and enable us to think and interact with the world differently. This talk will briefly explore how to view internal tooling through the lens of product management, not just in developing and shipping features, but in how those features empower teams to change their understanding of their sociotechnical systems.
How we measure Quality of Experience to ensure our members get a world class experience they have come to expect from Netflix 2024-05-22 · 22:00 Thiara Ortiz – Staff CDN Reliability Engineer @ Netflix Any time a Netflix member sits down, reclines in their chair and turns on their TV to Netflix, there's a moment of truth. It's an opportunity to deliver a spectacular service with amazing quality of experience. Misses, errors, or high latency that prevent individuals from streaming, as a result of ISP configuration changes, code deployment, or catastrophic fallback, result in an impact on how our service is perceived. This talk will go over how we measure the quality of experience for our members and how we work to develop new metrics when we have additional offerings like live streaming and cloud gaming. Cloud Computing Data Streaming
LLM for SRE / Using LLM in SRE space 2024-05-22 · 22:00 Mike Scherbakov – Staff Site Reliability Engineer @ Google LLMs open up an opportunity to automate and scale many operational processes, which couldn't be otherwise solved by conventional methods. Examples include simple summarization of issues and incidents, assisting production on-callers, managing incidents, clustering (creating taxonomy) of issues, scaling SRE via assisted review of development design documents. Therefore LLMs provide a new and unique opportunity to transform the work we do as SREs. LLM

Event Google SRE NY Tech Talk 2024-02-07
Mechanical systems -> Biological systems: How managing infrastructure changes with scale and so how should we approach it 2024-02-07 · 23:00 Sami Meharzi – SRE, Big Table @ Google When starting, software systems are similar to mechanical systems where functionality and changes are fairly predictable. However, with more automation and dynamic interactions, software systems start behaving more like biological systems/ecosystems. This sometimes leads to relatively small things having crazy unintended consequences and large things not quite having as much impact as one would hope. This stems from the full ecosystem and how everything (eventually) has some impact on everything else. With this in mind, there are approaches to solving problems at scale that would not make sense otherwise and some approaches that are detrimental.
Fake 'till you make it: Get the most out of incident simulations 2024-02-07 · 23:00 Ashley Sawatsky – Senior Reliability & Incident Response Advocate @ Rootly Palms are sweaty, knees weak, arms are heavy...sound like your first on-call shift? One of the biggest challenges in incident response work, especially for newer SREs, is the lack of safe spaces to fail. Incident simulations can be an effective way to take the terror out of that first on-call shift, but they take careful planning. In this talk, I’ll explore different types of simulations (from tabletops to full-on realistic mock incidents), how and when to utilize them, and how to make sure you get the most out of them.
We've Done Everything Right. But Bad Things Keep Happening and Now What? 2024-02-07 · 23:00 Mattie Toia – Engineering Director, Production Platform Infrastructure @ Shopify While we all can find places to improve, this talk will discuss how we can respond when bad things happen despite our implementation of many if not all of the recommended reliability practices. We'll talk about reasons why this might be the case, and then we'll examine some possible approaches to addressing them.

Google SRE NY Tech Talk 2023-11-02 · 22:00 Google SRE NYC is proud to present our latest Tech Talk. We have an exciting lineup of speakers, followed by a social mixer with light snacks and beverages. We look forward to seeing everyone on Nov 2 at our Chelsea Markets office. Space is limited; please RSVP now to secure your spot. Our venue this time is right above the famous Chelsea Markets. Enter the Google Chelsea Markets lobby via the 9th Ave (cor. 16th St) stairs B or use elevator 33. We will have our team members and Google security to guide you and assist with elevator access. Ellora Praharaj (Stack Overflow) - "The Good, the Bad, and the “Uh …” [Agile for SRE & Platform teams]" Ellora Praharaj is the Director of Reliability Engineering at Stack Overflow. As a technologist with over a decade of experience working with and building high-performing engineering teams, she oversees the SRE teams at Stack Overflow. Previously, she spent over 11 years at Bloomberg, where she was an Engineering Team Lead in their SRE organization. Ellora received her Master's degree in Computer Science from the University of Buffalo. In this presentation, we'll dive into the challenges with enforcing Agile processes for SREs. We'll take a look at some real-world challenges that SRE and platform teams face when navigating the Agile landscape, drawing from my own experience leading such teams. We'll also discuss some strategies for overcoming these challenges, and look ahead to the future of Agile and SRE. Michelangelo Mecozzi (Google) - “Handling Spiky Traffic At Scale” Michelangelo is an SRE on the Firebase NYC team at Google; he is a cuisine fanatic and soccer lover (AC Milan fan). Learn how to prepare your service to handle spiky traffic generated at scale. We will explore the challenges, best practices, and lessons learned while managing traffic generated by the FIFA 2022 World Cup. 
José Velez & Pascal Bovet (rely.io) - “The journey of building a startup in the reliability space” José is the Founder & CEO of Rely.io, where he put together a team of talented people passionate about SRE and Platform Engineering who are building an internal developer portal that allows engineering teams to get automated visibility into the inventory of services and user journeys and into their health, quality and operational maturity. Prior to founding Rely, José built an internal AIOps observability tool for EDP as a SWE. Pascal Bovet is an entrepreneur and startup advisor. Before that, Pascal led the Reliability Department of Robinhood and led multiple infrastructure SRE teams at Google. Delivering fast and maintaining reliability is no small feat. In today's economic climate, this is crucial for companies to thrive - sometimes even to survive. Join José and Pascal as they share their journey to build a startup that aims to solve the biggest challenges engineering teams face that prevent them from achieving high velocity without compromising quality. They'll cover everything from the customer discovery they did to understand the key market pain points, to the SRE and Platform Engineering practices they ended up productising into the developer portal they built. No sales pitch, just sharing practical knowledge and experiences.	Google SRE NY Tech Talk
SRE NY Tech Talks 2023-08-02 · 22:00 Google SRE proudly announces the next event in its Site Reliability Engineering (SRE) Tech Talks series on Wednesday, August 2nd at Google’s Pier 57 building in NYC. The event starts at 6:00 PM and lasts until 8:30 PM. We invite you to join us for an hour of short talks on Reliability and DevOps topics, followed by an opportunity to meet and talk with fellow engineers over light refreshments. We are pleased to welcome the following speakers: Jeff Luery and Yash Mestry, Perpetual. “DevSecOps and SRE integration” Our talk will specifically communicate how SRE and DevOps processes can be implemented into enterprise software engineering and web development projects. We will cover industry best practices, as well as magical tools and tricks to maximize server uptime, performance, reliability and overall efficiency. Hasit Mistry, FluxNinja: “Achieving Fault Tolerance with Observability-driven Load Management” This talk will help the audience build an intuition about load management, starting from basic principles of queuing theory and Little's law. These principles help build understanding of complex failure scenarios and how they manifest in microservices. Following this, the session illustrates how early adopters are making use of Aperture to gracefully degrade their applications during complex failures. In essence, the goal of this talk is to enhance the community's collective understanding of system reliability and the potential of observability driven closed loop automation techniques for effective load management. Ensuring reliable operation of microservices is a challenging task. Metastable failures such as cascading overloads, retry storms and death spirals cause services to enter a permanent state of failure that requires manual intervention to recover. Mitigation strategies like circuit breakers and auto-scaling fall short due to their narrow vantage points. 
To operate microservices reliably, observability-driven automation is required. Aperture is an open source load management system that leverages CNCF technologies such as etcd, Prometheus, OpenTelemetry, Open Policy Agent, and Istio/Envoy. It combines ideas from the world of observability, control systems, and network scheduling to automate service protection and workload prioritization. Andreas Bobak, Google NYC: “Frontend Design by SREs for SREs” In a world where SRE is quickly changing from running and maintaining their own “scripts” to writing large applications so that others can maintain and monitor their systems, new challenges arise in wrapping up their tooling into something that low-context users can easily utilize. With that, a whole new world of SRE frontend emerges. Here are a few ways SREs can think about doing good user interface design for their user journeys.	SRE NY Tech Talks



Activities & events