Elad Eldor

Activities

2

talks

Author

Filter by Event / Source

Data Engineering Podcast 1 O'Reilly Data Engineering Books 1

Talks & appearances

2 activities · Newest first

Search activities →

Troubleshooting Kafka In Production

2023-12-24 · Data Engineering Podcast Listen

podcast_episode

with Elad Eldor , Tobias Macey

AI/ML Analytics Cloud Computing Data Engineering Data Lake Data Lakehouse

Summary

Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: : Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Elad Eldor about operating Kafka in production and how to keep your clusters stable and performant

Interview

Introduction How did you get involved in the area of data management? Can you describe your experiences with Kafka?

What are the operational challenges that you have had to overcome while working with Kafka? What motivated to write a book about how to manage Kafka in production?

There are many options now for persistent data queues. What are the factors to consider when determining whether Kafka is the right choice?

In the case where Kafka is the appropriate tool, there are many ways to run it now. What are the considerations that teams need to work through when determining whether/where/how to operate a cluster?

When provisioning a Kafka cluster, what are the requirements that need to be considered when determining the sizing?

What are the axes along which size/scale need to be determined?

The core promise of Kafka is that it is a durable store for continuous data. What are the mechanisms that are available for preventing data loss?

Under what circumstances can data be lost?

What are the different failure conditions that cluster operators need to be aware of?

What are the monitoring strategies that ar

Kafka Troubleshooting in Production: Stabilizing Kafka Clusters in the Cloud and On-premises

2023-11-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

data data-engineering streaming-messaging Kafka Cloud Computing DataOps

This book provides Kafka administrators, site reliability engineers, and DataOps and DevOps practitioners with a list of real production issues that can occur in Kafka clusters and how to solve them. The production issues covered are assembled into a comprehensive troubleshooting guide for those engineers who are responsible for the stability and performance of Kafka clusters in production, whether those clusters are deployed in the cloud or on-premises. This book teaches you how to detect and troubleshoot the issues, and eventually how to prevent them. Kafka stability is hard to achieve, especially in high throughput environments, and the purpose of this book is not only to make troubleshooting easier, but also to prevent production issues from occurring in the first place. The guidance in this book is drawn from the author's years of experience in helping clients and internal customers diagnose and resolve knotty production problems and stabilize their Kafka environments. The book is organized into recipe-style troubleshooting checklists that field engineers can easily follow when under pressure to fix an unstable cluster. This is the book you will want by your side when the stakes are high, and your job is on the line. What You Will Learn Monitor and resolve production issues in your Kafka clusters Provision Kafka clusters with the lowest costs and still handle the required loads Perform root cause analyses of issues affecting your Kafka clusters Know the ways in which your Kafka cluster can affect its consumers and producers Prevent or minimize data loss and delays in data streaming Forestall production issues through an understanding of common failure points Create checklists for troubleshooting your Kafka clusters when problems occur Who This Book Is For Site reliability engineers tasked with maintaining stability of Kafka clusters, Kafka administrators who troubleshoot production issues around Kafka, DevOps and DataOps experts who are involved with provisioning Kafka (whether on-premises or in the cloud), developers of Kafka consumers and producers who wish to learn more about Kafka