talk-data.com talk-data.com

Topic

Kafka

Apache Kafka

distributed_streaming message_queue event_streaming

240

tagged

Activity Trend

20 peak/qtr
2020-Q1 2026-Q1

Activities

240 activities · Newest first

Serverless Kafka and Apache Spark in a Multi-Cloud Data Lakehouse Architecture

Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable. Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.

This post explores different architecture to build serverless Kafka and Spark multi-cloud architectures across regions and continents. We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern data lakehouse. Real-world use cases show the joint value and explore the benefit of the "delta lake" integration.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Streaming Data into Delta Lake with Rust and Kafka

Scribd's data architecture was originally batch-oriented, but in the last couple years, we introduced streaming data ingestion to provide near-real-time ad hoc query capability, mitigate the need for more batch processing tasks, and set the foundation for building real-time data applications.

Kafka and Delta Lake are the two key components of our streaming ingestion pipeline. Various applications and services write messages to Kafka as events are happening. We were tasked with getting these messages into Delta Lake quickly and efficiently.

Our first solution was to deploy Spark Structured Streaming jobs. This got us off the ground quickly, but had some downsides.

Since Delta Lake and the Delta transaction protocol are open source, we kicked off a project to implement our own Rust ingestion daemon. We were confident we could deliver a Rust implementation since our ingestion jobs are append only. Rust offers high performance with a focus on code safety and modern syntax.

In this talk I will describe Scribd's unique approach to ingesting messages from Kafka topics into Delta Lake tables. I will describe the architecture, deployment model, and performance of our solution, which leverages the kafka-delta-ingest Rust daemon and the delta-rs crate hosted in auto-scaling ECS services. I will discuss foundational design aspects for achieving data integrity such as distributed locking with DynamoDb to overcome S3's lack of "PutIfAbsent" semantics, and avoiding duplicates or data loss when multiple concurrent tasks are handling the same stream. I'll highlight the reliability and performance characteristics we've observed so far. I'll also describe the Terraform deployment model we use to deliver our 70-and-growing production ingestion streams into AWS.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Auditing Your Data and Answering the Lifelong Question—Is It the End of the Day Yet?

Huge volumes of data flow through a robust Kafka architecture, into several ETLs, receiving, transforming and storing the data. We clearly understood our ETLs’ workflow and our data architecture, from source to destination.

But how much did we know about the way our data makes though our systems? And what about the life long question, is it the end of the day yet?

In this talk I’m going to present to you the design process behind our Data Auditing system, Life Line. From tracking and producing, to analyzing and storing auditing information, using technologies such as Kafka, Avro, Spark, Lambda functions and complex SQL queries. We’re going to cover: * AVRO Audit header * Auditing heart beat - designing your metadata * Designing and optimizing your auditing table - what does this data look like anyway? * Creating an alert based monitoring system * Answering the most important question of all - is it the end of the day yet?

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Elixir: The Wickedly Awesome Batch and Stream Processing Language You Should Have in Your Toolbox

Elixir is an Erlang-VM bytecode-compatible programming language that is growing in popularity.

In this session I will show how you can apply Elixir towards solving data engineering problems in novel ways.

Examples include: • How to leverage Erlang's lightweight distributed process coordination to run clusters of workers across docker containers and perform data ingestion. • A framework that hooks Elixir functions as steps into Airflow graphs. • How to consume and process Kafka events directly within Elixir microservices.

For each of the above I'll show real system examples and walk through the key elements step by step. No prior familiarity with Erlang or Elixir will be required.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Automating Business Decisions Using Event Streams

Today's real-time solutions demand continuousness, autonomy, and observability. Data streams have evolved to guarantee only continuousness; thus, streams alone will never satisfy this demand. Industries instead crave a properly end-to-end streaming architecture backing their applications and services -- a concept that has narrowly evaded realization until now.

In this session, Rohit Bose will demonstrate how such architectures cleanly solve complex problems. This will require two parts:

  1. Building an industry-specific application that continuously generates insights and reports them over dynamically-scoped real-time streams
  2. Discussing the advantages and generalizations of the application's design

The demo will utilize the Swim platform to expose thousands of streaming APIs seeded by an Apache Kafka firehose, enabling both real-time map visualizations and decision-making clients to instantly observe changes across distributed entities with zero unnecessary subscriptions.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

We talked about:

Jeff’s background Getting feedback to become a better teacher Going from engineering to teaching Jeff on becoming a curriculum writer Creating a curriculum that reinforces learning Jeff on starting his own data engineering bootcamp Shifting from teaching ML and data science to teaching data engineering Making sure that students get hired Screening bootcamp applicants Knowing when it’s time to apply for jobs The curriculum of JigsawLabs.io The market demand of Spark, Kafka, and Kubernetes (or lack thereof) Advice for data analysts that want to move into data engineering The market demand of ETL/ELT and DBT (or lack thereof) The importance of Python, SQL, and data modeling for data engineering roles Interview expectations How to get started in teaching The challenges of being a one-person company Teaching fundamentals vs the “shiny new stuff” JigsawLabs.io Finding Jeff online

Links: 

Jigsaw Labs: https://www.jigsawlabs.io/free Teaching my mom to code: https://www.youtube.com/watch?v=OfWwfTXGjBM Getting a Data Engineering Job Webinar with Jeff Katz: https://www.eventbrite.de/e/getting-a-data-engineering-job-tickets-310270877547

MLOps Zoomcamp: https://github.com/DataTalksClub/mlops-zoomcamp

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

Grokking Streaming Systems

A friendly, framework-agnostic tutorial that will help you grok how streaming systems work—and how to build your own! In Grokking Streaming Systems you will learn how to: Implement and troubleshoot streaming systems Design streaming systems for complex functionalities Assess parallelization requirements Spot networking bottlenecks and resolve back pressure Group data for high-performance systems Handle delayed events in real-time systems Grokking Streaming Systems is a simple guide to the complex concepts behind streaming systems. This friendly and framework-agnostic tutorial teaches you how to handle real-time events, and even design and build your own streaming job that’s a perfect fit for your needs. Each new idea is carefully explained with diagrams, clear examples, and fun dialogue between perplexed personalities! About the Technology Streaming systems minimize the time between receiving and processing event data, so they can deliver responses in real time. For applications in finance, security, and IoT where milliseconds matter, streaming systems are a requirement. And streaming is hot! Skills on platforms like Spark, Heron, and Kafka are in high demand. About the Book Grokking Streaming Systems introduces real-time event streaming applications in clear, reader-friendly language. This engaging book illuminates core concepts like data parallelization, event windows, and backpressure without getting bogged down in framework-specific details. As you go, you’ll build your own simple streaming tool from the ground up to make sure all the ideas and techniques stick. The helpful and entertaining illustrations make streaming systems come alive as you tackle relevant examples like real-time credit card fraud detection and monitoring IoT services. What's Inside Implement and troubleshoot streaming systems Design streaming systems for complex functionalities Spot networking bottlenecks and resolve backpressure Group data for high-performance systems About the Reader No prior experience with streaming systems is assumed. Examples in Java. About the Authors Josh Fischer and Ning Wang are Apache Committers, and part of the committee for the Apache Heron distributed stream processing engine. Quotes Very well-written and enjoyable. I recommend this book to all software engineers working on data processing. - Apoorv Gupta, Facebook Finally, a much-needed introduction to streaming systems—a must-read for anyone interested in this technology. - Anupam Sengupta, Red Hat Tackles complex topics in a very approachable manner. - Marc Roulleau, GIRO A superb resource for helping you grasp the fundamentals of open-source streaming systems. - Simon Verhoeven, Cronos Explains all the main streaming concepts in a friendly way. Start with this one! - Cicero Zandona, Calypso Technologies

Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications

Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide will teach you how to write fully functional applications, follow industry best practices, and learn the rationale behind these decisions. With Apache Spark as the foundation, you will follow a step-by-step journey beginning with the basics of data ingestion, processing, and transformation, and ending up with an entire local data platform running Apache Spark, Apache Zeppelin, Apache Kafka, Redis, MySQL, Minio (S3), and Apache Airflow. Apache Spark applications solve a wide range of data problems from traditional data loading and processing to rich SQL-based analysis as well as complex machine learning workloads and even near real-time processing of streaming data. Spark fits well as a central foundation for any data engineering workload. This book will teach you to write interactive Spark applications using Apache Zeppelin notebooks, write and compilereusable applications and modules, and fully test both batch and streaming. You will also learn to containerize your applications using Docker and run and deploy your Spark applications using a variety of tools such as Apache Airflow, Docker and Kubernetes. ​Reading this book will empower you to take advantage of Apache Spark to optimize your data pipelines and teach you to craft modular and testable Spark applications. You will create and deploy mission-critical streaming spark applications in a low-stress environment that paves the way for your own path to production. ​ What You Will Learn Simplify data transformation with Spark Pipelines and Spark SQL Bridge data engineering with machine learning Architect modular data pipeline applications Build reusable application components and libraries Containerize your Spark applications for consistency and reliability Use Docker and Kubernetes to deploy your Spark applications Speed up application experimentation using Apache Zeppelin and Docker Understand serializable structured data and data contracts Harness effective strategies for optimizing data in your data lakes Build end-to-end Spark structured streaming applications using Redis and Apache Kafka Embrace testing for your batch and streaming applications Deploy and monitor your Spark applications Who This Book Is For Professional software engineers who want to take their current skills and apply them to new and exciting opportunities within the data ecosystem, practicing data engineers who are looking for a guiding light while traversing the many challenges of moving from batch to streaming modes, data architects who wish to provide clear and concise direction for how best to harness anduse Apache Spark within their organization, and those interested in the ins and outs of becoming a modern data engineer in today's fast-paced and data-hungry world

Kafka in Action

Master the wicked-fast Apache Kafka streaming platform through hands-on examples and real-world projects. In Kafka in Action you will learn: Understanding Apache Kafka concepts Setting up and executing basic ETL tasks using Kafka Connect Using Kafka as part of a large data project team Performing administrative tasks Producing and consuming event streams Working with Kafka from Java applications Implementing Kafka as a message queue Kafka in Action is a fast-paced introduction to every aspect of working with Apache Kafka. Starting with an overview of Kafka's core concepts, you'll immediately learn how to set up and execute basic data movement tasks and how to produce and consume streams of events. Advancing quickly, you’ll soon be ready to use Kafka in your day-to-day workflow, and start digging into even more advanced Kafka topics. About the Technology Think of Apache Kafka as a high performance software bus that facilitates event streaming, logging, analytics, and other data pipeline tasks. With Kafka, you can easily build features like operational data monitoring and large-scale event processing into both large and small-scale applications. About the Book Kafka in Action introduces the core features of Kafka, along with relevant examples of how to use it in real applications. In it, you’ll explore the most common use cases such as logging and managing streaming data. When you’re done, you’ll be ready to handle both basic developer- and admin-based tasks in a Kafka-focused team. What's Inside Kafka as an event streaming platform Kafka producers and consumers from Java applications Kafka as part of a large data project About the Reader For intermediate Java developers or data engineers. No prior knowledge of Kafka required. About the Authors Dylan Scott is a software developer in the insurance industry. Viktor Gamov is a Kafka-focused developer advocate. At Confluent, Dave Klein helps developers, teams, and enterprises harness the power of event streaming with Apache Kafka. Quotes The authors have had many years of real-world experience using Kafka, and this book’s on-the-ground feel really sets it apart. - From the foreword by Jun Rao, Confluent Cofounder A surprisingly accessible introduction to a very complex technology. Developers will want to keep a copy close by. - Conor Redmond, InComm Payments A comprehensive and practical guide to Kafka and the ecosystem. - Sumant Tambe, Linkedin It quickly gave me insight into how Kafka works, and how to design and protect distributed message applications. - Gregor Rayman, Cloudfarms

Cassandra: The Definitive Guide, (Revised) Third Edition, 3rd Edition

Imagine what you could do if scalability wasn't a problem. With this hands-on guide, you'll learn how the Cassandra database management system handles hundreds of terabytes of data while remaining highly available across multiple data centers. This revised third edition--updated for Cassandra 4.0 and new developments in the Cassandra ecosystem, including deployments in Kubernetes with K8ssandra--provides technical details and practical examples to help you put this database to work in a production environment. Authors Jeff Carpenter and Eben Hewitt demonstrate the advantages of Cassandra's nonrelational design, with special attention to data modeling. Developers, DBAs, and application architects looking to solve a database scaling issue or future-proof an application will learn how to harness Cassandra's speed and flexibility. Understand Cassandra's distributed and decentralized structure Use the Cassandra Query Language (CQL) and cqlsh (the CQL shell) Create a working data model and compare it with an equivalent relational model Design and develop applications using client drivers Explore cluster topology and learn how nodes exchange data Maintain a high level of performance in your cluster Deploy Cassandra onsite, in the cloud, or with Docker and Kubernetes Integrate Cassandra with Spark, Kafka, Elasticsearch, Solr, and Lucene

Apache Pulsar in Action

Deliver lightning fast and reliable messaging for your distributed applications with the flexible and resilient Apache Pulsar platform. In Apache Pulsar in Action you will learn how to: Publish from Apache Pulsar into third-party data repositories and platforms Design and develop Apache Pulsar functions Perform interactive SQL queries against data stored in Apache Pulsar Apache Pulsar in Action is a comprehensive and practical guide to building high-traffic applications with Pulsar. You’ll learn to use this mature and battle-tested platform to deliver extreme levels of speed and durability to your messaging. Apache Pulsar committer David Kjerrumgaard teaches you to apply Pulsar’s seamless scalability through hands-on case studies, including IOT analytics applications and a microservices app based on Pulsar functions. About the Technology Reliable server-to-server messaging is the heart of a distributed application. Apache Pulsar is a flexible real-time messaging platform built to run on Kubernetes and deliver the scalability and resilience required for cloud-based systems. Pulsar supports both streaming and message queuing, and unlike other solutions, it can communicate over multiple protocols including MQTT, AMQP, and Kafka’s binary protocol. About the Book Apache Pulsar in Action teaches you to build scalable streaming messaging systems using Pulsar. You’ll start with a rapid introduction to enterprise messaging and discover the unique benefits of Pulsar. Following crystal-clear explanations and engaging examples, you’ll use the Pulsar Functions framework to develop a microservices-based application. Real-world case studies illustrate how to implement the most important messaging design patterns. What's Inside Publish from Pulsar into third-party data repositories and platforms Design and develop Apache Pulsar functions Create an event-driven food delivery application About the Reader Written for experienced Java developers. No prior knowledge of Pulsar required. About the Author David Kjerrumgaard is a committer on the Apache Pulsar project. He currently serves as a Developer Advocate for StreamNative, where he develops Pulsar best practices and solutions. Quotes Apache Pulsar in Action is able to seamlessly mix the theory and abstract concepts with the clarity of practical step-by-step examples. I’d recommend to anyone! - Matteo Merli, co-creator of Apache Pulsar Gives readers insights into how the ‘magic’ works… Definitely recommended. - Henry Saputra, Splunk A complete, practical, fun-filled book. - Satej Kumar Sahu, Honeywell A definitive guide that will help you scale your applications. - Alessandro Campeis, Vimar The best book to start working with Pulsar. - Emanuele Piccinelli, Empirix

Kafka: The Definitive Guide, 2nd Edition

Every enterprise application creates data, whether it consists of log messages, metrics, user activity, or outgoing messages. Moving all this data is just as important as the data itself. With this updated edition, application architects, developers, and production engineers new to the Kafka streaming platform will learn how to handle data in motion. Additional chapters cover Kafka's AdminClient API, transactions, new security features, and tooling changes. Engineers from Confluent and LinkedIn responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream processing applications with this platform. Through detailed examples, you'll learn Kafka's design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer. You'll examine: Best practices for deploying and configuring Kafka Kafka producers and consumers for writing and reading messages Patterns and use-case requirements to ensure reliable data delivery Best practices for building data pipelines and applications with Kafka How to perform monitoring, tuning, and maintenance tasks with Kafka in production The most critical metrics among Kafka's operational measurements Kafka's delivery capabilities for stream processing systems

Summary Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Your host is Tobias Macey and today I’m interviewing Denis Rystsov about implementing transactions in the RedPanda streaming engine

Interview

Introduction How did you get involved in the area of data management? Can you quickly recap what RedPanda is and the goals of the project? What are the use cases for transactions in a pub/sub messaging system?

What are the elements of streaming systems that make atomic transactions a complex problem?

What was the motivation for starting down the path of adding transactions to the RedPanda engine?

How did the constraint of supporting the Kafka API influence your implementation strategy for transaction semantics?

Cloud Native Integration with Apache Camel: Building Agile and Scalable Integrations for Kubernetes Platforms

Address the most common integration challenges, by understanding the ins and outs of the choices and exemplifying the solutions with practical examples on how to create cloud native applications using Apache Camel. Camel will be our main tool, but we will also see some complementary tools and plugins that can make our development and testing easier, such as Quarkus, and tools for more specific use cases, such as Apache Kafka and Keycloak. You will learn to connect with databases, create REST APIs, transform data, connect with message oriented software (MOMs), secure your services, and test using Camel. You will also learn software architecture patterns for integration and how to leverage container platforms, such as Kubernetes. This book is suitable for those who are eager to learn an integration tool that fits the Kubernetes world, and who want to explore the integration challenges that can be solved using containers. What You Will Learn Focus on how to solve integration challenges Understand the basics of the Quarkus as it’s the foundation for the application Acquire a comprehensive view on Apache Camel Deploy an application in Kubernetes Follow good practices Who This Book Is For Java developers looking to learn Apache Camel; Apache Camel developers looking to learn more about Kubernetes deployments; software architects looking to study integration patterns for Kubernetes based systems; system administrators (operations teams) looking to get a better understand of how technologies are integrated.

Summary Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data there has been a struggle to merge fast, incremental updates with large, historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures, and building support for arbitrary update and delete operations the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy! When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Vinoth Chandar about Apache Hudi, a data lake management layer for supporting fast and incremental updates to your tables.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Hudi is and the story behind it? What are the use cases that it is focused on supporting? There have been a number of alternative table formats introduced for data lakes recently. How does Hudi compare to projects like Iceberg, Delta Lake, Hive, etc.? Can you describe how Hudi is architected?

How have the goals and design of Hudi changed or evolved since you first began working on it? If you were to start the whole project over today, what would you do differently?

Can you talk through the lifecycle of a data record as it is ingested, compacted, and queried in a Hudi deployment? One of the capabilities that is interesting to explore is support for arbitrary record deletion. Can you talk through why this is a challenging operation in data lake architectures?

How does Hudi make that a tractable problem?

What are the data platform components that are needed to support an installation of Hudi? What is involved in migrating an existing data lake to use Hudi?

How would someone approach supporting heterogeneous table formats in their lake?

As someone who has invested a lot of time in technologies for supporting data lakes, what are your thoughts on the tradeoffs of data lake vs data warehouse and the current trajectory of the ecosystem? What are the most interesting, innovative, or unexpected ways that you have seen Hudi used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hudi? When is Hudi the wrong choice? What do you have planned for the future of Hudi?

Contact Info

Linkedin Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Hudi Docs Hudi Design & Architecture Incremental Processing CDC == Change Data Capture

Podcast Episodes

Oracle GoldenGate Voldemort Kafka Hadoop Spark HBase Parquet Iceberg Table Format

Data Engineering Episode

Hive ACID Apache Kudu

Podcast Episode

Vertica Delta Lake

Podcast Episode

Optimistic Concurrency Control MVCC == Multi-Version Concurrency Control Presto Flink

Podcast Episode

Trino

Podcast Episode

Gobblin LakeFS

Podcast Episode

Nessie

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale. Datastax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and make stremaing workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy! When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud-native streaming platform built on Apache Pulsar

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what the Astra platform is and the story behind it?

How does streaming fit into your overall product vision and the needs of your customers?

What was your selection process/criteria for adopting a streaming engine to complement your existing technology investment?

What are the core use cases that you are aiming to support with Astra Streaming?

Can you describe the architecture and automation of your hosted platform for Pulsar?

What are the integration points that you have built to make it work well with Cassandra?

What are some of the additional tools that you have added to your distribution of Pulsar to simplify operation and use?

What are some of the sharp edges that you have had to sand down as you have scaled up your usage of Pulsar?

What is the process for someone to adopt and integrate with your Astra Streaming service?

How do you handle migrating existing projects, particularly if they are using Kafka currently?

One of the capabilities that you highlight on the product page for Astra Streaming is the ability to execute machine learning workflows on data in flight. What are some of the supporting systems that are necessary to power that workflow?

What are the capabilities that are built into Pulsar that simplify the operational aspects of streaming ML?

What are the ways that you are engaging with and supporting the Pulsar community?

What are the near to medium term elements of the Pulsar roadmap that you are working toward and excited to incorporate into Astra?

What are the most interesting, innovative, or unexpected ways that you have seen Astra used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Astra?

When is Astra the wrong choice?

What do you have planned for the future of Astra?

Contact Info

Prabhat

LinkedIn @prabhatja on Twitter prabhatja on GitHub

Jonathan

LinkedIn @spyced on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Pulsar

Podcast Episode Streamnative Episode

Datastax Astra Streaming Datastax Astra DB Luna Streaming Distribution Datastax Cassandra Kesque (formerly Kafkaesque) Kafka RabbitMQ Prometheus Grafana Pulsar Heartbeat Pulsar Summit Pulsar Summit Presentation on Kafka Connectors Replicated Chaos Engineering Fallout chaos engineering tools Jepsen

Podcast Episode

Jack VanLightly

BookKeeper TLA+ Model

Change Data Capture

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Data Engineering is a broad and constantly evolving topic, which makes it difficult to teach in a concise and effective manner. Despite that, Daniel Molnar and Peter Fabian started the Pipeline Academy to do exactly that. In this episode they reflect on the lessons that they learned while teaching the first cohort of their bootcamp how to be effective data engineers. By focusing on the fundamentals, and making everyone write code, they were able to build confidence and impart the importance of context for their students.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Your host is Tobias Macey and today I’m interviewing Daniel Molnar and Peter Fabian about the lessons that they learned from their first cohort at the Pipeline data engineering academy

Interview

Introduction How did you get involved in the area of data management? Can you start by sharing the curriculum and learning goals for the students? How did you set a common baseline for all of the students to build from throughout the program?

What was your process for determining the structure of the tasks and the tooling used?

What were some of the topics/tools that the students had the most difficulty with?

What topics/tools were the easiest to grasp?

What are some difficulties that you encountered while trying to teach different concepts? How did you deal with the tension of teaching the fundamentals while tying them to toolchains that hiring managers are looking for? What are the successes that you had with this cohort and what changes are you making to your approach/curriculum to build on them? What are some of the failures that you encountered and what lessons have you taken from them? How did the pandemic impact your overall plan and execution of the initial cohort? What were the skills that you focused on for interview preparation? What level of ongoing support/engagement do you have with students once they complete the curriculum? What are the most interesting, innovative, or unexpected solutions that you saw from your students? What are the most interesting, unexpected, or challenging lessons that you have learned while working with your first cohort? When is a bootcamp the wrong approach for skill development? What do you have planned for the future of the Pipeline Academy?

Contact Info

Daniel

LinkedIn Website @soobrosa on Twitter

Peter

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Pipeline Academy

Blog

Scikit Pandas Urchin Kafka Three "C"s – Context, Confidence, and Code Prefect

Podcast Episode

Great Expectations

Podcast Episode Podcast.init Episode

Docker Kubernetes Become a Data Engineer On A Shoestring James Mickens

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Kafka As a Platform: The Ecosystem from The Ground Up by Robin Moffatt

Big Data Europe Onsite and online on 22-25 November in 2022 Learn more about the conference: https://bit.ly/3BlUk9q

Join our next Big Data Europe conference on 22-25 November in 2022 where you will be able to learn from global experts giving technical talks and hand-on workshops in the fields of Big Data, High Load, Data Science, Machine Learning and AI. This time, the conference will be held in a hybrid setting allowing you to attend workshops and listen to expert talks on-site or online.

Real-Time Stream Processing for Insurance & Health Care Kafka by Cristian Prevedello

Big Data Europe Onsite and online on 22-25 November in 2022 Learn more about the conference: https://bit.ly/3BlUk9q

Join our next Big Data Europe conference on 22-25 November in 2022 where you will be able to learn from global experts giving technical talks and hand-on workshops in the fields of Big Data, High Load, Data Science, Machine Learning and AI. This time, the conference will be held in a hybrid setting allowing you to attend workshops and listen to expert talks on-site or online.

Processing Billions of Events a Day Using Kafka and Kafka Streams by Carlos Manuel Duclos-Vergara

Big Data Europe Onsite and online on 22-25 November in 2022 Learn more about the conference: https://bit.ly/3BlUk9q

Join our next Big Data Europe conference on 22-25 November in 2022 where you will be able to learn from global experts giving technical talks and hand-on workshops in the fields of Big Data, High Load, Data Science, Machine Learning and AI. This time, the conference will be held in a hybrid setting allowing you to attend workshops and listen to expert talks on-site or online.