Data Streaming

Achieving Sub-100 ms Real-Time Stream Processing with an S3-Native Architecture

2025-09-16 · Data Builders’ Evening: Architecture, Engineering & Beyond

talk

by Yingjun Wu (RisingWave Labs)

S3 hummock log-structured state engine object storage

Stream processing systems have traditionally relied on local storage engines such as RocksDB to achieve low latency. While effective in single-node setups, this model doesn't scale well in the cloud, where elasticity and separation of compute and storage are essential. In this talk, we'll explore how RisingWave rethinks the architecture by building directly on top of S3 while still delivering sub-100 ms latency. At the core is Hummock, a log-structured state engine designed for object storage. Hummock organizes state into a three-tier hierarchy: in-memory cache for the hottest keys, disk cache managed by Foyer for warm data, and S3 as the persistent cold tier. This approach ensures queries never directly hit S3, avoiding its variable performance. We'll also examine how remote compaction offloads heavy maintenance tasks from query nodes, eliminating interference between user queries and background operations. Combined with fine-grained caching policies and eviction strategies, this architecture enables both consistent query performance and cloud-native elasticity. Attendees will walk away with a deeper understanding of how to design streaming systems that balance durability, scalability, and low latency in an S3-based environment.
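
The three-tier read path described in the abstract can be sketched roughly as follows. This is a minimal illustration, not Hummock's or Foyer's actual implementation: `LruTier`, `TieredStore`, and `fetch_from_object_store` are hypothetical names, both cache tiers are modeled as plain in-process LRU maps, and a miss in both tiers falls through to the object store, whereas the talk describes a design that keeps queries from hitting S3 directly.

```python
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class LruTier:
    """A fixed-capacity LRU map standing in for one cache tier (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> Optional[Tuple[str, bytes]]:
        """Insert a value and return the evicted (key, value) pair, if any."""
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            return self._items.popitem(last=False)  # drop least recently used
        return None


class TieredStore:
    """Memory tier -> disk tier -> object store, checked in that order."""

    def __init__(
        self,
        memory_capacity: int,
        disk_capacity: int,
        fetch_from_object_store: Callable[[str], bytes],
    ) -> None:
        self.memory = LruTier(memory_capacity)
        self.disk = LruTier(disk_capacity)  # stands in for a local disk cache
        self.fetch_from_object_store = fetch_from_object_store
        self.object_store_reads = 0

    def _promote(self, key: str, value: bytes) -> None:
        # Entries evicted from memory are demoted to the disk tier.
        # Entries evicted from disk are simply dropped: the object store
        # still holds the durable copy.
        evicted = self.memory.put(key, value)
        if evicted is not None:
            self.disk.put(*evicted)

    def get(self, key: str) -> bytes:
        value = self.memory.get(key)
        if value is not None:
            return value  # hottest keys: served from memory
        value = self.disk.get(key)
        if value is None:
            # Cold read: the slow, variable-latency path.
            value = self.fetch_from_object_store(key)
            self.object_store_reads += 1
            self.disk.put(key, value)
        self._promote(key, value)
        return value
```

With a dictionary standing in for the bucket, `TieredStore(1, 2, bucket.__getitem__)` serves repeated reads of a key from the cache tiers, and `object_store_reads` counts how often the cold path was taken.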

Tackling Real Time Streaming Data With SQL Using RisingWave

2024-02-04 · Data Engineering Podcast

podcast_episode

by Yingjun Wu (RisingWave Labs), Tobias Macey

AI/ML Analytics Cloud Computing Dagster Data Engineering Data Lake Data Lakehouse Data Management Delta DWH GitHub Hudi

Summary

Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what RisingWave is and the story behind it?

There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses?

What are some of the platforms/architectures that teams are replacing with RisingWave?

What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem?

Can you describe how RisingWave is architected and implemented?

How have the design and goals/scope changed since you first started working on it?

What are the core design philosophies that you rely on to prioritize the ongoing development of the project?

What are the most complex engineering challenges that you have had to address in the creation of RisingWave?

Can you describe a typical workflow for teams that are building on top of RisingWave?

What are the user/developer experience elements that you have prioritized most highly?

What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine?

What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave?

When is RisingWave the wrong choice?

What do you have planned for the future of RisingWave?

Contact Info

yingjunwu on GitHub · Personal Website · LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows.

Yingjun Wu: Unlocking Real-time Insights: Enhancing Your Databases With Stream Processing

2023-12-07 · DATA MINER Big Data Europe Conference 2020

video

by Yingjun Wu (RisingWave Labs)

Analytics Big Data Data Analytics SQL

Join Yingjun Wu as we unlock the power of real-time insights in 'Unlocking Real-time Insights: Enhancing Your Databases With Stream Processing.' 🚀 Explore how to leverage Change Data Capture (CDC) and modern SQL streaming databases to revolutionize your data analytics, and discover the magic of materialized views for instant, actionable insights. 📈💡 #RealTimeInsights #streamprocessing
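
As a rough illustration of the idea behind materialized views fed by Change Data Capture, the sketch below keeps a per-customer total current by applying one row-level change at a time instead of re-running the query. It is a simplified model under stated assumptions, not how RisingWave or any particular CDC tool works: the `ChangeEvent` shape, the `RevenueByCustomerView` class, and the `customer`/`amount` columns are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ChangeEvent:
    """One row-level change (hypothetical shape, loosely modeled on CDC records)."""
    before: Optional[dict]  # row image before the change; None for an insert
    after: Optional[dict]   # row image after the change; None for a delete


class RevenueByCustomerView:
    """Incrementally maintains SUM(amount) GROUP BY customer."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = {}
        self._row_counts: Dict[str, int] = {}

    def _add(self, row: dict) -> None:
        customer = row["customer"]
        self.totals[customer] = self.totals.get(customer, 0) + row["amount"]
        self._row_counts[customer] = self._row_counts.get(customer, 0) + 1

    def _retract(self, row: dict) -> None:
        customer = row["customer"]
        self.totals[customer] -= row["amount"]
        self._row_counts[customer] -= 1
        if self._row_counts[customer] == 0:
            # No rows left for this group, so it disappears from the view.
            del self.totals[customer]
            del self._row_counts[customer]

    def apply(self, event: ChangeEvent) -> None:
        # An update is handled as "retract the old row, then add the new one",
        # which also covers updates that move a row between groups.
        if event.before is not None:
            self._retract(event.before)
        if event.after is not None:
            self._add(event.after)
```

Each call to `apply` touches only the affected group, so reading `totals` is a lookup rather than a scan of the underlying table. Amounts are kept as integers (for example, cents) to avoid floating-point drift when values are retracted.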

✨ H I G H L I G H T S ✨

🙌 A huge shoutout to all the incredible participants who made Big Data Conference Europe 2023 in Vilnius, Lithuania, from November 21-24, an absolute triumph! 🎉 Your attendance and active participation were instrumental in making this event so special. 🌍

Don't forget to check out the session recordings from the conference to relive the valuable insights and knowledge shared! 📽️

Once again, THANK YOU for playing a pivotal role in the success of Big Data Conference Europe 2023. 🚀 See you next year for another unforgettable conference! 📅 #BigDataConference #SeeYouNextYear

Yingjun Wu: Real time OLAP and Stream Processing Friends or Foes

2023-12-05 · DATA MINER Big Data Europe Conference 2020

video

by Yingjun Wu (RisingWave Labs)

Analytics Big Data SQL

Join Yingjun Wu in unlocking the power of real-time insights through 'Unlocking Real-time Insights: Enhancing Your Databases With Stream Processing.' 📊🔓 Discover how to harness Change Data Capture (CDC) and modern SQL streaming databases to drive real-time analytics, enabling businesses to stay ahead in the data-driven world. 🚀💡 #RealTimeInsights #streamprocessing

On the Journey of Redefining Stream Processing: What We Learned from Building RisingWave?

2023-11-30 · Berlin Open Source Data Infrastructure Meetup - November 2023

talk

by Yingjun Wu (RisingWave Labs)

Cloud Computing Rust Snowflake postgresql

Abstract: RisingWave is an open-source streaming database designed from scratch for the cloud. It implements a Snowflake-style storage-compute separation architecture to reduce performance cost, and provides users with a PostgreSQL-like experience for stream processing. Over the last three years, RisingWave has evolved from a one-person project into a rapidly growing product deployed by nearly 100 enterprises and startups. But the journey of building RisingWave has been full of challenges. In this talk, I'd like to share the lessons we've learned along four dimensions: 1) the decoupled compute-storage architecture, 2) the balance between stream processing and OLAP, 3) the Rust ecosystem, and 4) product positioning. I will dive deep into the technical details and then share my views on the future of stream processing.

talk-data.com
