talk-data.com

Speaker

Yingjun Wu

11 talks

CEO, RisingWave

Bio from: Data Streaming Lakehouse Tour (Paris)

Talks & appearances

11 activities · Newest first

At first, streaming Postgres changes into Iceberg feels like a no-brainer. You spin up Debezium or Kafka Connect, point it at Iceberg, and it all looks boringly straightforward. The surprise comes once the pipeline hits production. Replication slots vanish or start filling up WAL space, LSNs don't line up and cause duplicates or gaps, and Iceberg sinks fail in ways that push back all the way to your primary database. Then you throw in schema changes, backfills, and compaction, and suddenly the "boring" pipeline becomes a source of late-night firefights. In this talk, I'll share real stories from running Postgres to Iceberg CDC pipelines in production. We'll look at the unexpected problems that show up, why they happen, and the strategies that actually helped keep things stable. If you've ever thought of Postgres -> Iceberg as just plumbing, this session will show you why it's not so boring after all.
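
As a concrete illustration of the replication-slot failure mode (a standard PostgreSQL catalog query, not something from the talk itself), the following shows how much WAL each slot is holding back, which is usually the first warning sign that a CDC consumer has stalled:

-- How much WAL is each replication slot retaining? (PostgreSQL 10+)
-- A large or growing value for an inactive slot means the consumer has
-- stalled and WAL is piling up on the primary.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

Alerting on retained_wal for inactive slots is one of the simpler ways to avoid the disk-full incidents described above.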

Stream processing systems have traditionally relied on local storage engines such as RocksDB to achieve low latency. While effective in single-node setups, this model doesn't scale well in the cloud, where elasticity and separation of compute and storage are essential. In this talk, we'll explore how RisingWave rethinks the architecture by building directly on top of S3 while still delivering sub-100 ms latency. At the core is Hummock, a log-structured state engine designed for object storage. Hummock organizes state into a three-tier hierarchy: in-memory cache for the hottest keys, disk cache managed by Foyer for warm data, and S3 as the persistent cold tier. This approach ensures queries never directly hit S3, avoiding its variable performance. We'll also examine how remote compaction offloads heavy maintenance tasks from query nodes, eliminating interference between user queries and background operations. Combined with fine-grained caching policies and eviction strategies, this architecture enables both consistent query performance and cloud-native elasticity. Attendees will walk away with a deeper understanding of how to design streaming systems that balance durability, scalability, and low latency in an S3-based environment.

Everyone makes streaming sound simple – until you try bolting it onto your batch pipeline and it blows up. This talk skips the marketing gloss and gets into the real work: how to make batch and streaming actually play nice. I’ll walk through the essentials, then get into the messy parts – compaction, primary key updates, exactly-once delivery, and keeping your compute bill from spiraling. You’ll learn how to plug RisingWave into your existing stack and get real-time results without rewriting everything. It’s based on what we’ve seen in production – real problems, real fixes, no buzzwords.
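
As a rough sketch of what "plugging RisingWave into your existing stack" looks like in its PostgreSQL-style SQL (the topic name, schema, and connector options below are illustrative, not taken from the talk):

-- Ingest an existing Kafka topic without touching the upstream producers.
CREATE SOURCE orders (
    order_id BIGINT,
    amount DOUBLE PRECISION,
    created_at TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

-- Maintain a continuously updated aggregate that downstream tools can query
-- like any other Postgres table.
CREATE MATERIALIZED VIEW revenue_per_minute AS
SELECT window_start, SUM(amount) AS revenue
FROM TUMBLE(orders, created_at, INTERVAL '1 MINUTE')
GROUP BY window_start;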

In this talk, we will discuss how we implemented the Iceberg connector in Rust, replacing the original Java-wrapped version to address performance bottlenecks in serialization and memory usage. By following the Apache Iceberg specification, we built a native Rust connector that supports Iceberg’s advanced features, such as multi-catalog compatibility and streaming updates. We’ve contributed this new version to the apache/iceberg-rust repository, and will share insights into the architectural improvements and best practices for leveraging Iceberg in streaming environments.
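
From the user's side, the Rust-native connector surfaces as an Iceberg sink in SQL. The sketch below shows the general shape only; the option names are approximate, real deployments also need catalog and S3 credential settings, and it reuses the revenue_per_minute view from the earlier example:

-- Stream the changes of a materialized view into an Iceberg table.
-- Option names are illustrative; check the current documentation for the
-- exact parameters of the catalog in use.
CREATE SINK revenue_iceberg FROM revenue_per_minute
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'window_start',
    warehouse.path = 's3://my-bucket/warehouse',
    database.name = 'analytics',
    table.name = 'revenue_per_minute'
);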

Summary

Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what RisingWave is and the story behind it?
There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses?

What are some of the platforms/architectures that teams are replacing with RisingWave?

What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem?
Can you describe how RisingWave is architected and implemented?

How have the design and goals/scope changed since you first started working on it?
What are the core design philosophies that you rely on to prioritize the ongoing development of the project?

What are the most complex engineering challenges that you have had to address in the creation of RisingWave?
Can you describe a typical workflow for teams that are building on top of RisingWave?

What are the user/developer experience elements that you have prioritized most highly?

What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine?
What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave?
When is RisingWave the wrong choice?
What do you have planned for the future of RisingWave?

Contact Info

yingjunwu on GitHub
Personal Website
LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows.

Yingjun Wu: Unlocking Real-time Insights: Enhancing Your Databases With Stream Processing

Join Yingjun Wu as we unlock the power of real-time insights in 'Unlocking Real-time Insights: Enhancing Your Databases With Stream Processing.' 🚀 Explore how to leverage Change Data Capture (CDC) and modern SQL streaming databases to revolutionize your data analytics, and discover the magic of materialized views for instant, actionable insights. 📈💡 #RealTimeInsights #streamprocessing
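
For readers who have not seen the pattern before, here is a minimal, hypothetical sketch of the CDC-plus-materialized-view idea in a PostgreSQL-compatible streaming database such as RisingWave; the connection details and connector options are placeholders, not taken from the talk:

-- Mirror a PostgreSQL table via CDC; the streaming database keeps it in sync.
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR,
    plan VARCHAR
) WITH (
    connector = 'postgres-cdc',
    hostname = 'db.internal',
    port = '5432',
    username = 'replicator',
    password = 'secret',
    database.name = 'app',
    schema.name = 'public',
    table.name = 'users'
);

-- The materialized view is updated incrementally as changes arrive, so the
-- answer stays fresh without re-running a batch job.
CREATE MATERIALIZED VIEW users_per_plan AS
SELECT plan, COUNT(*) AS user_count
FROM users
GROUP BY plan;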

✨ H I G H L I G H T S ✨

🙌 A huge shoutout to all the incredible participants who made Big Data Conference Europe 2023 in Vilnius, Lithuania, from November 21-24, an absolute triumph! 🎉 Your attendance and active participation were instrumental in making this event so special. 🌍

Don't forget to check out the session recordings from the conference to relive the valuable insights and knowledge shared! 📽️

Once again, THANK YOU for playing a pivotal role in the success of Big Data Conference Europe 2023. 🚀 See you next year for another unforgettable conference! 📅 #BigDataConference #SeeYouNextYear

Yingjun Wu: Real-time OLAP and Stream Processing: Friends or Foes?

Join Yingjun Wu in unlocking the power of real-time insights through 'Unlocking Real-time Insights: Enhancing Your Databases With Stream Processing.' 📊🔓 Discover how to harness Change Data Capture (CDC) and modern SQL streaming databases to drive real-time analytics, enabling businesses to stay ahead in the data-driven world. 🚀💡 #RealTimeInsights #streamprocessing

Panel Discussion | Current Methods for Building Data Pipelines

Join the insightful panel discussion featuring Antonio Murgia, Bongani Shongwe, Elena Lazovik, Ravi Bhatt, and Yingjun Wu as they explore the Current Methods for Building Data Pipelines. 📊🗣️ Gain valuable insights into the evolving landscape of data pipeline development and best practices. 🌐🚀 #DataPipelines #paneldiscussion

Abstract: RisingWave is an open-source streaming database designed from scratch for the cloud. It implements a Snowflake-style storage-compute separation architecture to reduce cost, and provides users with a PostgreSQL-like experience for stream processing. Over the last three years, RisingWave has evolved from a one-person project to a rapidly growing product deployed by nearly 100 enterprises and startups. But the journey of building RisingWave has been full of challenges. In this talk, I'd like to share the lessons we've learned along four dimensions: 1) the decoupled compute-storage architecture, 2) the balance between stream processing and OLAP, 3) the Rust ecosystem, and 4) product positioning. I will dive deep into the technical details and then share my views on the future of stream processing.