talk-data.com

David Brochart

Speaker

2 talks

Talks & appearances

Parallel processing using CRDTs

Beyond embarrassingly parallel problems, workers must share data to do something useful. This can be done by:
- sharing memory between threads, which requires guarding concurrent access to shared data to avoid race conditions;
- copying memory to subprocesses, which requires synchronizing the copies whenever the data is mutated.
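As a minimal sketch of the first option, the snippet below has several threads increment one shared counter, with a lock guarding the read-modify-write step. The names `add_many` and `run_locked_counter` are hypothetical and chosen only for illustration.

```python
import threading


def add_many(counter, lock, n):
    """Increment the shared counter n times, holding the lock for each update."""
    for _ in range(n):
        with lock:  # without the lock, the read-modify-write below could race
            counter["value"] += 1


def run_locked_counter(workers=4, n=10_000):
    """Start `workers` threads that all update the same dict and return the total."""
    counter = {"value": 0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=add_many, args=(counter, lock, n))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

The lock keeps the result correct, but it also serializes every update, which is the kind of cost and complexity the abstract refers to.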

In Python, threads are not an option for CPU-bound work because of the GIL (global interpreter lock), which prevents true parallelism. This might change in the future with the removal of the GIL, but the usual problems of multithreading will then appear, such as having to use locks and manage their complexity. Subprocesses don't suffer from the GIL, but they usually need to go through a database to share data, which is often too slow. Data structures such as the HAMT (hash array mapped trie) have been used to share data efficiently and safely in immutable form, removing the need for locks. In this talk we will show how CRDTs (conflict-free replicated data types) can be used for the same purpose.
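To make the idea concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is an illustration under stated assumptions, not the approach presented in the talk: the class name `GCounter` and its methods are hypothetical. Each worker keeps its own replica and only ever writes to its own slot, so no lock is needed, and replicas converge by merging with an element-wise maximum.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per worker, merged by element-wise max."""

    worker_id: str
    counts: dict = field(default_factory=dict)

    def increment(self, amount=1):
        """Add to this worker's own slot only; other slots are never written locally."""
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.worker_id] = self.counts.get(self.worker_id, 0) + amount

    def merge(self, other):
        """Fold another replica's state into this one (commutative and idempotent)."""
        for wid, n in other.counts.items():
            self.counts[wid] = max(self.counts.get(wid, 0), n)

    @property
    def value(self):
        """Total across all workers seen so far."""
        return sum(self.counts.values())


# Two workers update their own replicas independently, then exchange state.
a = GCounter("worker-a")
b = GCounter("worker-b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
b.merge(a)  # merging the same state twice changes nothing
```

After the exchange both replicas report the same total, whatever the order or number of merges. In a multi-process setting the `counts` dictionaries would be what gets sent between workers.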

The Jupyter stack has undergone a significant transformation in recent years with the integration of collaborative editing features: users can now modify a shared document and see each other's changes in real time, with a user experience akin to that of Google Docs. The underlying technology relies on special data structures called Conflict-free Replicated Data Types (CRDTs), which automatically resolve conflicts when concurrent changes are made. This allows data to be distributed rather than centralized on a server, letting clients work as if the data were local rather than remote. In this talk, we look at the new possibilities that CRDTs unlock and how they are redefining Jupyter's architecture. Several use cases are presented: a suggestion system similar to that of Google Docs, a chat system allowing collaboration with an AI agent, an execution model allowing full notebook state recovery, and a collaborative widget model. We also look at the benefits of using CRDTs in JupyterLite, where users can interact without a server. This may be a great example of a distributed system in which every user owns their data and shares it with their peers.
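Automatic conflict resolution can be illustrated with a last-writer-wins map, one of the simplest CRDTs. This is a sketch for illustration only and not the data structure Jupyter actually uses: the class `LWWMap`, the key `"cell-1"`, and the replica names are all hypothetical. Each write is tagged with a logical clock and the replica's identifier, and a merge keeps the entry with the larger tag, so every replica picks the same winner without asking a server.

```python
class LWWMap:
    """Last-writer-wins map: each key stores (clock, replica_id, value)."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.clock = 0  # logical clock, advanced on every local write
        self.entries = {}

    def set(self, key, value):
        """Record a local write tagged with the next clock tick and our id."""
        self.clock += 1
        self.entries[key] = (self.clock, self.replica_id, value)

    def get(self, key):
        """Return the current value for key, or None if it was never set."""
        entry = self.entries.get(key)
        return None if entry is None else entry[2]

    def merge(self, other):
        """Keep, per key, the entry with the larger (clock, replica_id) tag."""
        for key, entry in other.entries.items():
            mine = self.entries.get(key)
            if mine is None or entry[:2] > mine[:2]:
                self.entries[key] = entry
        # Move the clock forward so later local writes outrank what was merged.
        self.clock = max(self.clock, other.clock)


# Two users edit the same key concurrently, then sync in both directions.
alice = LWWMap("alice")
bob = LWWMap("bob")
alice.set("cell-1", "print(1)")
bob.set("cell-1", "print(2)")
alice.merge(bob)
bob.merge(alice)
```

Both writes carry clock value 1, so the tie is broken by replica identifier and both replicas end up holding Bob's value. Collaborative text editing needs richer CRDTs than this, since last-writer-wins discards one of the concurrent edits instead of interleaving them.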