Which implementation of YStore should we use for a remote distributed YStore?

99sparsh commented 1 year ago

I am a software engineer at LinkedIn and my team is looking to leverage the functionalities of Real Time Collaboration (RTC) based on YJS within our custom Jupyter ecosystem. Our use case requires RTC to work between multiple clients each of whom are on their own separate server (K8s pod).

To do this, we added a MySQL implementation (modeled on the existing SQLite implementation) that stores the Y updates in a remote MySQL DB, so that every server uses the same DB for content syncing. Syncing does work, but it is highly unstable: sometimes changes do not sync, sometimes the cursor jumps to the beginning of the document, and sometimes other flaky edits appear.

Could this be due to MySQL not being the right choice for this use case? We would appreciate any thoughts on this or suggestions on how to better choose a remote distributed YStore.

davidbrochart commented 1 year ago

> my team is looking to leverage the functionalities of Real Time Collaboration (RTC) based on YJS within our custom Jupyter ecosystem.

Nice, I would be interested to know more, is it open-source?

> Our use case requires RTC to work between multiple clients each of whom are on their own separate server (K8s pod).

I think having multiple servers is the issue here. Synchronizing them "through the YStore" is problematic; it was not intended to play that role. A possible solution would be to have only one YStore, in one server. The other servers would appear as clients to this server, connecting their internal YDoc through a WebsocketProvider. You can have a look at jpterm for an example of how to do this. Feel free to reach out if you need more help.
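To make that topology concrete, here is a toy sketch in plain Python. It does not use Yjs or ypy-websocket at all: `Hub`, `PodClient`, and every method name are hypothetical stand-ins, with a list playing the role of the single YStore and another list playing the role of each pod's internal YDoc. It only illustrates the idea that one server owns the store while the other servers behave as ordinary clients.

```python
# Toy model only: no Yjs, no WebSocket. All names are hypothetical.

class Hub:
    """The one server that owns the store and relays updates."""

    def __init__(self):
        self.store = []    # stands in for the single YStore (log of updates)
        self.clients = []

    def connect(self, client):
        # A newly connected client first replays the stored history ...
        for update in self.store:
            client.receive(update)
        # ... and from then on receives live updates.
        self.clients.append(client)

    def publish(self, sender, update):
        self.store.append(update)  # only the hub writes to the store
        for client in self.clients:
            if client is not sender:
                client.receive(update)


class PodClient:
    """A per-user server (pod) acting as a plain client of the hub."""

    def __init__(self, name, hub):
        self.name = name
        self.hub = hub
        self.doc = []  # stands in for the pod's internal YDoc
        hub.connect(self)

    def edit(self, text):
        update = (self.name, text)
        self.doc.append(update)         # apply locally
        self.hub.publish(self, update)  # then send to the hub

    def receive(self, update):
        self.doc.append(update)


hub = Hub()
pod_a = PodClient("pod-a", hub)
pod_b = PodClient("pod-b", hub)
pod_a.edit("print('hello')")
pod_b.edit("x = 1")
pod_c = PodClient("pod-c", hub)  # late joiner catches up from the store
```

In this sketch no pod ever touches the store directly, so there is a single writer and a single ordering of updates. In a real deployment the relay would be the WebSocket server and each pod would hold a real YDoc connected through a WebsocketProvider.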

99sparsh commented 1 year ago

@davidbrochart thanks for the response. To give you more context:

My team owns a data analytics platform built on top of JupyterHub + JupyterLab in a Kubernetes environment. It is not open source yet, but we plan to open-source it in phases.

For each user that logs in, we spawn a new K8s pod in which a JupyterLab instance runs. Since there is a new K8s pod for every user, each user is totally isolated from every other user. The pods are ephemeral: when a user logs out, their pod is deleted, hence we cannot use the user pod as a source of truth for other users.

Now we want to enable RTC for our users. Since this is a JupyterLab concept and in our case we have different Lab instances, the default SQLite implementation does not work for us and we are looking into the detailed implementation of RTC with YJS.

I agree with you that we might be trying to make it work in a way that was not intended. However, since synchronization does happen via a remote MySQL DB, albeit unstably, we would like to confirm whether this approach could work, possibly with a DB faster than MySQL, or whether there are glaring red flags in the code that we have not spotted yet.

davidbrochart commented 1 year ago

> The pods are ephemeral and when a user logs out their pod is deleted hence we cannot use the user pod as a source of truth for other users.

This looks a lot like what JupyterHub does. But the way RTC is designed in JupyterLab is through a central server, to which multiple clients connect. There is some kind of conflict between complete user isolation and collaboration. I don't know exactly what you want to isolate. If it's the file system, then what are users going to collaborate on? Users should at least connect to the same WebSocket that we use for communicating Y messages. Otherwise they won't be able to share awareness, which is why you see cursors flying around, for instance. I suggest looking at Jupyverse for a modular way of building Jupyter services. Or have a look at JupyterLite, which has its own RTC implementation based on WebRTC and is hence completely distributed.
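A rough way to picture why a shared store alone cannot carry awareness: in this toy sketch (plain Python, not the ypy-websocket implementation; `Room`, `join`, and `handle` are hypothetical names), document updates are both persisted and relayed, while awareness messages such as cursor positions are only relayed to peers connected to the same room. The numeric values 0 and 1 follow the message types used by the Yjs WebSocket protocol for sync and awareness; everything else is an assumption made for illustration.

```python
from enum import IntEnum


class MessageType(IntEnum):
    SYNC = 0       # document updates
    AWARENESS = 1  # ephemeral state: cursors, user presence


class Room:
    """Toy stand-in for the shared WebSocket room (hypothetical)."""

    def __init__(self):
        self.persisted = []  # stands in for the YStore
        self.peers = {}      # peer name -> inbox (list of messages)

    def join(self, name):
        self.peers[name] = []
        return self.peers[name]

    def handle(self, sender, msg_type, payload):
        # Only document updates are written to the store.
        if msg_type is MessageType.SYNC:
            self.persisted.append(payload)
        # Every message is relayed to the other connected peers.
        for name, inbox in self.peers.items():
            if name != sender:
                inbox.append((msg_type, payload))


room = Room()
alice_inbox = room.join("alice")
bob_inbox = room.join("bob")
room.handle("alice", MessageType.SYNC, "insert 'x = 1'")
room.handle("alice", MessageType.AWARENESS, {"user": "alice", "cursor": 5})
```

Here Bob receives Alice's cursor only because he is connected to the same room; a peer that merely reads `room.persisted` would see the document update but never the cursor.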

99sparsh commented 1 year ago

Yes, we are using JupyterHub to manage the user pods. Each user pod has its own JupyterLab instance. This mechanism isolates the execution environment: each user can use a custom environment (Docker image), and their executions are handled by their own kernel, using the CPU and memory resources assigned to their own pod.

The file system is abstracted out and is shared amongst the users so they can edit contents but execution will happen in their own pod.

Yes, we are aware that users need to be on the same WebSocket for the identities of active users to be visible in the UI. We were first trying to address the problem of content sync (even without awareness, that would be a step forward), hence the experiment with sharing the YStore.

Thanks for the pointers on Jupyverse and JupyterLite. We will look in that direction to understand how they work and whether they can be adapted to our use case!

y-crdt / ypy-websocket

Which implementation of YStore should we use for a remote distributed YStore? #78