fcollonval opened this issue 2 years ago
These are very good points @fcollonval, thanks for pointing that out. I'll open a PR for that. It also goes in the same direction as #38. cc @ellisonbg
I would go even further than what @fcollonval is describing. I don't think that we even need to gather at all between updates to the peers and saving to the backend. We could do:
```python
async for message in websocket:
    await self._process_queue.put(message)
    await self._broadcast_queue.put(message)
```
and then have two completely independent asyncio tasks: one consuming `_process_queue`:

```python
message = await self._process_queue.get()
await process_message(message, room.ydoc, websocket)
```

and another one consuming `_broadcast_queue` and forwarding the updates to the peers.
In doing so, we would still prevent one of the two tasks from being blocked if the other one is (momentarily) choking on updates and accumulating a longer queue.
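For illustration, a minimal sketch of the two consumers, assuming `process_message` from the snippet above plus a `peers` collection with an async `send` per peer (these names are placeholders, not the actual ypy-websocket API):

```python
import asyncio

async def process_task(process_queue: asyncio.Queue, ydoc, websocket):
    # Apply incoming Y updates to the shared document, in arrival order.
    while True:
        message = await process_queue.get()
        await process_message(message, ydoc, websocket)

async def broadcast_task(broadcast_queue: asyncio.Queue, peers):
    # Forward each incoming Y update to all connected peers.
    while True:
        message = await broadcast_queue.get()
        for peer in peers:
            await peer.send(message)

# Started as two independent tasks, so a backlog in one queue never blocks the other:
# asyncio.create_task(process_task(self._process_queue, room.ydoc, websocket))
# asyncio.create_task(broadcast_task(self._broadcast_queue, peers))
```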
Another thing we could experiment with is to synchronously use `put_nowait` instead of `put`. If `queue.Full` is raised for one of the two queues, this means that it is not being emptied quickly enough and the server is choking on Y updates. This may be a good way to detect a performance issue in either of those tasks.
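A sketch of that experiment, assuming bounded queues (with `asyncio.Queue` the exception is `asyncio.QueueFull`; the `maxsize` value, the `enqueue` helper and the logging are illustrative, not part of ypy-websocket):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

# put_nowait only raises if the queue is bounded, so give it a maxsize.
process_queue: asyncio.Queue = asyncio.Queue(maxsize=1024)
broadcast_queue: asyncio.Queue = asyncio.Queue(maxsize=1024)

def enqueue(message: bytes) -> None:
    for name, queue in (("process", process_queue), ("broadcast", broadcast_queue)):
        try:
            queue.put_nowait(message)
        except asyncio.QueueFull:
            # The corresponding task is not draining its queue fast enough:
            # the server is choking on Y updates.
            logger.warning("%s queue is full, server cannot keep up with Y updates", name)
```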
@afshin
Great discussion! I like the idea of @SylvainCorlay of having two async queues for processing and broadcasting the messages. Another concern that I have is the presence of blocking calls in the server or server extensions. Those would quickly slow everything down. We should probably spend some time profiling the server to understand where blocking calls are happening.
> Another concern that I have is the presence of blocking calls in the server or server extensions.
It could be interesting to compare with Jupyverse. In particular, the file ID service will be fully async, while `jupyter-server-fileid` is not async.
Yet another alternative to @SylvainCorlay's proposal is to just create background tasks for processing messages, to update our internal state and to broadcast to clients. They would be like "fire and forget" tasks. Indeed, the idea behind putting the updates in a queue before processing them seems to be that they have to be processed in order, but one of the characteristics of CRDTs is that they can be processed out of order.
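A rough sketch of that fire-and-forget variant (the `process_message` and `peer.send` names are placeholders; keeping a reference to each task is needed so it is not garbage collected before it completes):

```python
import asyncio

background_tasks: set = set()

def fire_and_forget(coro) -> None:
    # Schedule the coroutine and keep a strong reference until it completes.
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)

async def on_message(message: bytes, ydoc, websocket, peers) -> None:
    # CRDT updates can be applied out of order, so no queue is needed:
    fire_and_forget(process_message(message, ydoc, websocket))
    for peer in peers:
        fire_and_forget(peer.send(message))
```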
I made these changes part of #38.
In @SylvainCorlay's proposal, a task consumes the `_broadcast_queue` and forwards the updates to the peers, I suppose sequentially? If so, then we have the same potential issue of one slow client slowing down the whole broadcast. To solve that, we would need to forward the updates to the peers in individual tasks, which is what I did in #38.
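For example, the broadcast-queue consumer could spawn one send task per peer instead of awaiting each send sequentially (a sketch only, not necessarily how #38 implements it; `peers` and `peer.send` are placeholders):

```python
import asyncio

async def broadcast_task(broadcast_queue: asyncio.Queue, peers):
    while True:
        message = await broadcast_queue.get()
        for peer in peers:
            # One task per peer: a slow or stuck peer only delays its own send.
            asyncio.create_task(peer.send(message))
```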
Let me know if I'm missing something.
Would `Websocket.send` block the processing of the broadcast queue for a slow client?
`ypy-websocket` is agnostic to the WebSocket implementation.

In jupyverse we use websockets, where `send` is async, which suggests that it's not that fast, maybe checking for connection and eventually timing out. So I don't think queuing calls to `send` is a good idea.

In jupyter-server we use Tornado's WebSocket implementation, where `write_message` is not async. I don't know if they use threads under the hood.

In both implementations, I don't think it's safe to assume that sending data on a WebSocket is fast.
> In jupyverse we use websockets, where `send` is async, which suggests that it's not that fast, maybe checking for connection and eventually timing out.
OK, I did not know that `send` could potentially be slow. Thanks for the explanation.
The publication of updates is done one at a time:
https://github.com/y-crdt/ypy-websocket/blob/357d2a6bf2e221100f09fe01c3c6fc1e9534d472/ypy_websocket/websocket_server.py#L113-L123
We should do those updates concurrently with `asyncio.gather` to get them out as quickly as possible and not be stuck behind a slow client. Moreover, we should update the document with higher priority, because it is the reference that will be loaded by any new client before it receives deltas.
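A minimal sketch of that change, assuming a list of peer `websockets` each with an async `send` (names are illustrative); the `return_exceptions` handling is an added assumption so one failing peer does not abort the sends to the others:

```python
import asyncio

async def publish_update(websockets, message: bytes) -> None:
    # Send the update to all peers concurrently rather than one after the other.
    results = await asyncio.gather(
        *(websocket.send(message) for websocket in websockets),
        return_exceptions=True,
    )
    for websocket, result in zip(websockets, results):
        if isinstance(result, Exception):
            # A misbehaving peer should not prevent the update from reaching the others.
            print(f"failed to send update to {websocket}: {result}")
```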
I think that part of the code is partly responsible for a data loss case seen when 20 people were collaborating simultaneously. What happened is that some clients (we cannot know if all) were still receiving the deltas, but the file on disk (which is a regular dump of the in-memory ydoc) stopped being updated. And if a new client connected, the document at the latest dumped version was the one pushed to it.
My best guess (unfortunately we did not see any error) is that one client was blocking and so the code after the client loop was never executed.
cc: @davidbrochart @hbcarlos