squash doc history in separate task

dlqqq commented 1 year ago

Squashes document history in a separate scheduled task rather than on write. This will significantly improve performance for document writes in the case where the time since last update exceeds the document TTL.

Open concerns

Is there any good way of stubbing out asyncio.sleep() in the unit tests? Testing this is tricky because it's not as simple as doing unittest.mock.patch("asyncio.sleep"), since that causes the document history to be squashed on every write. What I'm looking for is a utility that lets me imperatively advance the event loop clock, e.g. loop.advance(123). I found asynctest, but it doesn't seem to work and is abandoned.

davidbrochart commented 1 year ago

I like that we don't need to query the last timestamp anymore.This should now make it easy to implement document TTL for FileYStore. Do you want to do it? One thing I'm concerned about is what happens if there is a write at the same time as we're squashing updates? Could you add a test for that?

dlqqq commented 1 year ago

@davidbrochart

I like that we don't need to query the last timestamp anymore.This should now make it easy to implement document TTL for FileYStore. Do you want to do it?

Oh yeah, totally forgot about that. Let me do so.

One thing I'm concerned about is what happens if there is a write at the same time as we're squashing updates? Could you add a test for that?

Should be the exact same behavior as a write occurring after a squash, since aiosqlite essentially just stores SQL statements in a queue and polls from it whenever it's finished executing the current one. But yes, I can add a test for this.

dlqqq commented 1 year ago

@davidbrochart Wait, we still need a timestamp if we want to handle the case where the server is not kept alive during a document's TTL. Take this scenario:

add a bunch of updates
shut down the server
repeat

In this scenario, the document history grows without bound. I think we need to handle this case as it's fairly common for users running Jupyter locally on a laptop (in which case disk space is also tight), but I'd like @ellisonbg's opinion on this matter as well.

dlqqq commented 1 year ago

To handle the above case, we'd need to squash on __init__() and do the same query we were doing before.

ellisonbg commented 1 year ago

I agree we do need the timestamp in cases where the server restarts, but why wouldn't we need it otherwise? I am trying to understand the logic here. My idea of how this would work is this:

Every so often (configurable) the background task check to see how long ago the last update for a document was made.
If that time interval is longer than the TTL (also configurable) delete the history.

This logic is needed to ensure that this cleanup doesn't happen while users are editing it.

ellisonbg commented 1 year ago

In the current state of the code, is the idea that you simply count the number of updates to the doc and if it hasn't changed in the TTL then you collapse?

dlqqq commented 1 year ago

@ellisonbg The implementation in this PR basically just creates a task that sleeps for the duration of the document TTL and then squashes the document history . Any updates cancel the existing task and create a new task. Therefore having a process repeatedly querying the database isn't needed, and thus the timestamp isn't needed for this case. This is how cron is typically implemented.

ellisonbg commented 1 year ago

I think the key question here is this: when is it safe to squash the history?

Here is my understanding (@davidbrochart please add more details or correct any inaccuracies here). When the first user opens a document, the full state of the document is loaded into the memory of the server. As that user makes changes, updates are 1) forwarded to other users, 2) stored in the db, 3) applied to the in memory document on the server. The server is unable to distinguish between users closing the document or network disruptions (both of these appear as network disconnections). The database is used when a client disconnects and reconnects to get them any missing updates. When all clients have been disconnected for a certain amount of time (I think 1m be default) the state of the document is flushed from memory. The db is also used the next time a user opens the document after it is flushed from memory.

The failure mode of squashing history is that we would squash updates that are needed by a user. Let's imagine the following scenario:

N users have the document open and are editing it.
User 1 closes their laptop and goes to lunch at 12 noon local time. This is registered as a client disconnect.
While user 1 is at lunch, other users edit the document and generate new updates.
At 12:59, updates are squashed.
At 1:00, user 1 returns and opens their laptop. At this time, the updates needed by that user no longer exist and have been squashed.

A few questions:

What will actually happen in this case. Is everything smart enough to detect this and still reassemble the current state of the document?
To cover situations like this safely, do we need to run the history squashing during periods of long inactivity. We could determine this by waiting some (long) period of time after all clients disconnect.
Another way we could mitigate this risk is to do more granular history squashing. For example, rather than squashing all the history, we could squash the history into days, and only for days older than 1 week. That way, users would always have the full history for a week, and then daily beyond that.

davidbrochart commented 1 year ago

As that user makes changes, updates are 1) forwarded to other users, 2) stored in the db, 3) applied to the in memory document on the server.

That's correct, just mentioning that 3) is the cause of 1) and 2). We apply changes from the clients to the "server peer", and listen to changes made on this peer to broadcast them to clients and store them.

The database is used when a client disconnects and reconnects to get them any missing updates.

Yes, but only if no client is accessing the document for some time. It there is at least one client accessing the document, the missing updates will be sent to new clients from memory, as the in-memory document would still be present.

At 1:00, user 1 returns and opens their laptop. At this time, the updates needed by that user no longer exist and have been squashed.

Unless the document is still in memory, in which case user 1 will get the full history of changes. Otherwise, yes they will get a document with no history, but I think this is the desired behavior. The document is kept in memory if at least one user is accessing it, and for some time after all clients disconnect from this document.

If we agree on this, I think your questions don't apply anymore, right?

davidbrochart commented 1 year ago

At 1:00, user 1 returns and opens their laptop. At this time, the updates needed by that user no longer exist and have been squashed.

Oh I see your point. Let's assume that updates have been flushed from memory to disk, and squashed.

What will actually happen in this case. Is everything smart enough to detect this and still reassemble the current state of the document?

No, I don't think it will give user 1 their missing updates. I will give them a big update consisting of the whole document, leading to content duplication.

To cover situations like this safely, do we need to run the history squashing during periods of long inactivity. We could determine this by waiting some (long) period of time after all clients disconnect.

Yes, the point being making sure that all clients closed their document in the front-end, i.e. willingly disconnected.

Another way we could mitigate this risk is to do more granular history squashing. For example, rather than squashing all the history, we could squash the history into days, and only for days older than 1 week. That way, users would always have the full history for a week, and then daily beyond that.

True, although this will never eliminate the risk.

Actually the real solution would be to do what Kevin once suggested: introduce the notion of document session. If the user reconnects with a document session, and that document session is not anymore in the back-end, invalidate the shared document in the front-end, and recreate it from the back-end.

dlqqq commented 1 year ago

@davidbrochart Revision summary:

Fixed test flakiness by awaiting the squash task directly in tests rather than doing await asyncio.sleep(ystore.document_ttl + 0.1)
Added test for behavior with a simultaneous write occurring at document TTL
- Not sure how useful this is, given that this tests the execution order of tasks scheduled to run at the same tick of the event loop, which is likely undefined behavior in Python. Right now, it appears that an update written at the document TTL is included in the squash.
Squash updates on init by doing the same timestamp query we were doing before

fperez commented 1 year ago

Thanks for tagging this issue in #14031 @davidbrochart - I am not familiar with all the details, but it seems indeed that some very robust logic for invalidation and full document refresh in the frontend is necessary. Google Docs evidently does that on occasion - I've had it tell me that my document was stale and forced a reload, after refusing any more edits into a doc.

I'm spitballing here, not knowing the implementation details, but it seems to me that any squashing should be done "behind a barrier" - a point in the history that defines a fixed state that updates can't traverse with any granularity. Anyone needing to sync state across that barrier simply needs to re-sync to the state at the barrier, and then get the updates after that.

I'll come to the team meeting tomorrow and hopefully we can discuss this a bit. I was willing to dogfood RTC in production for a bit, but at this point I may need to pull the plug unless we find a viable solution path quickly, as I can't really tolerate any data loss/damage in that environment. It would be a shame to lose the opportunity of this environment for real-world feedback on various aspects of UX and performance, so 🤞 that we can figure this out :)

y-crdt / ypy-websocket

squash doc history in separate task #58

Open concerns