Closed vdurante closed 4 months ago
I think the problem is somewhere here: https://github.com/y-crdt/ydotnet/blob/main/YDotNet.Server/Internal/DelayedWriter.cs
I have created a new pull request with a simplify write procedure. This time everything is just protected by locks. I wanted to be clever before and I have found a lot of edge cases where things could potentially fail.
@SebastianStehle awesome, thanks! I will test this also with my architecture since I also noticed some other issues caused by short-lived documents. Thanks for the help!
@SebastianStehle thanks for all the help and support you have been given me.
I was able to reproduce one last issue. I noticed the code used for getting the bytes of the document that is being disposed returns the wrong amount of bytes, causing the code to write to the storage an empty document.
This usually gets overwritten by the write operation that happens when doc.ObserveUpdatesV1 event triggers, but if the user idles for a while with the editor open and the documents gets disposed, it causes the storage to have a broken state.
I noticed that you extract the bytes from the document in 2 different ways:
I am not sure if this solves the issue, but I tried changing the dispose flow to open a read transaction and it seems to write the correct bytes now.
I ams till trying to find a working reproduction and will let you know.
Awesome. Perhaps bad code duplication :) ... Because we do not write every change immediately, the Observe-call just pings the writer, so it can get the data, but this does not work on a dispose. I just think that a dispose should also force a write.
Yeah, the read transaction didn't make a difference.
I don't have a precise reproduction, but this is how I was able to reproduce:
I tried to refactor and make all write calls sync, but that didn't solve either.
Perhaps it is an issue of the cache and the code uses the wrong document. This part might not be stable.
I tried making all calls sync, so the write from the event doesn't clash with the write from the dispose, but didn't work. I will keep investigating. Thanks for the help.
I have another simplification...more cases found. I will test it and push it soon.
Hey everyone, I wanted to share a bit about my investigation and possible workaround I found to this issue. Maybe someone else is facing a similar issue and looking for a workaround.
As it seems, for some reason, the cycle of in-memory -> write to external storage -> reload in-memory -> external storage -> ... seem to cause an eventual corruption of the doc, which then leads to writing an empty document to the external storage.
In my case, since the document may be modified by multiple sources, this is quite problematic and in my scenario it fails to recover successfully. Once the document gets corrupted it writes 2 bytes to the storage, which then gets loaded again by the backend or other applications, and basically triggers a cascading failure that ends in data loss.
Wanted to share 2 possible workarounds I found, which I am investigating and seem quite promising.
My initial attempt was to fully refactor the DocumentWriter code to make it synchronous, but for some reason that doesn't work either. I spent a few days trying to figure out why, but couldn't find the reason.
What I did:
Implemented a a reload function on the backend side. This block of code is responsible for loading the data from the external storage if the document doesn't exist in-memory. I basically implemented a similar function that, even if the document does exist in-memory, fetches the byte array from the external storage and runs a sync operation to the doc. This basically allows any external changes to the doc that my backend code is not aware of to be written to the in-memory doc. I then increased the expiration time of the in-memory doc to be similar to the external storage one. So far in my test this is working quite well. It didn't write any empty document to the external storage. This solution is more consuming of memory tho, since the in-memory docs get an extended lifetime.
Implemented a re-sync on the front-end application every few seconds. This is supported by the default implementation y-websockets, so it is very straightforward. The re-sync interval basically triggers a full sync from the front-end every few seconds. This forces the backend to get updated and restore a valid state in the document even if the external storage doc is corrupted.
I use the term corrupted, but in reality I believe it simply writes an empty document. It is 2 bytes of empty data (not sure what is the binary representation of it).
I will still keep investigating the actual root cause, but unfortunately my limited knowledge of ydotnet and yjs is making it hard to figure this out.
Thanks a lot. At the moment I am not sure what to do. I think we have to find out the root cause first.
have you tried the new version? DocumentWriter does not exist anymore and the code is much simpler.
Yeah, I believe you meant DelayedWriter doesn't exist anymore, right? I am using your PR code which replaced it with DocumentWriter.
I think we have to find out the root cause first.
That is what I am investing time on right now. Trying to figure out at which point things break. Quite weird honestly
Before I tried to make everything super efficient, but I found so many edge cases. The new approach tries to be as simple as possible.
Ok, I believe I found a working reproduction. Apparently it is related to deleting some character and then disposing the document to the Storage.
https://github.com/vdurante/ydotnet/tree/new-write-demo
If we dispose the document after removing a character, it is unable to reload the document correctly.
You reference a file from your user folder there. I need this file.
@SebastianStehle I reproduced entirely on top of your code. The only change I made on my branch was to call a cleanup on every call to simulate some websocket disconnecting and triggering a cleanup that would dispose the document. Besides that I added some logging.
If you want to repro on your side:
You reference a file from your user folder there. I need this file.
Oops. I added to the repo. It is in the Demo folder. Demo/Demo. I can double check if it is missing
https://github.com/vdurante/ydotnet/blob/new-write-demo/Demo/Demo
Oh, but the path is wrong ;)
@SebastianStehle my bad :() will fix as soon as I get to work in 5mins
I think I don't see your behavior:
Block 7
This block adds a new character 'a' to the editor
Sending 21 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.LoadCoreAsync - Restored 53 bytes from 53 bytes available in Storage
DocumentContainer.GetStateAsync - Getting state of 54 bytes to write to storage due to ObserveUpdatesV1 being called.
DocumentWriter..ctor - Writing 54 bytes to Storage.
See above that only 2 bytes are restored from the Storage.
@SebastianStehle are you using Redis? Can you send me the entire logs?
====================
Block 1
This block is the sync1 message from the initial sync
Sending 4 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming sync1
DocumentContainer.LoadCoreAsync - Created new doc since nothing was found in storage.
====================
Block 2
This block is the awareness from the initial sync
Sending 12 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming awareness
====================
Block 3
This block is the sync2 message from the initial sync
Sending 5 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming sync2
====================
Block 4
This block adds characters 'a' to the editor
This block is delayed to ensure document is written to Storage by ObserveUpdatesV1.
Sending 56 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.GetStateAsync - Getting state of 53 bytes to write to storage due to ObserveUpdatesV1 being called.
DocumentWriter..ctor - Writing 53 bytes to Storage.
====================
Block 5
This block deletes character 'a' from the editor. This looks odd. It deletes 1 character from the editor, but the written bytes is way smaller than the previous block.
This block is delayed to ensure document is written to Storage by ObserveUpdatesV1.
Sending 13 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.GetStateAsync - Getting state of 35 bytes to write to storage due to ObserveUpdatesV1 being called.
DocumentWriter..ctor - Writing 35 bytes to Storage.
====================
Block 6
This block is an awareness message
This block is delayed to ensure document is disposed to Storage.
Sending 12 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming awareness
DocumentContainer.DisposeAsync - Writing 35 bytes to Storage. Snapshot contains 16 bytes
DocumentWriter..ctor - Writing 35 bytes to Storage.
====================
Block 7
This block adds a new character 'a' to the editor
Sending 21 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.LoadCoreAsync - Restored 2 bytes from 35 bytes available in Storage
See above that only 2 bytes are restored from the Storage.
No, in memory.
Redis:
====================
Block 1
This block is the sync1 message from the initial sync
Sending 4 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming sync1
DocumentContainer.LoadCoreAsync - Created new doc since nothing was found in storage.
====================
Block 2
This block is the awareness from the initial sync
Sending 12 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming awareness
====================
Block 3
This block is the sync2 message from the initial sync
Sending 5 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming sync2
====================
Block 4
This block adds characters 'a' to the editor
This block is delayed to ensure document is written to Storage by ObserveUpdatesV1.
Sending 56 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.GetStateAsync - Getting state of 53 bytes to write to storage due to ObserveUpdatesV1 being called.
DocumentWriter..ctor - Writing 53 bytes to Storage.
====================
Block 6
This block is an awareness message
This block is delayed to ensure document is disposed to Storage.
Sending 12 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming awareness
DocumentContainer.DisposeAsync - Writing 53 bytes to Storage. Snapshot contains 8 bytes
DocumentWriter..ctor - Writing 53 bytes to Storage.
====================
Block 7
This block adds a new character 'a' to the editor
Sending 21 bytes through WebSocket
YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.LoadCoreAsync - Restored 53 bytes from 53 bytes available in Storage
DocumentContainer.GetStateAsync - Getting state of 54 bytes to write to storage due to ObserveUpdatesV1 being called.
DocumentWriter..ctor - Writing 54 bytes to Storage.
@SebastianStehle can you re-comment this block so it runs Block 5?
https://github.com/vdurante/ydotnet/blob/new-write-demo/Demo/Program.cs#L114-L118
And keep this at 2 seconds:
https://github.com/vdurante/ydotnet/blob/new-write-demo/Demo/Program.cs#L25-L30
Now I see it ... I think it would be super awesome to make a reproducible with just a console app. Perhaps it is a general issue not related to the Server project
I could try that. Do you have any example I could try to replicate?
No, sorry...
I have played around with the code. Perhaps we have just used the wrong methods to generate the updates all the time. I have changed everything to use EncodeStateFromSnapshotV2
and I do not have the issue anymore. I can push it to your repo if you invite me.
@SebastianStehle thank you. I will try that out and open a PR to ydotnet if that works
Added you to the repo.
Pushed
I have pushed again with the correct method according to the yrs main contributor: https://github.com/y-crdt/y-crdt/issues/400
Danke schön. Will check tomorrow and let you know! Thanks once again for all the support!
@SebastianStehle opened a PR into your PR.
Sorry, I have pushed something again. Hopefully the last change, but it is simpler now.
@SebastianStehle thanks! I won't have time for it till next week probably because I am flying home from Germany now.
Hey, I noticed some odd behavior when trying to work with Redis storage (I believe this also impacts InMemoryStorage).
During the first 10 seconds, while the cache is valid, I can see several writes to Redis. But then, once the cache gets invalidated and we initialize a new Doc, it stops working.
You can find a working example here: https://github.com/vdurante/ydotnet/tree/redis-write-bug
Using your demo project, this is what I did:
Set storage type in appsettings.Development.json:
Modify initialization of redis storage:
Add this to Program.cs:
Added the following to StoreDocAsync in RedisDocumentStorage.cs: