y-crdt / ydotnet

.NET bindings for yrs.
MIT License
35 stars 8 forks source link

YDotNet stops persisting to storage after 1st invalidation #89

Closed vdurante closed 4 months ago

vdurante commented 6 months ago

Hey, I noticed some odd behavior when trying to work with Redis storage (I believe this also impacts InMemoryStorage).

During the first 10 seconds, while the cache is valid, I can see several writes to Redis. But then, once the cache gets invalidated and we initialize a new Doc, it stops working.

You can find a working example here: https://github.com/vdurante/ydotnet/tree/redis-write-bug

Using your demo project, this is what I did:

Set storage type in appsettings.Development.json:

"Type": "Redis",

Modify initialization of redis storage:

yDotNet.AddRedisStorage(
    opts =>
    {
        opts.Expiration = _ => TimeSpan.FromMinutes(5);
    });
yDotNet.AddRedis(options =>
{
    options.Configuration =
        ConfigurationOptions.Parse(
            builder.Configuration["Storage:Redis:ConnectionString"]!); // this value is wrong in the demo project. I had to rename to Storage:Redis:ConnectionString
});

Add this to Program.cs:

builder.Services.PostConfigure<DocumentManagerOptions>(
  dmo =>
  {
      dmo.CacheDuration = TimeSpan.FromSeconds(10);
  });

Added the following to StoreDocAsync in RedisDocumentStorage.cs:

Console.WriteLine("size: " + doc.Length);
SebastianStehle commented 6 months ago

I think the problem is somewhere here: https://github.com/y-crdt/ydotnet/blob/main/YDotNet.Server/Internal/DelayedWriter.cs

SebastianStehle commented 6 months ago

I have created a new pull request with a simplify write procedure. This time everything is just protected by locks. I wanted to be clever before and I have found a lot of edge cases where things could potentially fail.

vdurante commented 6 months ago

@SebastianStehle awesome, thanks! I will test this also with my architecture since I also noticed some other issues caused by short-lived documents. Thanks for the help!

vdurante commented 6 months ago

@SebastianStehle thanks for all the help and support you have been given me.

I was able to reproduce one last issue. I noticed the code used for getting the bytes of the document that is being disposed returns the wrong amount of bytes, causing the code to write to the storage an empty document.

This usually gets overwritten by the write operation that happens when doc.ObserveUpdatesV1 event triggers, but if the user idles for a while with the editor open and the documents gets disposed, it causes the storage to have a broken state.

I noticed that you extract the bytes from the document in 2 different ways:

  1. ObserveUpdatesV1 event -> open a read transaction, then write the bytes
  2. on dispose of the document -> open a write transaction, then write the bytes

I am not sure if this solves the issue, but I tried changing the dispose flow to open a read transaction and it seems to write the correct bytes now.

I ams till trying to find a working reproduction and will let you know.

SebastianStehle commented 6 months ago

Awesome. Perhaps bad code duplication :) ... Because we do not write every change immediately, the Observe-call just pings the writer, so it can get the data, but this does not work on a dispose. I just think that a dispose should also force a write.

vdurante commented 6 months ago

Yeah, the read transaction didn't make a difference.

I don't have a precise reproduction, but this is how I was able to reproduce:

  1. Write something to the document -> gets written correctly to the storage (ObserveUpdatesV1)
  2. Let it timeout -> gets written correctly to the storage (dispose)
  3. Write something to the document -> gets written correctly to the storage (ObserveUpdatesV1)
  4. Let it timeout -> gets written correctly to the storage (dispose)
  5. Write something to the document -> apparently ObserveUpdatesV1 stops triggering
  6. Let it timeout -> writes an empty document to the storage (dispose)

I tried to refactor and make all write calls sync, but that didn't solve either.

SebastianStehle commented 6 months ago

Perhaps it is an issue of the cache and the code uses the wrong document. This part might not be stable.

vdurante commented 6 months ago

I tried making all calls sync, so the write from the event doesn't clash with the write from the dispose, but didn't work. I will keep investigating. Thanks for the help.

SebastianStehle commented 6 months ago

I have another simplification...more cases found. I will test it and push it soon.

vdurante commented 6 months ago

Hey everyone, I wanted to share a bit about my investigation and possible workaround I found to this issue. Maybe someone else is facing a similar issue and looking for a workaround.

As it seems, for some reason, the cycle of in-memory -> write to external storage -> reload in-memory -> external storage -> ... seem to cause an eventual corruption of the doc, which then leads to writing an empty document to the external storage.

In my case, since the document may be modified by multiple sources, this is quite problematic and in my scenario it fails to recover successfully. Once the document gets corrupted it writes 2 bytes to the storage, which then gets loaded again by the backend or other applications, and basically triggers a cascading failure that ends in data loss.

Wanted to share 2 possible workarounds I found, which I am investigating and seem quite promising.

My initial attempt was to fully refactor the DocumentWriter code to make it synchronous, but for some reason that doesn't work either. I spent a few days trying to figure out why, but couldn't find the reason.

What I did:

  1. Implemented a a reload function on the backend side. This block of code is responsible for loading the data from the external storage if the document doesn't exist in-memory. I basically implemented a similar function that, even if the document does exist in-memory, fetches the byte array from the external storage and runs a sync operation to the doc. This basically allows any external changes to the doc that my backend code is not aware of to be written to the in-memory doc. I then increased the expiration time of the in-memory doc to be similar to the external storage one. So far in my test this is working quite well. It didn't write any empty document to the external storage. This solution is more consuming of memory tho, since the in-memory docs get an extended lifetime.

  2. Implemented a re-sync on the front-end application every few seconds. This is supported by the default implementation y-websockets, so it is very straightforward. The re-sync interval basically triggers a full sync from the front-end every few seconds. This forces the backend to get updated and restore a valid state in the document even if the external storage doc is corrupted.

I use the term corrupted, but in reality I believe it simply writes an empty document. It is 2 bytes of empty data (not sure what is the binary representation of it).

I will still keep investigating the actual root cause, but unfortunately my limited knowledge of ydotnet and yjs is making it hard to figure this out.

SebastianStehle commented 6 months ago

Thanks a lot. At the moment I am not sure what to do. I think we have to find out the root cause first.

have you tried the new version? DocumentWriter does not exist anymore and the code is much simpler.

vdurante commented 6 months ago

Yeah, I believe you meant DelayedWriter doesn't exist anymore, right? I am using your PR code which replaced it with DocumentWriter.

I think we have to find out the root cause first.

That is what I am investing time on right now. Trying to figure out at which point things break. Quite weird honestly

SebastianStehle commented 6 months ago

Before I tried to make everything super efficient, but I found so many edge cases. The new approach tries to be as simple as possible.

vdurante commented 6 months ago

Ok, I believe I found a working reproduction. Apparently it is related to deleting some character and then disposing the document to the Storage.

https://github.com/vdurante/ydotnet/tree/new-write-demo

If we dispose the document after removing a character, it is unable to reload the document correctly.

SebastianStehle commented 6 months ago

You reference a file from your user folder there. I need this file.

vdurante commented 6 months ago

@SebastianStehle I reproduced entirely on top of your code. The only change I made on my branch was to call a cleanup on every call to simulate some websocket disconnecting and triggering a cleanup that would dispose the document. Besides that I added some logging.

If you want to repro on your side:

  1. Write something on the editor
  2. Delete some characters
  3. Make sure a dispose is triggered so the doc is written to the Storage
  4. Do something so the Doc is loaded from Storage -> this is where you can check the issue. It loads a 2 bytes document only
  5. If the doc gets disposed to Storage again it will write 2 bytes to storage
  6. From this point on it is corrupted and can't recover
SebastianStehle commented 6 months ago

You reference a file from your user folder there. I need this file.

vdurante commented 6 months ago

Oops. I added to the repo. It is in the Demo folder. Demo/Demo. I can double check if it is missing

https://github.com/vdurante/ydotnet/blob/new-write-demo/Demo/Demo

SebastianStehle commented 6 months ago

Oh, but the path is wrong ;)

vdurante commented 6 months ago

@SebastianStehle my bad :() will fix as soon as I get to work in 5mins

SebastianStehle commented 6 months ago

I think I don't see your behavior:

Block 7

This block adds a new character 'a' to the editor

Sending 21 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.LoadCoreAsync - Restored 53 bytes from 53 bytes available in Storage
DocumentContainer.GetStateAsync - Getting state of 54 bytes to write to storage due to ObserveUpdatesV1 being called.
DocumentWriter..ctor - Writing 54 bytes to Storage.

See above that only 2 bytes are restored from the Storage.
vdurante commented 6 months ago

@SebastianStehle are you using Redis? Can you send me the entire logs?

====================

Block 1

This block is the sync1 message from the initial sync

Sending 4 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming sync1
DocumentContainer.LoadCoreAsync - Created new doc since nothing was found in storage.

====================

Block 2

This block is the awareness from the initial sync

Sending 12 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming awareness

====================

Block 3

This block is the sync2 message from the initial sync

Sending 5 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming sync2

====================

Block 4

This block adds characters 'a' to the editor

This block is delayed to ensure document is written to Storage by ObserveUpdatesV1.

Sending 56 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.GetStateAsync - Getting state of 53 bytes to write to storage due to ObserveUpdatesV1 being called.
DocumentWriter..ctor - Writing 53 bytes to Storage.

====================

Block 5

This block deletes character 'a' from the editor. This looks odd. It deletes 1 character from the editor, but the written bytes is way smaller than the previous block.

This block is delayed to ensure document is written to Storage by ObserveUpdatesV1.

Sending 13 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.GetStateAsync - Getting state of 35 bytes to write to storage due to ObserveUpdatesV1 being called.
DocumentWriter..ctor - Writing 35 bytes to Storage.

====================

Block 6

This block is an awareness message

This block is delayed to ensure document is disposed to Storage.

Sending 12 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming awareness
DocumentContainer.DisposeAsync - Writing 35 bytes to Storage. Snapshot contains 16 bytes
DocumentWriter..ctor - Writing 35 bytes to Storage.

====================

Block 7

This block adds a new character 'a' to the editor

Sending 21 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.LoadCoreAsync - Restored 2 bytes from 35 bytes available in Storage

See above that only 2 bytes are restored from the Storage.
SebastianStehle commented 6 months ago

No, in memory.

Redis:


====================

Block 1

This block is the sync1 message from the initial sync

Sending 4 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming sync1
DocumentContainer.LoadCoreAsync - Created new doc since nothing was found in storage.

====================

Block 2

This block is the awareness from the initial sync

Sending 12 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming awareness

====================

Block 3

This block is the sync2 message from the initial sync

Sending 5 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming sync2

====================

Block 4

This block adds characters 'a' to the editor

This block is delayed to ensure document is written to Storage by ObserveUpdatesV1.

Sending 56 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.GetStateAsync - Getting state of 53 bytes to write to storage due to ObserveUpdatesV1 being called.
DocumentWriter..ctor - Writing 53 bytes to Storage.

====================

Block 6

This block is an awareness message

This block is delayed to ensure document is disposed to Storage.

Sending 12 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming awareness
DocumentContainer.DisposeAsync - Writing 53 bytes to Storage. Snapshot contains 8 bytes
DocumentWriter..ctor - Writing 53 bytes to Storage.

====================

Block 7

This block adds a new character 'a' to the editor

Sending 21 bytes through WebSocket

YDotNetSocketMiddleware.InvokeAsync - Incoming update
DocumentContainer.LoadCoreAsync - Restored 53 bytes from 53 bytes available in Storage
DocumentContainer.GetStateAsync - Getting state of 54 bytes to write to storage due to ObserveUpdatesV1 being called.
DocumentWriter..ctor - Writing 54 bytes to Storage.
vdurante commented 6 months ago

@SebastianStehle can you re-comment this block so it runs Block 5?

https://github.com/vdurante/ydotnet/blob/new-write-demo/Demo/Program.cs#L114-L118

And keep this at 2 seconds:

https://github.com/vdurante/ydotnet/blob/new-write-demo/Demo/Program.cs#L25-L30

SebastianStehle commented 6 months ago

Now I see it ... I think it would be super awesome to make a reproducible with just a console app. Perhaps it is a general issue not related to the Server project

vdurante commented 6 months ago

I could try that. Do you have any example I could try to replicate?

SebastianStehle commented 6 months ago

No, sorry...

SebastianStehle commented 6 months ago

I have played around with the code. Perhaps we have just used the wrong methods to generate the updates all the time. I have changed everything to use EncodeStateFromSnapshotV2 and I do not have the issue anymore. I can push it to your repo if you invite me.

vdurante commented 6 months ago

@SebastianStehle thank you. I will try that out and open a PR to ydotnet if that works

vdurante commented 6 months ago

Added you to the repo.

SebastianStehle commented 6 months ago

Pushed

SebastianStehle commented 6 months ago

I have pushed again with the correct method according to the yrs main contributor: https://github.com/y-crdt/y-crdt/issues/400

vdurante commented 6 months ago

Danke schön. Will check tomorrow and let you know! Thanks once again for all the support!

vdurante commented 6 months ago

@SebastianStehle opened a PR into your PR.

SebastianStehle commented 6 months ago

Sorry, I have pushed something again. Hopefully the last change, but it is simpler now.

vdurante commented 6 months ago

@SebastianStehle thanks! I won't have time for it till next week probably because I am flying home from Germany now.