[Filestore] Invalid `CollectGarbage` requests to blobstorage.

debnatkh commented 8 months ago

Errors like the following started causing IndexTablet to restart:

NFS_SERVER[578573]: 2024-03-05T15:06:36.674043Z :NFS_TABLET ERROR: [f:***][t:***] CollectGarbage failed: SEVERITY_ERROR | FACILITY_KIKIMR | 1 Processed status# ERROR from VDisk# [8200021d:2:0:0:0] incarnationGuid# empty QuorumTracker status# ERROR

It looks like the CollectGarbage requests sent by TIndexTablet do not guarantee increasing order of (gen, step).
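
The ordering requirement can be illustrated with a toy model. This is only a sketch, not blobstorage code: `BarrierTracker` and `collect` are hypothetical names, and the rule that a request is rejected when its (gen, step) pair is strictly smaller than the last accepted one (with equal pairs accepted) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BarrierTracker:
    """Toy stand-in for the storage side: remembers the last accepted barrier."""

    last: tuple = (0, 0)  # (gen, step) of the last accepted request

    def collect(self, gen: int, step: int) -> bool:
        """Accept the request only if (gen, step) does not move backwards."""
        if (gen, step) < self.last:
            return False  # assumed behaviour: a decreasing barrier is rejected
        self.last = (gen, step)
        return True


tracker = BarrierTracker()
# The second request goes backwards (step 42 after step 44) and is rejected.
results = [tracker.collect(1, 44), tracker.collect(1, 42), tracker.collect(2, 1)]
print(results)
```

Tuples compare lexicographically, so a higher generation always wins regardless of the step.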

We started seeing this error much more often after enabling vhost-side reads on the whole cluster.

debnatkh commented 7 months ago

CollectGarbage is executed with commitId = GetCurrentCommitId() = 42
Cleanup is started. It acquires a collect barrier with commitId = 42:

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/tablet_actor_cleanup.cpp#L68-L72

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/tablet_actor_cleanup.cpp#L100

Before the collect barrier is released on completion of the Cleanup transaction, another CollectGarbage is executed. CollectCommitId is selected as follows:

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/tablet_state_data.cpp#L935-L939

GarbageQueue.GetCollectCommitId():

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/model/garbage_queue.cpp#L224-L232

There is an unreleased collect barrier with commitId = 42, so CollectCommitId will be equal to 41, which is less than LastCollectCommitId (42).

Generating a new CommitId when Cleanup is executed will solve the issue.
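
The race and the proposed fix can be replayed with a small model. This is a sketch under assumptions, not the tablet code: `GarbageQueueModel` and `run` are hypothetical names, and the rule "collect commit id = smallest held barrier minus one, capped by the current commit id" is inferred from the description above rather than copied from `garbage_queue.cpp`.

```python
class GarbageQueueModel:
    """Hypothetical, simplified model of the collect-barrier bookkeeping."""

    def __init__(self):
        self.barriers = []  # commit ids of currently held collect barriers

    def acquire(self, commit_id):
        self.barriers.append(commit_id)

    def collect_commit_id(self, current_commit_id):
        # Assumed rule: never collect at or above the smallest held barrier.
        if self.barriers:
            return min(current_commit_id, min(self.barriers) - 1)
        return current_commit_id


def run(cleanup_generates_new_commit_id):
    queue = GarbageQueueModel()
    current = 42
    last_collect = queue.collect_commit_id(current)  # first CollectGarbage

    if cleanup_generates_new_commit_id:
        current += 1  # proposed fix: Cleanup uses a fresh commit id
    queue.acquire(current)  # Cleanup takes its collect barrier

    next_collect = queue.collect_commit_id(current)  # second CollectGarbage
    return last_collect, next_collect


print(run(cleanup_generates_new_commit_id=False))  # collect id goes backwards
print(run(cleanup_generates_new_commit_id=True))   # collect id does not decrease
```

In this model the barrier taken at a fresh commit id sits strictly above the last collect commit id, so the next CollectGarbage cannot go below it.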

debnatkh commented 1 month ago

The main problem is that FlushBytes acquires a collect barrier that is less than the last collect commit id.

Consider the following sequence of writes:

Write(0,       256 KiB, 'a') -> Blob(commitId = 41)
Write(256 KiB, 256 KiB, 'b') -> Blob(commitId = 42)
Write(512 KiB, 1,       'f') -> FreshBytes(commitId = 43)
Write(0,       256 KiB, 'c') -> Blob(commitId = 44)

This will lead to the following file layout: [ccccccc][bbbbbbb][f]
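
The layout can be checked by replaying the writes against an in-memory buffer. This is a minimal sketch that only models offsets and fill bytes; it says nothing about how the tablet stores blobs or fresh bytes.

```python
KIB = 1024

# (offset, length, fill byte) for each write, in the order they were issued
writes = [
    (0, 256 * KIB, "a"),
    (256 * KIB, 256 * KIB, "b"),
    (512 * KIB, 1, "f"),
    (0, 256 * KIB, "c"),
]

size = max(offset + length for offset, length, _ in writes)
data = bytearray(size)
for offset, length, fill in writes:
    data[offset:offset + length] = fill.encode() * length  # later writes win

# Collapse the file into runs of identical bytes: [fill, start offset, length]
runs = []
for pos, byte in enumerate(data):
    if runs and runs[-1][0] == chr(byte):
        runs[-1][2] += 1
    else:
        runs.append([chr(byte), pos, 1])

for fill, start, length in runs:
    print(fill, start, length)
```

The last write fully overwrites the range of the first one, which is why no 'a' bytes remain visible and the first blob later becomes garbage.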

After CollectGarbage is executed, all three new blobs will get a KeepFlag and the last collect commit id will be equal to 44.

CommitId:     41      42        43       44
            Blob(a) Blob(b) FreshBytes Blob(c)
                                          |
                                 LastCollectCommitId

After the Cleanup operation is executed, the first blob will be marked as garbage.
Now let us execute the FlushBytes operation. It will acquire a collect barrier equal to the minimal commitId associated with FreshBytes: https://github.com/ydb-platform/nbs/blob/836a5162f99b998582dc4b476a212619954bfa22/cloud/filestore/libs/storage/tablet/tablet_actor_flush_bytes.cpp#L669-L671

After this acquisition there will be one barrier, equal to 43:

CommitId:     41      42        43       44
            Blob(a) Blob(b) FreshBytes Blob(c)
                                |         |
                             Barrier  LastCollectCommitId

When the next CollectGarbage operation is executed, it will choose 42 as the collectCommitId:

https://github.com/ydb-platform/nbs/blob/836a5162f99b998582dc4b476a212619954bfa22/cloud/filestore/libs/storage/tablet/model/garbage_queue.cpp#L227-L228

After that, a CollectGarbage request with one new garbage blob will be sent, leading to a decrease in the collectCommitId sequence: 42 after 44.
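
The same sequence can be replayed in a few lines, using the commit ids as drawn in the diagrams above. This is a sketch under assumptions: `collect_commit_id` is a hypothetical helper, and its rule (stay strictly below the smallest held barrier) is inferred from this description, not taken verbatim from the tablet code.

```python
def collect_commit_id(current_commit_id, barriers):
    """Assumed rule: stay strictly below the smallest held collect barrier."""
    if barriers:
        return min(current_commit_id, min(barriers) - 1)
    return current_commit_id


commit_ids = {"Blob(a)": 41, "Blob(b)": 42, "FreshBytes": 43, "Blob(c)": 44}
current = 44
history = []

# First CollectGarbage: no barriers are held, so it collects up to 44.
history.append(collect_commit_id(current, barriers=[]))

# FlushBytes takes a barrier at the smallest commit id among the fresh bytes.
barriers = [commit_ids["FreshBytes"]]

# Second CollectGarbage: the barrier at 43 pulls the collect commit id to 42.
history.append(collect_commit_id(current, barriers))

# True if any collect commit id is smaller than the one sent before it.
decreasing = any(b < a for a, b in zip(history, history[1:]))
print(history, decreasing)
```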

debnatkh commented 1 month ago

To reproduce the issue, one can use fio:

fio --name=random-write-test \
    --ioengine=libaio \
    --rw=randwrite \
    --bs=512-4k \
    --size=1G \
    --direct=1 \
    --iodepth=16 \
    --numjobs=4 \
    --offset_increment=512 \
    --do_verify=0 \
    --time_based \
    --runtime=$((120*60*60))

AppCriticalEvents/CollectGarbageError errors after starting the aforementioned fio:

AppCriticalEvents/CollectGarbageError errors after deploying fix #1919:

debnatkh commented 1 month ago

AppCriticalEvents/CollectGarbageError after release to our production cluster:

ydb-platform / nbs

[Filestore] Invalid `CollectGarbage` requests to blobstorage. #652