ydb-platform / nbs

Network Block & File Store
Apache License 2.0

[Filestore] Invalid `CollectGarbage` requests to blobstorage. #652

Status: Closed (debnatkh closed 1 month ago)

debnatkh commented 8 months ago

Errors like the following started causing IndexTablet to restart:

NFS_SERVER[578573]: 2024-03-05T15:06:36.674043Z :NFS_TABLET ERROR: [f:***][t:***] CollectGarbage failed: SEVERITY_ERROR | FACILITY_KIKIMR | 1 Processed status# ERROR from VDisk# [8200021d:2:0:0:0] incarnationGuid# empty QuorumTracker status# ERROR

It looks like the CollectGarbage requests sent by TIndexTablet do not guarantee an increasing order of (gen, step).

We started seeing this error much more often after enabling vhost-side reads on the whole cluster.
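To make the ordering requirement concrete, here is a minimal sketch of a check for a sequence of (gen, step) pairs that goes backwards. It is a toy model, not the blobstorage implementation: the helper names `split_commit_id` and `first_regression` are hypothetical, and the 32/32-bit split of a commit id into generation and step is an assumption.

```python
from typing import Iterable, Optional, Tuple


def split_commit_id(commit_id: int) -> Tuple[int, int]:
    """Split a 64-bit commit id into (generation, step).

    Assumption: generation lives in the high 32 bits, step in the low 32 bits.
    """
    return commit_id >> 32, commit_id & 0xFFFFFFFF


def first_regression(barriers: Iterable[Tuple[int, int]]) -> Optional[int]:
    """Return the index of the first (gen, step) lower than its predecessor."""
    previous = None
    for index, barrier in enumerate(barriers):
        # Tuples compare lexicographically: generation first, then step.
        if previous is not None and barrier < previous:
            return index
        previous = barrier
    return None


# Three requests in generation 7 with steps 41, 44, 42: the last one goes backwards.
requests = [split_commit_id((7 << 32) | step) for step in (41, 44, 42)]
print(first_regression(requests))  # 2
```

Whether a repeated (gen, step) is acceptable is not stated in this issue, so the sketch only flags a strict decrease.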

debnatkh commented 7 months ago
  1. CollectGarbage is executed with commitId = GetCurrentCommitId() = 42.
  2. Cleanup is started. It acquires a collect barrier with commitId = 42:

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/tablet_actor_cleanup.cpp#L68-L72

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/tablet_actor_cleanup.cpp#L100

  3. Before the collect barrier is released on completion of the Cleanup transaction, another CollectGarbage is executed. CollectCommitId is selected as follows:

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/tablet_state_data.cpp#L935-L939

  4. GarbageQueue.GetCollectCommitId():

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/model/garbage_queue.cpp#L224-L232

  5. There is an unreleased collect barrier with commitId = 42, thus CollectCommitId will be equal to 41, which is less than LastCollectCommitId (42).

Generating a new CommitId when Cleanup is executed would solve the issue.
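The sequence above, and the effect of the proposed fix, can be sketched with a small model. This is an illustration under assumptions, not the tablet code: `get_collect_commit_id` and `run` are hypothetical names, and the selection rule (the smaller of the current commit id and the smallest barrier minus one) is inferred from the description in this comment.

```python
from typing import Set, Tuple


def get_collect_commit_id(barriers: Set[int], current_commit_id: int) -> int:
    """Assumed selection rule: stay strictly below the smallest unreleased barrier."""
    if barriers:
        return min(current_commit_id, min(barriers) - 1)
    return current_commit_id


def run(cleanup_generates_commit_id: bool) -> Tuple[int, int]:
    """Return the collect commit ids of two consecutive CollectGarbage runs."""
    current = 42
    barriers: Set[int] = set()

    # Step 1: the first CollectGarbage runs while no barrier is held.
    first = get_collect_commit_id(barriers, current)

    # Step 2: Cleanup acquires a collect barrier.
    if cleanup_generates_commit_id:
        current += 1  # proposed fix: Cleanup takes a fresh commit id
    barriers.add(current)

    # Steps 3-5: another CollectGarbage runs before the barrier is released.
    second = get_collect_commit_id(barriers, current)
    return first, second


print(run(cleanup_generates_commit_id=False))  # (42, 41): the sequence decreases
print(run(cleanup_generates_commit_id=True))   # (42, 42): no decrease
```

In this model the barrier taken at a fresh commit id sits above the last collect commit id, so the next selection cannot fall below it.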

debnatkh commented 1 month ago

The main problem is that FlushBytes acquires a collect barrier that is less than the last collect commit id:


  1. Consider the following sequence of writes:

    Write(0,       256 KiB, 'a') -> Blob(commitId = 41)
    Write(256 KiB, 256 KiB, 'b') -> Blob(commitId = 42)
    Write(512 KiB, 1,       'f') -> FreshBytes(commitId = 43)
    Write(0,       256 KiB, 'c') -> Blob(commitId = 44)

    This will lead to the following file layout: [ccccccc][bbbbbbb][f]

  2. After CollectGarbage is executed, all three new blobs get a KeepFlag and the last collect commit id is 44.

    CommitId:   41       42         43        44
              Blob(a)  Blob(b)  FreshBytes  Blob(c)
                                               |
                                      LastCollectCommitId

  3. After the Cleanup operation is executed, the first blob will be marked as garbage.

  4. Now execute the FlushBytes operation. It will acquire a collect barrier equal to the minimal commitId associated with FreshBytes: https://github.com/ydb-platform/nbs/blob/836a5162f99b998582dc4b476a212619954bfa22/cloud/filestore/libs/storage/tablet/tablet_actor_flush_bytes.cpp#L669-L671

After this acquisition there will be one barrier, equal to 43:

    CommitId:   41       42         43        44
              Blob(a)  Blob(b)  FreshBytes  Blob(c)
                                    |          |
                                Barrier   LastCollectCommitId

  5. When the next CollectGarbage operation is executed, it will choose 42 as the collectCommitId:

https://github.com/ydb-platform/nbs/blob/836a5162f99b998582dc4b476a212619954bfa22/cloud/filestore/libs/storage/tablet/model/garbage_queue.cpp#L227-L228

After that, a CollectGarbage request with one new garbage blob will be sent, which makes the collectCommitId sequence decrease: 42 after 44.
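The FlushBytes scenario can be replayed with a toy model that records every collect commit id it would send. This is a sketch under assumptions: the class `CollectModel` and its methods are hypothetical, the commit ids are the ones from the diagrams above, and the selection rule (smallest barrier minus one, capped by the current commit id) is inferred from the description rather than copied from the tablet code.

```python
from typing import List


class CollectModel:
    """Toy model of collect barriers and the history of collect commit ids."""

    def __init__(self) -> None:
        self.barriers: List[int] = []
        self.history: List[int] = []  # collect commit ids in the order they are sent

    def acquire_barrier(self, commit_id: int) -> None:
        self.barriers.append(commit_id)

    def release_barrier(self, commit_id: int) -> None:
        self.barriers.remove(commit_id)

    def collect_garbage(self, current_commit_id: int) -> int:
        collect = current_commit_id
        if self.barriers:
            # Assumed rule: never collect at or above the smallest barrier.
            collect = min(collect, min(self.barriers) - 1)
        self.history.append(collect)
        return collect

    def is_monotonic(self) -> bool:
        return all(a <= b for a, b in zip(self.history, self.history[1:]))


# Commit ids from the example: Blob(a)=41, Blob(b)=42, FreshBytes=43, Blob(c)=44.
fresh_bytes_commit_ids = [43]

model = CollectModel()
model.collect_garbage(current_commit_id=44)          # step 2: LastCollectCommitId = 44
model.acquire_barrier(min(fresh_bytes_commit_ids))   # step 4: FlushBytes barrier = 43
model.collect_garbage(current_commit_id=44)          # step 5: picks 43 - 1 = 42
print(model.history, model.is_monotonic())           # [44, 42] False
```

The model makes the root cause visible: the barrier is taken at a commit id that an earlier CollectGarbage has already passed.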

debnatkh commented 1 month ago

To reproduce the issue, one can use fio:

fio --name=random-write-test \
    --ioengine=libaio \
    --rw=randwrite \
    --bs=512-4k \
    --size=1G \
    --direct=1 \
    --iodepth=16 \
    --numjobs=4 \
    --offset_increment=512 \
    --do_verify=0 \
    --time_based \
    --runtime=$((120*60*60))

AppCriticalEvents/CollectGarbageError errors after starting the aforementioned fio job:

image

AppCriticalEvents/CollectGarbageError errors after deploying fix #1919:

image

debnatkh commented 1 month ago

AppCriticalEvents/CollectGarbageError after release to our production cluster:

image