ydb-platform / nbs

Network Block Store
Apache License 2.0

[Filestore] Invalid `CollectGarbage` requests to blobstorage. #652

Open debnatkh opened 4 months ago

debnatkh commented 4 months ago

Errors like the following started causing IndexTablet to restart:

NFS_SERVER[578573]: 2024-03-05T15:06:36.674043Z :NFS_TABLET ERROR: [f:***][t:***] CollectGarbage failed: SEVERITY_ERROR | FACILITY_KIKIMR | 1 Processed status# ERROR from VDisk# [8200021d:2:0:0:0] incarnationGuid# empty QuorumTracker status# ERROR

It looks like the CollectGarbage requests sent by TIndexTablet do not guarantee an increasing order of (gen, step).

Started seeing this error much more often after enabling vhost-side reads on the whole cluster

debnatkh commented 3 months ago
  1. CollectGarbage is executed with commitId = GetCurrentCommitId() = 42
  2. Cleanup is started. It acquires a collect barrier with commitId = 42:

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/tablet_actor_cleanup.cpp#L68-L72

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/tablet_actor_cleanup.cpp#L100

  3. Before the collect barrier is released on completing the Cleanup transaction, another CollectGarbage is executed. CollectCommitId is selected as follows:

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/tablet_state_data.cpp#L935-L939

  4. GarbageQueue.GetCollectCommitId():

https://github.com/ydb-platform/nbs/blob/0555c7db24e936c1d359ead9087d2e24b9229d84/cloud/filestore/libs/storage/tablet/model/garbage_queue.cpp#L224-L232

  5. There is an unreleased collect barrier with commitId = 42, thus CollectCommitId will be equal to 41, which is less than LastCollectCommitId (42)
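
The sequence above can be replayed with a small standalone model. This is only a sketch: `TModelTablet`, `ReplayRace` and the "oldest barrier minus one" rule are assumptions made for illustration from the description in this comment, not the actual TIndexTablet or GarbageQueue code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <set>

// Hypothetical stand-in for the tablet state; all names are illustrative.
struct TModelTablet
{
    static constexpr uint64_t InvalidCommitId =
        std::numeric_limits<uint64_t>::max();

    uint64_t CurrentCommitId = 42;
    uint64_t LastCollectCommitId = 0;
    std::multiset<uint64_t> Barriers;

    void AcquireCollectBarrier(uint64_t commitId)
    {
        Barriers.insert(commitId);
    }

    void ReleaseCollectBarrier(uint64_t commitId)
    {
        auto it = Barriers.find(commitId);
        assert(it != Barriers.end());
        Barriers.erase(it);
    }

    // Assumed rule: never collect at or past the oldest outstanding barrier.
    uint64_t GetCollectCommitId() const
    {
        const uint64_t limit =
            Barriers.empty() ? InvalidCommitId : *Barriers.begin() - 1;
        return std::min(CurrentCommitId, limit);
    }

    // Returns false when the request would move the collect point backwards,
    // which is the situation reported as an error in this issue.
    bool CollectGarbage()
    {
        const uint64_t commitId = GetCollectCommitId();
        if (commitId < LastCollectCommitId) {
            return false;
        }
        LastCollectCommitId = commitId;
        return true;
    }
};

// Replays steps 1-5: the second CollectGarbage picks 41 after 42 was sent.
bool ReplayRace()
{
    TModelTablet tablet;
    const bool first = tablet.CollectGarbage();            // step 1
    tablet.AcquireCollectBarrier(tablet.CurrentCommitId);  // step 2
    const bool second = tablet.CollectGarbage();           // steps 3-5
    return first && !second;
}
```

In this model `ReplayRace()` returns true, meaning the first request is accepted and the second one is rejected because 41 < 42.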

Generating a new CommitId when Cleanup is executed would solve the issue.
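
A minimal sketch of why a fresh CommitId helps, under the same simplifying assumptions as the description above (a single barrier, collect point equal to "barrier minus one"). `TFixedModel`, `StartCleanup` and `CollectWouldRegress` are hypothetical names and do not correspond to functions in the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical model of the state right after the first CollectGarbage.
struct TFixedModel
{
    uint64_t CurrentCommitId = 42;
    uint64_t LastCollectCommitId = 42;  // sent by the first CollectGarbage
    std::optional<uint64_t> Barrier;    // at most one barrier, for brevity

    uint64_t GenerateCommitId()
    {
        return ++CurrentCommitId;
    }

    // Cleanup either reuses the current commit id (the behaviour described
    // in this issue) or generates a new one (the proposed change).
    void StartCleanup(bool generateNewCommitId)
    {
        assert(!Barrier.has_value());
        Barrier = generateNewCommitId ? GenerateCommitId() : CurrentCommitId;
    }

    uint64_t GetCollectCommitId() const
    {
        return Barrier ? std::min(CurrentCommitId, *Barrier - 1)
                       : CurrentCommitId;
    }
};

// True when a CollectGarbage issued during Cleanup would go backwards.
bool CollectWouldRegress(bool generateNewCommitId)
{
    TFixedModel tablet;
    tablet.StartCleanup(generateNewCommitId);
    return tablet.GetCollectCommitId() < tablet.LastCollectCommitId;
}
```

With the barrier placed at a newly generated commit id (43 in this model), the collect point stays at 42 and no longer falls below LastCollectCommitId; with the reused id it drops to 41.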