1. CollectGarbage is executed with commitId = GetCurrentCommitId() = 42.
2. Cleanup is started. It acquires a collect barrier with commitId = 42.
3. During the Cleanup transaction, another CollectGarbage is executed. Its CollectCommitId is selected via GarbageQueue.GetCollectCommitId(): since the Cleanup barrier is held at commitId = 42, CollectCommitId will be equal to 41, which is less than LastCollectCommitId (see the sketch after this list).
4. Generating a new CommitId on the Cleanup execution will solve the issue.
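A minimal sketch of the suspected arithmetic follows. It assumes that the collect commit id is taken as one less than the smallest acquired barrier (or the current commit id when no barrier is held); the helper type and method names are illustrative, not the actual TIndexTablet code.

```cpp
// Illustrative sketch of the barrier arithmetic described above; NOT the real NBS code.
#include <cassert>
#include <cstdint>
#include <iostream>
#include <set>

struct TBarrierSetSketch {
    std::multiset<uint64_t> Barriers;

    void AcquireBarrier(uint64_t commitId) { Barriers.insert(commitId); }

    uint64_t GetCollectCommitId(uint64_t currentCommitId) const {
        // Assumption: collect at (smallest held barrier - 1), or at the
        // current commit id when no barriers are held.
        return Barriers.empty() ? currentCommitId : *Barriers.begin() - 1;
    }
};

int main() {
    TBarrierSetSketch barriers;

    // 1. First CollectGarbage: no barriers held, collect at commitId = 42.
    const uint64_t lastCollectCommitId = barriers.GetCollectCommitId(42); // 42

    // 2. Cleanup acquires a collect barrier with the same commitId = 42.
    barriers.AcquireBarrier(42);

    // 3. Another CollectGarbage while Cleanup is in flight: 42 - 1 = 41.
    const uint64_t nextCollectCommitId = barriers.GetCollectCommitId(43); // 41

    std::cout << "last = " << lastCollectCommitId
              << ", next = " << nextCollectCommitId << std::endl;
    assert(nextCollectCommitId < lastCollectCommitId); // the sequence went backwards
}
```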
The main problem is that FlushBytes acquires a collect barrier that is less than the last collect commit id.
Consider the following sequence of writes:

    Write(0, 256 KiB, 'a')       -> Blob(commitId = 41)
    Write(256 KiB, 256 KiB, 'b') -> Blob(commitId = 42)
    Write(512 KiB, 1, 'f')       -> FreshBytes(commitId = 43)
    Write(0, 256 KiB, 'c')       -> Blob(commitId = 44)
This will lead to the following file layout: [ccccccc][bbbbbbb][f]
After the CollectGarbage is executed, all three new blobs will get a KeepFlag, and the last collect commit id will be equal to 44:
    CommitId:  41       42       43          44
               Blob(a)  Blob(b)  FreshBytes  Blob(c)
                                             |
                                             LastCollectCommitId
After execution of the Cleanup operation, the first blob will be marked as garbage
Now let us execute the FlushBytes operation. It will acquire a collect barrier equal to the minimal commitId associated with fresh blobs: https://github.com/ydb-platform/nbs/blob/836a5162f99b998582dc4b476a212619954bfa22/cloud/filestore/libs/storage/tablet/tablet_actor_flush_bytes.cpp#L669-L671
After this acquisition there will be one barrier, equal to 43:
    CommitId:  41       42       43          44
               Blob(a)  Blob(b)  FreshBytes  Blob(c)
                                 |           |
                                 Barrier     LastCollectCommitId
After that, a CollectGarbage request with one new garbage blob will be sent, leading to a decrease in the collect commit id sequence: 42 after 44.
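The same arithmetic, traced for this FlushBytes scenario, is sketched below under the assumption that the barrier is the minimal commit id among the fresh byte blobs and that the follow-up CollectGarbage collects at (barrier - 1); the names are illustrative, not the tablet's actual code.

```cpp
// Illustrative trace of the FlushBytes scenario above; not the actual tablet code.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // Commit ids from the example: blobs at 41, 42, 44; fresh bytes at 43.
    std::vector<uint64_t> freshBlobCommitIds = {43};
    const uint64_t lastCollectCommitId = 44; // set by the previous CollectGarbage

    // FlushBytes acquires a collect barrier at the minimal fresh blob commit id.
    const uint64_t barrier = *std::min_element(
        freshBlobCommitIds.begin(), freshBlobCommitIds.end()); // 43

    // The follow-up CollectGarbage then collects at (barrier - 1) = 42.
    const uint64_t collectCommitId = barrier - 1;

    std::cout << "collectCommitId = " << collectCommitId
              << ", lastCollectCommitId = " << lastCollectCommitId << std::endl;
    // 42 comes after 44: the collect commit id sequence is no longer monotonic.
}
```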
To reproduce the issue, one can use fio:
    fio --name=random-write-test \
        --ioengine=libaio \
        --rw=randwrite \
        --bs=512-4k \
        --size=1G \
        --direct=1 \
        --iodepth=16 \
        --numjobs=4 \
        --offset_increment=512 \
        --do_verify=0 \
        --time_based \
        --runtime=$[120*60*60]
AppCriticalEvents/CollectGarbageError errors after starting the aforementioned fio:
AppCriticalEvents/CollectGarbageError errors after deploying fix #1919:
AppCriticalEvents/CollectGarbageError after release to our production cluster:
Errors like the following started causing IndexTablet to restart.
It looks like CollectGarbage requests sent by TIndexTablet do not guarantee the increasing order of (gen, step).
We started seeing this error much more often after enabling vhost-side reads on the whole cluster.