ydb-platform / nbs

Network Block Store
Apache License 2.0

[NBS] DiskAgentIoDuringSecureErase crit event occurred #975

Status: Closed (gy2411 closed 2 months ago)

gy2411 commented 5 months ago

The timeline is the following:

1) A DestroyVolume request is sent

2) The DestroyVolume request completes

3) Secure erase starts for one of this volume's devices

4) Several errors like this one occur:

13 Apr 13:48:47.113
CRITICAL_EVENT:AppCriticalEvents/DiskAgentIoDuringSecureErase: Device=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, ClientId=migration, StartIndex=2237440, BlocksCount=1024, IsWrite=0, IsRdma=0

Note that ClientId=migration here and the I/O is a read (IsWrite=0).

5) Secure erase for this device starts again and then finishes.

Logs: DiskAgentIoDuringSecureErase_logs.txt

This looks like a bug. We need to find the cause and fix it.

qkrorlqr commented 5 months ago

It seems that a resync operation was in flight when the erase process started.

https://github.com/ydb-platform/nbs/blob/main/cloud/blockstore/libs/storage/service/service_actor_destroy.cpp: this code doesn't ask the volume to stop all operations before destruction, so such a race looks possible. If ModifyVolumeResponse arrives before VolumeActor is killed, the erase process may start before all in-flight volume operations have finished.

I see two solutions here:

  1. Simple but a bit dirty: we can delay erase operations by, say, 1 minute (configurable via nbs-storage.txt) after disk deallocation
  2. TDestroyVolumeActor may explicitly ask TVolumeActor to stop all operations before deallocating the disk
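
To make the two options concrete, here is a minimal, self-contained C++ sketch. It is not the NBS code: `TVolumeSketch`, `TDestroyVolumeSketch`, `StopAndDrain`, and `IsEraseAllowed` are hypothetical names invented for illustration, and the real actors exchange messages rather than calling each other directly. Option 1 is modeled as a grace-period check before erase; option 2 is modeled as a volume that rejects new I/O once asked to stop and reports when the last in-flight request has completed, so that deallocation (and therefore secure erase) can only follow a drained volume.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <utility>

// Option 1 (hypothetical helper): allow secure erase only after a
// configurable grace period has passed since the disk was deallocated.
bool IsEraseAllowed(
    std::chrono::steady_clock::time_point deallocatedAt,
    std::chrono::steady_clock::time_point now,
    std::chrono::seconds delay)
{
    return now - deallocatedAt >= delay;
}

// Option 2 (hypothetical stand-in for the volume): tracks in-flight I/O,
// including background reads such as migration or resync.
class TVolumeSketch {
public:
    // Returns false once a stop was requested: no new I/O is accepted.
    bool TryStartIo() {
        if (Stopping) {
            return false;
        }
        ++Inflight;
        return true;
    }

    void OnIoCompleted() {
        assert(Inflight > 0);
        --Inflight;
        MaybeNotifyDrained();
    }

    // Stop accepting I/O; onDrained fires once nothing is in flight.
    void StopAndDrain(std::function<void()> onDrained) {
        Stopping = true;
        OnDrained = std::move(onDrained);
        MaybeNotifyDrained();
    }

    std::uint32_t InflightCount() const {
        return Inflight;
    }

private:
    void MaybeNotifyDrained() {
        if (Stopping && Inflight == 0 && OnDrained) {
            auto callback = std::move(OnDrained);
            OnDrained = nullptr;  // make sure the callback fires only once
            callback();
        }
    }

    bool Stopping = false;
    std::uint32_t Inflight = 0;
    std::function<void()> OnDrained;
};

// Hypothetical stand-in for the destroy flow: the disk is deallocated only
// after the volume confirms that all in-flight operations have finished.
struct TDestroyVolumeSketch {
    bool Deallocated = false;

    void Run(TVolumeSketch& volume) {
        volume.StopAndDrain([this] { Deallocated = true; });
    }
};
```

In this sketch, a migration read that started before `Run` keeps `Deallocated` false until its `OnIoCompleted` call, which is the ordering option 2 asks for. Option 1 only makes the race unlikely, since an operation may still outlive the delay.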