Closed by budevg 5 months ago
The problem was the order of destruction during shutdown.
Stopping the ActorSystem stops the disk agent actor. This actor in turn stops the task queue that RdmaServer uses to handle incoming requests. https://github.com/ydb-platform/nbs/blob/53dff09c97e178526f88cc318090bf8804737c8b/cloud/blockstore/libs/storage/disk_agent/rdma_target.cpp#L947
At that point RdmaServer can still receive incoming requests, and it will try to submit them to the task queue that has already been stopped. https://github.com/ydb-platform/nbs/blob/53dff09c97e178526f88cc318090bf8804737c8b/cloud/blockstore/libs/storage/disk_agent/rdma_target.cpp#L166
This causes a deadlock: the task queue (ThreadPool) has been stopped, so no thread is left to process the submitted request. https://github.com/ydb-platform/nbs/blob/53dff09c97e178526f88cc318090bf8804737c8b/cloud/storage/core/libs/common/thread_pool.cpp#L264
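To make the failure mode concrete, here is a minimal sketch of a request that never completes because it was enqueued after the queue's worker had exited. `ToyTaskQueue` and `RequestHangsAfterStop` are hypothetical names written for this illustration. They are not the NBS `ThreadPool` or `RdmaServer` code, and the assumption that a stopped queue still accepts submissions is taken from the description above rather than from the real implementation.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical toy queue (NOT the NBS ThreadPool): a single worker thread.
// Stop() makes the worker exit, but Submit() still accepts tasks afterwards.
class ToyTaskQueue {
public:
    ToyTaskQueue() : Worker([this] { Run(); }) {}
    ~ToyTaskQueue() { Stop(); }

    void Stop() {
        {
            std::lock_guard<std::mutex> guard(Lock);
            Stopped = true;
        }
        Ready.notify_all();
        if (Worker.joinable()) {
            Worker.join();
        }
    }

    // Enqueueing always succeeds; after Stop() nothing will ever run the task.
    std::future<int> Submit(std::function<int()> fn) {
        auto task = std::make_shared<std::packaged_task<int()>>(std::move(fn));
        auto result = task->get_future();
        {
            std::lock_guard<std::mutex> guard(Lock);
            Tasks.emplace_back([task] { (*task)(); });
        }
        Ready.notify_one();
        return result;
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> guard(Lock);
                Ready.wait(guard, [this] { return Stopped || !Tasks.empty(); });
                if (Stopped) {
                    return;  // pending tasks are left behind
                }
                task = std::move(Tasks.front());
                Tasks.pop_front();
            }
            task();
        }
    }

    std::mutex Lock;
    std::condition_variable Ready;
    std::deque<std::function<void()>> Tasks;
    bool Stopped = false;
    std::thread Worker;  // declared last so the other members exist first
};

// Returns true if a request submitted after Stop() is still pending
// once `timeout` has elapsed.
bool RequestHangsAfterStop(std::chrono::milliseconds timeout) {
    ToyTaskQueue queue;

    // Sanity check: before Stop(), a submitted request completes.
    auto early = queue.Submit([] { return 1; });
    int earlyResult = early.get();
    assert(earlyResult == 1);
    (void)earlyResult;

    queue.Stop();
    auto late = queue.Submit([] { return 42; });
    return late.wait_for(timeout) == std::future_status::timeout;
}
```

The sketch uses a bounded `wait_for` so that it terminates. A caller that waits without a timeout would block indefinitely, which is the hang described above.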
The solution is to stop listening for new RDMA server sessions and to disconnect all existing RDMA server sessions while the disk agent actor is stopping. After that, it is safe to stop the RDMA server.
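The ordering described above can be sketched as a small single-threaded model. All names here (`ToyQueue`, `ToyServer`, and the two helper functions) are hypothetical and only model the order of operations. They do not reflect the actual NBS classes or the change made in the fix.

```cpp
#include <cassert>

// Hypothetical model of the task queue: it only counts requests that were
// enqueued after Stop(), i.e. requests that no thread would ever process.
struct ToyQueue {
    bool Stopped = false;
    int Stranded = 0;

    void Submit() {
        if (Stopped) {
            ++Stranded;
        }
    }
    void Stop() { Stopped = true; }
};

// Hypothetical model of the server side: sessions are the only source of
// requests, so closing them first guarantees nothing reaches the queue later.
class ToyServer {
public:
    explicit ToyServer(ToyQueue& queue) : Queue(queue) {}

    bool Accept() {
        if (!Listening) {
            return false;
        }
        ++Sessions;
        return true;
    }

    // A request arriving on a session; dropped when no session is left.
    bool OnRequest() {
        if (Sessions == 0) {
            return false;
        }
        Queue.Submit();
        return true;
    }

    void StopListening() { Listening = false; }
    void DisconnectAll() { Sessions = 0; }

private:
    ToyQueue& Queue;
    bool Listening = true;
    int Sessions = 0;
};

// Safe order: stop listening, disconnect sessions, then stop the queue.
int StrandedAfterSafeShutdown() {
    ToyQueue queue;
    ToyServer server(queue);
    server.Accept();

    server.StopListening();
    server.DisconnectAll();
    queue.Stop();

    bool accepted = server.Accept();      // late connection attempt
    bool submitted = server.OnRequest();  // late request
    assert(!accepted && !submitted);
    (void)accepted;
    (void)submitted;
    return queue.Stranded;
}

// Problematic order: the queue is stopped while a session is still open.
int StrandedAfterUnsafeShutdown() {
    ToyQueue queue;
    ToyServer server(queue);
    server.Accept();

    queue.Stop();
    server.OnRequest();  // reaches the stopped queue
    return queue.Stranded;
}
```

In this model the safe order leaves no stranded requests, while stopping the queue first strands the request that arrives on the still-open session.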
Solved by #625.
When restarting the disk agent through
systemctl restart blockstore-disk-agent
the agent sometimes becomes stuck and does not stop gracefully.