ydb-platform / nbs

Network Block Store
Apache License 2.0

[NBS] disk agent stuck during RdmaServer stop #592

Closed budevg closed 5 months ago

budevg commented 6 months ago

When restarting the disk agent through systemctl restart blockstore-disk-agent, the agent sometimes becomes stuck and won't stop gracefully. After the stop timeout expires, systemd kills it with SIGKILL.

Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29-13-23-38 :BLOCKSTORE_SERVER INFO: cloud/blockstore/libs/disk_agent/bootstrap.cpp:554: Stopped Scheduler
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29-13-23-38 :BLOCKSTORE_SERVER INFO: cloud/blockstore/libs/disk_agent/bootstrap.cpp:556: Stopped FileIOService
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.807405Z :BLOCKSTORE_DISK_AGENT INFO: Poisoned
Feb 29 13:23:38 host systemd[1]: Stopping Blockstore Disk Agent...
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.807656Z :TX_PROXY WARN: actor# [50317:7339938168503293722:12] HANDLE TEvClientDestroyed from tablet# 72057594046447617
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.807669Z :BLOCKSTORE_HIVE_PROXY ERROR: Pipe to hive72057594037968897 has been reset
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.807670Z :TX_PROXY WARN: actor# [50317:7339938168503293722:12] HANDLE TEvClientDestroyed from tablet# 72057594046447619
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.807768Z :INTERCONNECT_NETWORK NOTICE: [50317 <-> 16] connection closed by peer
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.808117Z :BS_NODE ERROR: {NW42@node_warden_pipe.cpp:42} Handle(TEvTabletPipe::TEvClientDestroyed) ClientId# [50317:7340686686813851624:279] ServerId# [28:7340686687844567958:3809] TabletId# 72057594037932033 PipeClientId# [50317:7340686686813851624:279]
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.808142Z :TX_PROXY WARN: actor# [50317:7339938168503293722:12] HANDLE TEvClientDestroyed from tablet# 72057594046447620
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.808349Z :INTERCONNECT_NETWORK NOTICE: [50317 <-> 28] connection closed by peer
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.808352Z :INTERCONNECT_NETWORK NOTICE: [50317 <-> 6] connection closed by peer
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.808408Z :INTERCONNECT_NETWORK NOTICE: [50317 <-> 32] connection closed by peer
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.808609Z :TX_PROXY WARN: actor# [50317:7339938168503293722:12] HANDLE TEvClientDestroyed from tablet# 72057594046447618
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.808739Z :INTERCONNECT_NETWORK NOTICE: [50317 <-> 20] connection closed by peer
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.808893Z :INTERCONNECT_NETWORK NOTICE: [50317 <-> 24] connection closed by peer
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.808949Z :INTERCONNECT_NETWORK NOTICE: [50317 <-> 13] connection closed by peer
Feb 29 13:23:38 host NBS_DISK_AGENT[274733]: 2024-02-29T13:23:38.809015Z :INTERCONNECT_NETWORK NOTICE: [50317 <-> 11] connection closed by peer
Feb 29 13:23:39 host NBS_DISK_AGENT[274733]: 2024-02-29-13-23-39 :BLOCKSTORE_SERVER INFO: cloud/blockstore/libs/disk_agent/bootstrap.cpp:557: Stopped ActorSystem
Feb 29 13:25:08 host systemd[1]: blockstore-disk-agent.service: State 'stop-sigterm' timed out. Killing.
Feb 29 13:25:08 host systemd[1]: blockstore-disk-agent.service: Killing process 274733 (blocksto.Main) with signal SIGKILL.
Feb 29 13:25:09 host systemd[1]: blockstore-disk-agent.service: Main process exited, code=killed, status=9/KILL
Feb 29 13:25:09 host systemd[1]: blockstore-disk-agent.service: Failed with result 'timeout'.
Feb 29 13:25:09 host systemd[1]: Stopped Blockstore Disk Agent.
Feb 29 13:25:09 host systemd[1]: blockstore-disk-agent.service: Consumed 1w 1d 17h 12min 480ms CPU time.
Feb 29 13:25:09 host systemd[1]: Starting Blockstore Disk Agent...

budevg commented 6 months ago

The problem was in the order of destruction.

Stopping ActorSystem stops the disk agent actor. When it stops, this actor also stops the task queue that RdmaServer uses to handle incoming requests: https://github.com/ydb-platform/nbs/blob/53dff09c97e178526f88cc318090bf8804737c8b/cloud/blockstore/libs/storage/disk_agent/rdma_target.cpp#L947

At this point RdmaServer can still receive incoming requests, and it will try to submit them to the task queue that has already been stopped: https://github.com/ydb-platform/nbs/blob/53dff09c97e178526f88cc318090bf8804737c8b/cloud/blockstore/libs/storage/disk_agent/rdma_target.cpp#L166

This causes a deadlock: the task queue (ThreadPool) has been stopped, so no thread is left to process the submitted request: https://github.com/ydb-platform/nbs/blob/53dff09c97e178526f88cc318090bf8804737c8b/cloud/storage/core/libs/common/thread_pool.cpp#L264
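
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the NBS ThreadPool: TTinyQueue, Execute, Stop and SubmitAfterStopHangs are hypothetical names, and the sketch assumes a queue that still accepts tasks after its only worker has exited. Under that assumption, a request submitted after Stop() is never run, so anything waiting on its result waits forever (the sketch uses a timeout so that it terminates).

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical single-worker task queue, for illustration only.
class TTinyQueue {
public:
    TTinyQueue() : Worker([this] { Run(); }) {}
    ~TTinyQueue() { Stop(); }

    // Enqueues a task and returns a future for its result.
    // Assumption: there is no "stopped" check, so tasks are accepted
    // even when no worker is left to run them.
    std::future<int> Execute(std::function<int()> fn) {
        auto task = std::make_shared<std::packaged_task<int()>>(std::move(fn));
        auto result = task->get_future();
        {
            std::lock_guard<std::mutex> guard(Lock);
            Tasks.push([task] { (*task)(); });
        }
        CondVar.notify_one();
        return result;
    }

    // Stops the worker without draining pending tasks. Safe to call twice.
    void Stop() {
        {
            std::lock_guard<std::mutex> guard(Lock);
            if (Stopped) {
                return;
            }
            Stopped = true;
        }
        CondVar.notify_all();
        Worker.join();
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> guard(Lock);
                CondVar.wait(guard, [this] { return Stopped || !Tasks.empty(); });
                if (Stopped) {
                    return;  // the worker exits; queued tasks stay queued
                }
                task = std::move(Tasks.front());
                Tasks.pop();
            }
            task();
        }
    }

    std::mutex Lock;
    std::condition_variable CondVar;
    std::queue<std::function<void()>> Tasks;
    bool Stopped = false;
    std::thread Worker;  // declared last so the other members exist first
};

// Returns true when a request submitted after Stop() does not complete
// within the timeout, i.e. a caller blocking on it would hang.
bool SubmitAfterStopHangs() {
    TTinyQueue queue;

    auto before = queue.Execute([] { return 1; });
    const int first = before.get();  // completes: the worker is alive
    assert(first == 1);
    (void)first;

    queue.Stop();

    auto after = queue.Execute([] { return 2; });  // nobody will run this
    return after.wait_for(std::chrono::milliseconds(200)) ==
           std::future_status::timeout;
}
```

In this sketch the caller gives up after 200 ms. A caller that waits without a timeout would block indefinitely, which matches the process staying alive until systemd's stop timeout in the log above.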

The solution is to stop listening for new RDMA server sessions and to disconnect all existing RDMA server sessions while the disk agent actor is stopping. After that it is safe to stop the RDMA server.
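
The ordering proposed above can be modelled with a small sketch. It only illustrates the sequence of steps and is not the change made in the actual fix; TFakeTaskQueue, TFakeRdmaEndpoint, StopDiskAgentActor and all of their members are hypothetical stand-ins that do not exist in the NBS code base.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the task queue owned by the disk agent actor.
struct TFakeTaskQueue {
    bool Running = true;
    int Accepted = 0;  // requests submitted while the queue was running

    void Stop() { Running = false; }
};

// Hypothetical stand-in for the RDMA endpoint that feeds the task queue.
struct TFakeRdmaEndpoint {
    explicit TFakeRdmaEndpoint(TFakeTaskQueue& queue) : Queue(queue) {}

    // Returns false when no client can reach the endpoint any more.
    bool HandleRequest() {
        if (!Listening && Sessions == 0) {
            return false;
        }
        if (Queue.Running) {
            ++Queue.Accepted;
        } else {
            ++Stranded;  // submitted to a stopped queue: never processed
        }
        return true;
    }

    void StopListening() { Listening = false; }
    void DisconnectSessions() { Sessions = 0; }

    TFakeTaskQueue& Queue;
    bool Listening = true;
    int Sessions = 2;
    int Stranded = 0;
};

// Sketch of the proposed order: cut off every source of requests first,
// and only then stop the queue that would have served them.
std::vector<std::string> StopDiskAgentActor(
    TFakeRdmaEndpoint& endpoint,
    TFakeTaskQueue& queue)
{
    assert(queue.Running);  // precondition: the queue has not been stopped yet

    std::vector<std::string> order;
    endpoint.StopListening();
    order.push_back("stop-listening");
    endpoint.DisconnectSessions();
    order.push_back("disconnect-sessions");
    queue.Stop();
    order.push_back("stop-task-queue");
    return order;
}
```

In this model, calling queue.Stop() first and then HandleRequest() increments Stranded, which corresponds to the hang. After StopDiskAgentActor(), HandleRequest() returns false and Stranded stays at zero.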

budevg commented 5 months ago

Solved by #625.