Open komarevtsev-d opened 1 month ago
CC @budevg
The problem is that the whole construction is unreliable. https://github.com/ydb-platform/nbs/blame/main/cloud/blockstore/libs/endpoints_vhost/external_vhost_server.cpp#L298 If the child process does not handle SIGUSR1, it never sends a response, and the whole infinite loop hangs waiting for data. All it takes is for the child process not to be ready to receive signals in time, and the signal is missed.
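One possible way to keep such a loop from hanging is to bound the wait for the child's reply. The sketch below is only an illustration under assumptions: `ReadWithTimeout` is a hypothetical helper that does not exist in `external_vhost_server.cpp`, and it assumes the stats arrive over a plain file descriptor such as a pipe. It uses `poll()` to wait for data for a limited time, so the caller can resend the signal or report an error when nothing arrives.

```cpp
#include <poll.h>
#include <unistd.h>

#include <cerrno>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper, not taken from external_vhost_server.cpp.
// Waits up to timeoutMs for data on fd and reads at most maxLen bytes.
// Returns std::nullopt on timeout or error, an empty string on EOF.
std::optional<std::string> ReadWithTimeout(
    int fd, std::size_t maxLen, int timeoutMs)
{
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;

    int rc;
    do {
        // Note: a retry after EINTR restarts the full timeout.
        rc = poll(&pfd, 1, timeoutMs);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0) {
        // rc == 0: the peer sent nothing in time (e.g. it missed the signal).
        // rc < 0: poll itself failed.
        return std::nullopt;
    }

    std::string buf(maxLen, '\0');
    ssize_t n;
    do {
        n = read(fd, buf.data(), buf.size());
    } while (n < 0 && errno == EINTR);

    if (n < 0) {
        return std::nullopt;
    }
    buf.resize(static_cast<std::size_t>(n));
    return buf;
}
```

With a helper like this, the stats loop could treat `std::nullopt` as "no answer to this SIGUSR1" and retry or log, instead of blocking forever on the read.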
There is a rare issue with metrics delivery from the vhost-server to NBS, which can occur after a restart of the NBS service. For some reason, the `ReadStatsImpl` loop doesn't send the `SIGUSR1` signal. There were no logs with `Read stats error...`, so the loop didn't exit on an error. Tracing system calls with `sudo strace -p <nbs_pid> -e kill -f -tt` showed that NBS was sending signals to only one of the two vhost-server processes. Other than that, the two disks were working just fine.