ydb-platform / nbs

Network Block Store
Apache License 2.0

[NBS] Losing local disks metrics after nbs restart #1242

Open komarevtsev-d opened 1 month ago

komarevtsev-d commented 1 month ago

There is a rare issue with metrics delivery from the vhost-server to NBS, which can occur after the NBS service restarts. For some reason, the ReadStatsImpl loop does not send the SIGUSR1 signal. There were no log entries with Read stats error..., so the loop did not exit on an error. Tracing system calls with sudo strace -p <nbs_pid> -e kill -f -tt showed that NBS was sending signals to only one of the two vhost-server processes. Other than that, both disks were working fine.

komarevtsev-d commented 1 month ago

CC @budevg

drbasic commented 6 days ago

The problem is that the whole construction is unreliable. https://github.com/ydb-platform/nbs/blame/main/cloud/blockstore/libs/endpoints_vhost/external_vhost_server.cpp#L298 If the child process does not handle SIGUSR1, it never sends a reply, and the whole infinite loop hangs waiting for data. All it takes is for the child process to not be ready to receive signals in time: the signal is then missed.
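
To illustrate the failure mode and one possible way around it, here is a minimal sketch, not the actual NBS code. It assumes the parent requests stats by sending SIGUSR1 and then reads the reply from a pipe-like descriptor. The names ReadWithTimeout and RequestStats, the buffer size, and the retry policy are all hypothetical. The idea is to bound the wait with poll() so that a missed signal leads to a retry rather than a read that blocks forever.

```cpp
#include <poll.h>
#include <signal.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper (not part of NBS): wait up to timeoutMs for fd to become
// readable, then read whatever is available in a single read() call.
// Returns std::nullopt on timeout, poll/read error, or EOF.
std::optional<std::string> ReadWithTimeout(int fd, int timeoutMs)
{
    // A negative timeout would make poll() block indefinitely, which is
    // exactly the hang this sketch is meant to avoid.
    assert(timeoutMs >= 0);

    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;

    int rc;
    do {
        // Note: after EINTR the full timeout is restarted (kept simple here).
        rc = poll(&pfd, 1, timeoutMs);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0 || !(pfd.revents & POLLIN)) {
        return std::nullopt;  // timed out, or poll failed
    }

    char buf[4096];
    const ssize_t n = read(fd, buf, sizeof(buf));
    if (n <= 0) {
        return std::nullopt;  // read error or EOF
    }
    return std::string(buf, static_cast<std::size_t>(n));
}

// Hypothetical request loop: signal the child, wait for a reply with a
// timeout, and re-send the signal if nothing arrives. A child that was not
// yet ready for SIGUSR1 gets another chance on the next attempt.
std::optional<std::string> RequestStats(
    pid_t child, int replyFd, int timeoutMs, int maxAttempts)
{
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (kill(child, SIGUSR1) != 0) {
            return std::nullopt;  // child is gone or cannot be signalled
        }
        if (auto reply = ReadWithTimeout(replyFd, timeoutMs)) {
            return reply;
        }
    }
    return std::nullopt;  // give up; the caller can log and retry later
}
```

With a bounded wait, the caller can also log each timeout, which would make a silently stalled stats loop visible in the logs. Retrying is only safe if the child tolerates receiving SIGUSR1 more than once per reply; that is an assumption of this sketch.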