yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[Platform] Health checks can overlap with universe update operations started after them #7738

Open iSignal opened 3 years ago

iSignal commented 3 years ago

@SergeyPotachev @daniel-yb

We had a (rare) example of this on portal where a health check got kicked off, followed by an edit operation on the universe. The health check then reported issues for nodes taken down during the edit (full move).

2021-03-19 20:11:13,056 [DEBUG] from ShellProcessHandler in application-akka.actor.default-dispatcher-11 - Starting proc (full cmd) - 'bin/py_wrapper' 'bin/c
followed by
2021-03-19 20:11:13,056 [DEBUG] from ShellProcessHandler in application-akka.actor.default-dispatcher-11 - Starting proc (full cmd) - 'bin/py_wrapper' 'bin/c
followed by
2021-03-19 20:13:49,393 [INFO] from ShellProcessHandler in application-akka.actor.default-dispatcher-11 - Completed proc 'bin/py_wrapper bin/cluster_health.py' status=success [ 156337 ms ]
2021-03-19 20:13:49,393 [INFO] from HealthChecker in application-akka.actor.default-dispatcher-11 - Health check for universe nm-test-1 reported errors. [ 156338 ms ]

daniel-yb commented 3 years ago

Now that cluster_health.py only returns a report for the universe and the emails are sent from the Java layer, maybe we can add an extra check between the point where cluster_health.py exits and the point where we send the emails, to verify that updateInProgress == false. If an update is in progress, we don't report the results of the health check.
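
A minimal sketch of that gating logic, written in Python for brevity even though the real check would live in the platform's Java layer. Every name here (`Universe`, `HealthReport`, `load_universe`, `send_email`, `report_if_idle`) is hypothetical and only illustrates the idea; none of it is the platform's actual API. The key point is that the universe state is re-read after the health check finishes rather than reusing the state seen when the check started.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Universe:
    # Hypothetical stand-in for the platform's universe record.
    name: str
    update_in_progress: bool = False


@dataclass
class HealthReport:
    # Hypothetical stand-in for the report returned by cluster_health.py.
    universe_name: str
    has_errors: bool


def report_if_idle(
    report: HealthReport,
    load_universe: Callable[[str], Universe],
    send_email: Callable[[HealthReport], None],
) -> bool:
    """Send the report only if no update is running on the universe.

    Returns True if the report was sent, False if it was discarded.
    """
    # Re-read the universe now, after the health check has exited, so that
    # an edit started while the check was running is visible here.
    universe = load_universe(report.universe_name)
    if universe.update_in_progress:
        # Nodes may have been taken down by the edit; results are unreliable.
        return False
    send_email(report)
    return True
```

This narrows the window but does not close it entirely: an update could still start between the re-read and the email being sent, so a stricter fix would need some form of locking or a recheck on the update side.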