[Platform] Stack trace failure on one of checks in health-check

tylarb commented 3 years ago

This issue occurs on a universe with redis API is not enabled.

Platform is version 2.4.0

42321 2021-01-29 11:31:56,260 INFO: Checking cqlsh works for node []
42322 Traceback (most recent call last):
42323   File "bin/cluster_health.py", line 865, in <module>
42324     main()
42325   File "bin/cluster_health.py", line 833, in main
42326     entries = coordinator.run()
42327 2021-01-29 11:31:56,260 INFO: Checking for fatals on node []
42328   File "bin/cluster_health.py", line 706, in run
42329     check.entry = check.result.get() if check.result else None
42330   File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
42331 2021-01-29 11:31:56,263 INFO: Checking redis cli works for node []
42332     raise self._value
42333 ValueError: invalid literal for int() with base 10: 'handl'

SergeyPotachev commented 3 years ago

Line 42332 goes after 42330 and belongs to the cqlsh check. After this the redis cli check is started. Here we have slightly mixed lines due to simultaneous output to log from two different threads. But it looks slightly strange if we are talking about a universe with redis API disabled - as we have a direct check for such setting:

                if c.enable_yedis:
                    coordinator.add_check(checker, "check_redis_cli")

Need more logs - in logs above we have a full list of health-check parameters passed to the script. Additionally in logs near the exception usually we can find a string caused the problem (incorrect answer on our cqlsh command).

tylarb commented 3 years ago

Longer logs:

  1 2021-01-29 11:36:54,107 [INFO] from com.yugabyte.yw.commissioner.HealthChecker in application-akka.actor.default-dispatcher-2343 - Doing health check for universe: -Primary
  2 2021-01-29 11:36:54,109 [INFO] from com.yugabyte.yw.common.DevopsBase in application-akka.actor.default-dispatcher-2343 - Command to run: [bin/py_wrapper bin/cluster_health.py --cluster_payload [{"identityFile":"/opt/yugabyte/data/keys/f3304b49-c15e-    4b2b-af1e-7535901124e8/on-premises-dc-key.pem","sshPort":22,"namespaceToConfig":{},"masterNodes":{"[IP].16.38":"yb-dev--Primary-n4","[IP].16.78":"yb-dev--Primary-n3","[IP].16.15":"yb-dev--Primary-n2"},"tserverNodes":{"[IP].16.80":"yb-dev--Primary-n10    ","[IP].16.81":"yb-dev--Primary-n12","[IP].16.38":"yb-dev--Primary-n4","[IP].16.37":"yb-dev--Primary-n1","[IP].16.77":"yb-dev--Primary-n5","[IP].16.78":"yb-dev--Primary-n3","[IP].16.79":"yb-dev--Primary-n9","[IP].16.14":"yb-dev--Primary-n6","[IP].16.    2":"yb-dev--Primary-n7","[IP].16.43":"yb-dev--Primary-n11","[IP].16.15":"yb-dev--Primary-n2","[IP].16.42":"yb-dev--Primary-n8"},"ybSoftwareVersion":"2.4.0.0-b60","enableTlsClient":true,"enableYSQL":false,"ysqlPort":5433,"ycqlPort":9042,"enableYEDIS":    true,"redisPort":6379}] --universe_name -Primary --customer_tag [yw-dev][dev] --destination yugabyteoperations@ --start_time_ms 1611933640267 --send_status --report_only_errors]
  3 2021-01-29 11:36:56,397 [ERROR] from com.yugabyte.yw.commissioner.HealthChecker in application-akka.actor.default-dispatcher-2343 - Health check script got error: code (1)
  4 2021-01-29 11:36:54,234 INFO: Script started at 2021-01-29 08:36:54
  5 2021-01-29 11:36:54,236 INFO: Checking uptime for [IP].16.42 process yb-tserver
  6 2021-01-29 11:36:54,238 INFO: Checking cqlsh works for node [IP].16.42
  7 2021-01-29 11:36:54,238 INFO: Checking for fatals on node [IP].16.42
  8 2021-01-29 11:36:54,238 INFO: Checking redis cli works for node [IP].16.42
  9 2021-01-29 11:36:54,240 INFO: Checking disk utilization on node [IP].16.42
 10 2021-01-29 11:36:54,240 INFO: Checking for core files on node [IP].16.42
 11 2021-01-29 11:36:54,241 INFO: Checking for open file descriptors on node [IP].16.42
 12 2021-01-29 11:36:54,244 INFO: Checking uptime for [IP].16.78 process yb-master
 13 2021-01-29 11:36:54,246 INFO: Checking for fatals on node [IP].16.78
 14 2021-01-29 11:36:54,247 INFO: Checking uptime for [IP].16.78 process yb-tserver
 15 2021-01-29 11:36:55,249 INFO: Checking for fatals on node [IP].16.78
 16 2021-01-29 11:36:55,250 INFO: Checking cqlsh works for node [IP].16.78
 17 2021-01-29 11:36:55,250 INFO: Checking redis cli works for node [IP].16.78
 18 2021-01-29 11:36:55,251 INFO: Checking disk utilization on node [IP].16.78
 19 2021-01-29 11:36:55,252 INFO: Checking for core files on node [IP].16.78
 20 2021-01-29 11:36:55,253 INFO: Checking for open file descriptors on node [IP].16.78
 21 2021-01-29 11:36:55,253 INFO: Checking uptime for [IP].16.79 process yb-tserver
 22 2021-01-29 11:36:55,253 INFO: Checking for fatals on node [IP].16.79
 23 2021-01-29 11:36:55,254 INFO: Checking cqlsh works for node [IP].16.79
 24 2021-01-29 11:36:55,259 INFO: Checking redis cli works for node [IP].16.79
 25 2021-01-29 11:36:56,258 INFO: Checking disk utilization on node [IP].16.79
 26 2021-01-29 11:36:56,258 INFO: Checking for core files on node [IP].16.79
 27 2021-01-29 11:36:56,260 INFO: Checking for open file descriptors on node [IP].16.79
 28 2021-01-29 11:36:56,263 INFO: Checking uptime for [IP].16.81 process yb-tserver
 29 Traceback (most recent call last):
 30   File "bin/cluster_health.py", line 865, in <module>
 31     main()
 32   File "bin/cluster_health.py", line 833, in main
 33     entries = coordinator.run()
 34   File "bin/cluster_health.py", line 706, in run
 35     check.entry = check.result.get() if check.result else None
 36   File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
 37     raise self._value
 38 ValueError: invalid literal for int() with base 10: 'handl'
 39 2021-01-29 11:36:56,264 INFO: Checking for fatals on node [IP].16.81
 40 2021-01-29 11:36:56,264 INFO: Checking cqlsh works for node [IP].16.81
 41 2021-01-29 11:36:56,397 [INFO] from com.yugabyte.yw.commissioner.HealthChecker in application-akka.actor.default-dispatcher-2343 - Doing health check for universe: -Secondary

SergeyPotachev commented 3 years ago

As the customer doesn't have corresponding emails (with health-check errors), check the health-checks status in YW UI. The interesting moments are: 2021-01-29 11:31:56,260 and 2021-01-29 11:36:56,263. The first goal for us is to understand which one health-check subtask gives us the error.

SergeyPotachev commented 3 years ago

Closed as a duplicate of #7402.

yugabyte / yugabyte-db

[Platform] Stack trace failure on one of checks in health-check #7053