zwave-js / node-zwave-js

Z-Wave driver written entirely in JavaScript/TypeScript
https://zwave-js.github.io/node-zwave-js/
MIT License
755 stars 608 forks source link

Node health checks fail to complete all runs most of the time for run counts other than 5 or 1 #5451

Open ColColonCleaner opened 1 year ago

ColColonCleaner commented 1 year ago

Checklist

Deploy method

Docker

Z-Wave JS UI version

8.7.0

ZwaveJS version

10.3.1

Describe the bug

The network graph's 'node health check' feature fails to complete its runs a good portion of the time when a run count other than 5 or 1 are used. For run counts of 5 and 1 it works most of the time and completes all runs, but for other values it cancels part way through most of the time and I'm not sure why. For values of 9 or higher it almost never starts. I got it to work once with a run count of 10, out of many attempts.

To Reproduce

Run node health checks with run counts above 10 for the quickest reproduction, but any values other than 5 or 1 I've noticed have a large chance of failing to complete all runs or sometimes any runs at all.

Expected behavior

Expected the health checks to complete for the number of runs entered.

Additional context

Here is a video showcasing the issue: https://www.youtube.com/watch?v=PGocFe9YRCE

ColColonCleaner commented 1 year ago

Forgot to upload the ZUI log. Here it is: z-ui_2023-01-24.log

ColColonCleaner commented 1 year ago

Think I found the reason. Exceptions while calling a node are not caught, and they also aren't displayed on the UI when the 'network graph' page is open. Setting this up with configs I expected to fail part-way through, i switched back to the main dashboard and that's where the exception was displayed when the health check broke. image Seems I'm having some decode issues. But since exceptions (just this specific one perhaps) aren't caught or displayed on the page I didn't realize it was due to an error until now.

Is it reasonable to expect that the health check would mark the ping as failed and perhaps list the unique reasons they failed at the end instead of silently cancelling the health check?

robertsLando commented 1 year ago

Apart from the UI issue not showing errors, @AlCalzone any clues?

AlCalzone commented 1 year ago

Please make a driver log, loglevel debug and attach it here as a file. This should give us a better idea why it is failing.

ColColonCleaner commented 1 year ago

I ran health checks with varying quantities of runs and varying nodes. Logs attached. z-ui_2023-01-25.log zwavejs_2023-01-25.log

AlCalzone commented 1 year ago

Ok I think I've identified a potential issue. Not sure if my planned fix is correct, because the command sequence doesn't really add up. https://github.com/zwave-js/node-zwave-js/issues/5366

ColColonCleaner commented 1 year ago

Based on what you described in that issue this may be a linchpin for the stability of my network because I see dropped commands/messages a lot, along with the decode failure message. Or is the decode failure a separate problem entirely?

AlCalzone commented 1 year ago

Not sure if that's your actual issue, but the sequence here is as follows:

that second thing is what I find weird. In any case we should probably just ignore it in that situation, because it causes the outgoing command to be re-transmitted immediately, leading to another "I didn't understand that".

ColColonCleaner commented 1 year ago

Interesting. If you have a way of pushing out a beta version with that change I'd be happy to try it and send logs.