[Investigate 700 Bug] Nodes being marked dead; may be specific to Inovelli LZW31 devices

justindthomas commented 2 years ago

Is your problem within Home Assistant (Core or Z-Wave JS Integration)?

NO, my problem is NOT within Home Assistant or the ZWave JS integration

Is your problem within ZWaveJS2MQTT?

NO, my problem is NOT within ZWaveJS2MQTT

Checklist

[X] I have checked the troubleshooting section and my problem is not described there.
[X] I have read the changelog and my problem was not mentioned there.

Describe the bug

I upgraded my Aeotec Z-Stick 7 yesterday. This morning, I noticed one of my automations did not trigger a light switch as it should have (a ZEN32). The logs indicate that a LZW31 device had a sudden burst of traffic and then the ZEN32 device was marked dead.

The death of node 85 (ZEN32) with 10 minutes for context: dead_node_85.log

The full activity of node 42 (LZW31) in that log file: node_42.log

Per the recommendation of @blhoward2, I ran a network heal. That was mostly successful with 68/69 nodes healing successfully. One did not heal - that was a different LZW31 (node 55): node_55.log

Another LZW31 was marked dead. node_31.log

I have 9 LZW31 devices (out of 69 total Zwave nodes). All are running firmware 1.57.

Device information

Manufacturer: Inovelli Model name: LZW31 Node ID in your network: 31, 42, 55

How are you using `node-zwave-js`?

[ ] zwavejs2mqtt Docker image (latest)
[ ] zwavejs2mqtt Docker image (dev)
[ ] zwavejs2mqtt Docker manually built (please specify branches)
[ ] ioBroker.zwave2 adapter (please specify version)
[ ] HomeAssistant zwave_js integration (please specify version)
[ ] pkg
[ ] node-red-contrib-zwave-js (please specify version, double click node to find out)
[X] Manually built from GitHub (please specify branch)
[ ] Other (please describe)

Which branches or versions?

version: node-zwave-js branch: master (8.11.1) zwavejs2mqtt branch: master (6.4.1)

Did you change anything?

no

If yes, what did you change?

No response

Did this work before?

Yes (please describe)

If yes, where did it work?

No problems on the Aeotec Z-Stick 5+ for the last couple of months.

Attach Driver Logfile

zwavejs_current.log

blhoward2 commented 2 years ago

Do you have a zniffer available? Or a 500-stick lying around you're willing to permanently convert to one?

justindthomas commented 2 years ago

I don't have a zniffer, but I do have a Zooz ZST10 that I've never opened. I'd be fine with converting that.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Saturday, January 29th, 2022 at 11:51 AM, blhoward2 @.***> wrote:

Do you have a zniffer available? Or a 500-stick lying around you're willing to permanently convert to one?

— Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID: @.***>

blhoward2 commented 2 years ago

It would greatly help in diagnosing this. You need windows available and the process is a bit involved.

justindthomas commented 2 years ago

Sure, just let me know what you need me to do. I avoid Windows when I can, but I can use one of our gaming PCs to get it done.

Sent from ProtonMail mobile

-------- Original Message -------- On Jan 29, 2022, 12:01 PM, blhoward2 wrote:

It would greatly help in diagnosing this. You need windows available and the process is a bit involved.

— Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID: @.***>

blhoward2 commented 2 years ago

@kpine Did you ever make a zniffer conversion guide?

The instructions are generally here: https://www.silabs.com/documents/public/user-guides/INS10249-Z-Wave-Zniffer-User-Guide.pdf. I did find the fw with the help of SiLabs. It is now located on GitHub in a stupid place: https://github.com/SiliconLabs/gecko_sdk/releases Under demo_applications.zip, under the z-wave protocol folder.

kpine commented 2 years ago

No, I don't have a guide, but the Silicon Labs guide is here. It assumes the target controller uses the same chip as the UZB3, not sure if that's true with the Zooz or not. I can say my Aeotec Z-Stick was a terrible Zniffer, night and day difference between it and the UZB3.

I don't see any 500-series zniffer firmware in the demo_applications.zip file. Do you know which file it's supposed to be? All I see are 700-series fw which only work with their dev board.

blhoward2 commented 2 years ago

It’s under the zwave protocol folder, or at least it was two weeks ago. I'm actually having trouble finding it now but I know it was there, I went back to the email they sent me.

justindthomas commented 2 years ago

Strangely, lsusb reports the Zooz sick under the Aeotec description:

  idVendor           0x0658 Sigma Designs, Inc.
  idProduct          0x0200 Aeotec Z-Stick Gen5 (ZW090) - UZB

blhoward2 commented 2 years ago

It does for my reference UZB3 from SiLabs too

kpine commented 2 years ago

I'm actually having trouble finding it now but I know it was there, I went back to the email they sent me.

What was the filename they told you?

blhoward2 commented 2 years ago

sniffer_ZW050x_USBVCP.hex

I successfully converted a UZB3 at the time so clearly I found it somewhere there.

kpine commented 2 years ago

Follow this guide: https://3.5inchfloppy.com/using-the-acc-uzb-u-sta-z-wave-controller-with-zniffer/

The link to download Zniffer is valid.

blhoward2 commented 2 years ago

They just updated the guide this week, so perhaps they moved the firmware back so it is installed with the zniffer software as it used to be:

https://community.silabs.com/s/article/How-to-get-Z-Wave-PC-Controller-and-Zniffer-tools?language=en_US

justindthomas commented 2 years ago

Exciting!

Screenshot (2)

blhoward2 commented 2 years ago

So we know, where did you find the firmware?

justindthomas commented 2 years ago

I used the guide that @kpine posted and downloaded the files as directed there. It's included as specified there in the relative link in the downloaded file from https://www.silabs.com/development-tools/wireless/z-wave/embedded-development-kit.

Now it is time to flash the Zniffer firmware. From the Flash Code Memory tab, select the hex file located at: \Zniffer_v4_57\Z-Wave_Firmware\sniffer_ZW050x_USBVCP.hex

The only hiccup I had was that SiLabs really wanted me to download the SDK first before letting me download the programmer or zniffer, so I had to capitulate to that.

blhoward2 commented 2 years ago

Ok. They must have fixed it after I asked about it. In recent releases the firmware wasn’t distributed at all and that folder didn’t exist.

justindthomas commented 2 years ago

That ZEN32 node (85) was marked dead again during a heal process about an hour ago.

dead_node_85.log

Definitely a lot going on right there. It looks like that might have been around the time that I was fiddling with the sensor in my son's room (a ZSE18) trying to get it to wake up to heal. That device's battery is now showing near 0, which is odd (it wasn't anywhere near there earlier), so that could play in to it.

I can almost buy the argument that the network is just busy and that congestion is causing problems. But the fact of the matter is that I only ever saw nodes just up and die when I was using the 700 device before I switched back to the 500. Then I saw none for the duration of using the migrated 500 device (a few weeks/couple of months) through countless heal operations, inclusions, exclusions, re-inclusions to move from unauth to S2, etc. I was quite busy - never a dead node.

Then switching back to the upgraded 700 device, I see dead nodes again. It's definitely better than it was - I've only seen it happen with this one ZEN32 and I was seeing a lot more problems before. But it seems like the problem is not fully solved.

AlCalzone commented 2 years ago

If you're able to reproduce it somehow, having a driver log and a Zniffer log of the node going "dead" would be interesting.

justindthomas commented 2 years ago

Sounds good. I'll work on figuring out a way to do that. Maybe just letting the zniffer run for 8 hours at a time or something like that.

blhoward2 commented 2 years ago

That’s really the only way. I don’t have mine running but I believe it can export versus a screen shot

justindthomas commented 2 years ago

I assume I can export a capture file like tcpdump or Wireshark. I'll work on becoming more familiar with it.

I wonder if anyone has worked on building a CLI tool like tcpdump to use the hardware or if that would be a licensing problem. It would sure be handy.

blhoward2 commented 2 years ago

Some people were working on one before the spec was released publicly. It seems that once the zniffer software was available no one bothered. Zwave.me has a custom zniffer setup in their new devices at least.

We’ve flirted with if there was a way to put the stick in promiscuous mode (like a zniffer but your network only, doesn’t require new fw)

blhoward2 commented 2 years ago

We’ve flirted with if there was a way to put the stick in promiscuous mode (like a zniffer but your network only, doesn’t require new fw)

justindthomas commented 2 years ago

It's been a little bit since I've seen a node die unexpectedly. I'm going to go ahead and close this and I'll reopen with trace data if it recurs.

justindthomas commented 2 years ago

Reopening to look at this log:

03T19.log

@AlCalzone this is the one I mentioned on the META ticket. There is indeed a little bit more context in that conversation that isn't prefaced by [Node 055] and was missed in that previous grep.

2022-02-03T19:47:43.427Z CNTRLR   [Node 087] The node is alive.
2022-02-03T19:47:43.518Z CNTRLR   [Node 087] The node is ready to be used
2022-02-03T19:47:43.518Z CNTRLR   All nodes are ready to be used
2022-02-03T19:47:43.534Z CNTRLR « [Node 087] ping successful
2022-02-03T19:47:43.538Z SERIAL » 0x01050051373aa6                                                     (7 bytes)
2022-02-03T19:47:43.539Z DRIVER » [Node 055] [REQ] [AssignSUCReturnRoute]
2022-02-03T19:47:43.546Z SERIAL « [ACK]                                                                   (0x06)
2022-02-03T19:47:43.566Z SERIAL « 0x0104015101aa                                                       (6 bytes)
2022-02-03T19:47:43.568Z SERIAL » [ACK]                                                                   (0x06)
2022-02-03T19:47:43.569Z DRIVER « [RES] [AssignSUCReturnRoute]
2022-02-03T19:47:43.575Z SERIAL « 0x012100513a0400000000000000000000000000000000000000000000000000000 (35 bytes)
2022-02-03T19:47:43.576Z SERIAL » [ACK]                                                                   (0x06)
2022-02-03T19:47:43.577Z DRIVER « [REQ] [AssignSUCReturnRoute]
2022-02-03T19:47:43.582Z CNTRLR   [Node 055] The node did not respond, it is presumed dead
2022-02-03T19:47:43.584Z CNTRLR   [Node 055] The node is now dead.
2022-02-03T19:47:46.337Z DRIVER   Usage statistics sent - next transmission scheduled in 23 hours.

This was on startup, so the system was busy pinging all of the nodes. Node 55 was identified as alive in that process, and then dead immediately after.

Unfortunately didn't have my zniffer on, but I'm setting that up now.

EDIT: I ran a network heal with the zniffer running to see if anything died. Nothing did, but that node (55) refused to heal. All the others worked fine. I tried healing it individually and it still failed.

So I re-interviewed it and that seems to have helped. It healed on this last attempt - maybe the server just had bad info stored. I'll keep the zniffer running and report back.

AlCalzone commented 2 years ago

@justindthomas This still isn't a complete log. Many logs are multiline, including the DRIVER and SERIAL ones, so this is missing information. Please no grep, just copy the relevant timeframe or post the entire thing.

justindthomas commented 2 years ago

Ah, yes. I overlooked that. Here you go.

zwavejs_2022-02-03.log

justindthomas commented 2 years ago

That LZW31 (node 55) was in the wrong place in the zwavejs2mqtt list this morning and I noticed that it had a "-" in the "Zwave+" column. So I interviewed it again and it's registering correctly. But now it's refusing to heal.

Here's the current log with that heal attempt.

zwavejs_2022-02-04.log

justindthomas commented 2 years ago

Node 85 (a ZEN32) just died. Here's the updated log.

zwavejs_2022-02-04.log

I did leave the zniffer running and thought I'd caught it, but apparently the system pauses the capture when the screen locks (I don't think it goes to sleep, because I have a HyperV VM running on that host that doesn't stop responding). Anyway, I need to fix that.

AlCalzone commented 2 years ago

So... first things first:

The "dead" in the first log is actually the SUC return route that fails to be assigned. I guess that shouldn't count as "the node is dead" really. Raised https://github.com/zwave-js/node-zwave-js/issues/4190

2022-02-03T19:47:43.577Z DRIVER « [REQ] [AssignSUCReturnRoute]
                                    callback id:     58
                                    transmit status: NoRoute

I interviewed it again and it's registering correctly. But now it's refusing to heal.

Looks like it won't update its neighbors. Any chance you got a Zniffer trace of that? I've raised https://github.com/zwave-js/node-zwave-js/issues/4192 for this, maybe we can at least update routes in that case.

Node 85 (a ZEN32) just died. Here's the updated log.

Oh no!

2022-02-04T02:32:44.834Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:                           98
                                    transmit status:                       Fail, took 0 ms
                                    routing attempts:                      0
                                    protocol & route speed:                Unknown (0x00)

This looks like the 700 series issue all over, but this time it actually looks like it is correct. When this happens (12x in the last log), one or two of your nodes (40, 44, 48, 68, ...) are flooding the network. Although that could just be a symptom - those Zniffer logs would come in really handy this time.

Anyways, I've raised https://github.com/zwave-js/node-zwave-js/issues/4191 to NOT treat this as a dead node, but rather a jammed controller.

justindthomas commented 2 years ago

Thanks. I've restarted my zniffer and adjusted my power settings so hopefully it will be running when that occurs again.

For your reference, those nodes are Inovelli LZW31, LZW30-SN, and Zooz ZSE18 devices. 68 is a Zooz ZSE18 - the flooding may have been when I was waking it up to heal it (it's kind of a pain in the neck to wake up and I have to try numerous times - lots of pushing its single button).

justindthomas commented 2 years ago

I re-read your question again and realize I totally do have the zniffer log for that network heal where node 55 refused to heal. I started it at the beginning of the heal and stopped at the end. Here it is.

02032022-network-heal.zip

That should line up with the data in zwavejs_2022-02-04 starting at 00:13:15 (I'm UTC-8 and the zniffer system is running on local time).

AlCalzone commented 2 years ago

I notice there are quite a few CRC_ERRORs in the zniffer log when the controller tries to get 55's neighbors: grafik

here's 49 for comparison: grafik

The CRC_ERRORs appear frequently, so I'm not sure if that's just the Zniffer picking up some weird signal. Its just that it seems to happen every time the controller tries to directly communicate with 55, so there must be something to it...

justindthomas commented 2 years ago

I have a couple of outdoor ZSE41 and ZSE43 sensors that I suspect might be causing those errors. They were paired to my 500 stick after I migrated back from the 700 and I haven't reset and reincluded them to the 700 stick since upgrading its firmware and putting it back in service.

So they have the right home ID and everything, the stick just doesn't know what to do with them yet. One is physically beyond node 55 and I see now how that might be a problem. I'll reset those today and see if it reduces the CRC errors.

justindthomas commented 2 years ago

Got those ZSE devices sorted out and back in line, but that didn't seem to stem the CRC errors.

It looks like the RSSI values for the CRC error signals are significantly stronger than average. Does that indicate that the source device is quite close to the zniffer?

justindthomas commented 2 years ago

I went a little crazy running around, killing power to anything that looked like it might be involved in those CRC errors (i.e., watching for when they popped up and then turning off any nodes with traffic within a few milliseconds of the error). But the errors don't stop - I suspect they may be an artifact of the Zniffer itself.

AlCalzone commented 2 years ago

It looks like the RSSI values for the CRC error signals are significantly stronger than average

I think its just zeroed out as "no value".

I suspect they may be an artifact of the Zniffer itself.

Might be, can you try a different placement? Maybe on an extension cord?

zwave-js-assistant[bot] commented 1 year ago

This issue has not seen any recent activity and was marked as "stale 💤". Closing for housekeeping purposes... 🧹

Feel free to reopen if the issue persists.

zwave-js / node-zwave-js