zwave-js / node-zwave-js

Z-Wave driver written entirely in JavaScript/TypeScript
https://zwave-js.github.io/node-zwave-js/
MIT License
749 stars 599 forks source link

Aeotec Z-Stick 700 with FW 7.17.2.0 - "Duplicate command" errors after backup restoration #4352

Closed mundschenk-at closed 1 year ago

mundschenk-at commented 2 years ago

Is your problem within Home Assistant (Core or Z-Wave JS Integration)?

NO, my problem is NOT within Home Assistant or the ZWave JS integration

Is your problem within ZWaveJS2MQTT?

NO, my problem is NOT within ZWaveJS2MQTT

Checklist

Describe the bug

Yesterday, after updating the stick to FW 7.17.2.0, I restored a current NVM backup from my 500 series stick. The driver started up fine and the nodes seem to work OK. Unlike previous attempts, there were no dead nodes and no automatic re-interviews.

However, when checking the log files this morning, I noticed that it does contain a lot of "Duplicate command" errors, and not (just) from the Sensative Strips Guard 700. Could this be a side effect of a "Health Check" attempt I made (under the erroneous assumption that certain issues had been fixed since the last time I tried)? I did restart the driver afterwards though, to stop the network flooding.

Device information

Manufacturer: Aeotec Model name: Z-Stick 700 Node ID in your network: 1

How are you using node-zwave-js?

Which branches or versions?

version: node-zwave-js branch: 8.11.6 zwavejs2mqtt branch: 6.5.2 zwavejs2mqtt community addon: 0.36.0

Did you change anything?

yes (please describe)

If yes, what did you change?

Switched back to Z-Stick 700

Did this work before?

Yes (please describe)

If yes, where did it work?

With Z-Stick 5gen+

Attach Driver Logfile

zwavejslog_20220312.zip

zwave-js-bot commented 2 years ago

👋 Hey @mundschenk-at!

It looks like you attached a logfile, but its filename doesn't look like it a driver log that came from Z-Wave JS. Please make sure you upload the correct one.

mundschenk-at commented 2 years ago

It looks like you attached a logfile, but its filename doesn't look like it a driver log.

It is.

AlCalzone commented 2 years ago

For me, the logs look like the nodes don't get the expected acknowledgement from the controller and retry sending. Also there are some instances where the controller can't reach a node while the same node is sending reports to the controller, which would support my suspicion further.

I don't see any mention of you doing a network heal. Even if the backup restores the known routes, the physical differences between sticks may result in incorrect routes being used for attempting communication with the nodes.

mundschenk-at commented 2 years ago

I had not done a network heal when that log file was created, no. I've since started one, takes some time for the battery powered nodes. I'll upload tomorrow's logfile which should be mostly after the healing (only the smoke detector is not done yet, not sure when it's scheduled for its daily wakeup).

mundschenk-at commented 2 years ago

Addendum: According to the network graph, range still appears to be much worse with 700 series (though it's been advertised as having an improved range). I think I mentioned this previously in another ticket, I've a Strips Guard 700 on a window in the same room as the controller, maybe 5 m distance that takes 2 hops with the 700 series and only 1 hop with the 5gen+. This might explain the issues before the network heal (but not why range is so abysmal with the newer stick).

mundschenk-at commented 2 years ago

Here's today's log, @AlCalzone. The network heal finished at 01:00 with the last remaining node (45 - the others were healed before the log turnover), but unfortunately to me it looks very similar (lots of invalid payloads/duplicate commands): zwavejslog_20220313.zip

AlCalzone commented 2 years ago

Not sure if we went through this already - did you follow the instructions for controller placement mentioned here? https://zwave-js.github.io/node-zwave-js/#/troubleshooting/connectivity-issues?id=general-troubleshooting

mundschenk-at commented 2 years ago

We didn't, but I know about it and the stick's attached to a 1,5 m extension cord (above a wooden door frame, so no metal surface nearby either).

AlCalzone commented 2 years ago

Ok. What bugs me a bit are the low RSSI values (-88...-100 dBm), which are probably not much above the background noise, and they gradually get worse. Maybe this is less of a controller placement issue than a mesh strength issue.

You can see that at least node 18 has a stronger connection than most others:

2022-03-13T09:11:21.169Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:            228
                                    transmit status:        OK, took 100 ms
                                    repeater node IDs:      18, 10
                                    routing attempts:       1
                                    protocol & route speed: Z-Wave, 100 kbit/s
                                    ACK RSSI:               -89 dBm
                                                            ^^ measured by the controller
                                    ACK RSSI on repeaters:  -63 dBm, N/A
                                                            ^^       ^^^ measured by node 10
                                                            ^^ measured by node 18
                                    ACK channel no.:        0
                                    TX channel no.:         0

In zwavejs2mqtt, on the network map, you have the ability to test the connection quality to individual nodes. I think this is your best bet now - identify nodes with a bad connectivity (might be the repeaters, not necessarily the "problematic" nodes) and try to optimize their connection.

You may also try to increase the TX power of the controller. At least partially your issue is caused by the nodes not being able to "hear" the controller. Note that this doesn't help if the controller can't "hear" some nodes though.

mundschenk-at commented 2 years ago

How would increase the controller's TX power? The health check on the network map is not working for me (it rapidly swamps the network, similar to https://github.com/zwave-js/node-zwave-js/issues/4130).

This is a relatively small appartment and the non-battery nodes are fairely evenly spaced. Node 18 is in the same direction as Node 10, but the latter is nearer to controller (basically they are located on opposite sites of the same wall: 1 -- wall -- "empty" room -- 10 -- wall -- 18 -- other room).

mundschenk-at commented 2 years ago

For comparison, this is the log from a few days ago using the Gen5+ controller (also Aeotec). Everything is the same except the controller (i.e. same nodes and placement): zwavejslog_20220309.zip

AlCalzone commented 2 years ago

How would increase the controller's TX power?

This is a bit tricky, since no UI exposes it AFAIK. You can do it via code though: grafik

I suggest the following:

  1. Go to the log tab, open log in a new window
  2. Go to the driver function dialog, execute this:
    await driver.controller.getPowerlevel()
  3. Look at the log and remember the two numbers. Should be something like -3.3 and 0 I think - mine are messed up (outside the -12.8...+12.7 dBm range).
  4. execute this:
    await driver.controller.setPowerlevel(
     3, // the normal powerlevel plus 3 (or so, need to test)
     0 // the output power at 0 dBm, unchanged!
    )

I would start with a slight adjustment of the normal powerlevel, e.g. +3 dBm. Note that increasing the powerlevel of a radio device may be against the law, so don't go overboard.

AlCalzone commented 2 years ago

The health check on the network map is not working for me (it rapidly swamps the network

Any chance I could see a log of that?

mundschenk-at commented 2 years ago

I would start with a slight adjustment of the normal powerlevel, e.g. +3 dBm. Note that increasing the powerlevel of a radio device may be against the law, so don't go overboard.

What's normal here? GetPowerlevel returned these values:

                                    command:               GetPowerlevel
                                    normal powerlevel:     3.2 dBm
                                    output power at 0 dBm: 14.8 dBm
mundschenk-at commented 2 years ago

The health check on the network map is not working for me (it rapidly swamps the network

Any chance I could see a log of that?

I've triggered the health check a few minutes ago: zwavejslog_20220314.zip

AlCalzone commented 2 years ago

I've triggered the health check a few minutes ago

Ok that's https://github.com/zwave-js/node-zwave-js/issues/4075

What's normal here? GetPowerlevel returned these values

I'm trying to find that out. Unfortunately my test stick has very strange values there. Will let you know once I've found the defaults

mundschenk-at commented 2 years ago

What's normal here? GetPowerlevel returned these values

I'm trying to find that out. Unfortunately my test stick has very strange values there. Will let you know once I've found the defaults

Is the power level of a Gen5 stick comparable (or interesting)? Then I could switch back to my other stick.

Also, regarding network strength, do you see similar issues in the Gen5 log or is it the 700 series stick? (To me, things seem better with the Gen5 stick, but I'm not that well-versed in reading the logs.)

AlCalzone commented 2 years ago

From what I've seen in the NVM data, the settings for 500 and 700 series sticks aren't comparable.

In your Gen5 log, I don't see RSSI for incoming messages, but the ACKs have RSSI values in the 50s/60s, roughly 30-40 dBm higher than on the 700 series. This could be caused by incorrect settings on the 700 series stick.

Start by setting the TX power to:

normal powerlevel: 0
measured: +3.3

which are the default settings. For good measure also set the RF region to EU, you never know...

await driver.controller.setRFRegion(0)
AlCalzone commented 2 years ago

@mundschenk-at It might be that you're also affected by this bug: https://github.com/zwave-js/node-zwave-js/issues/4367 In that case you won't be able to change the settings on your own without editing the NVM backup first. I can do that for you if you send me the NVM backup via e-mail.

mundschenk-at commented 2 years ago

@mundschenk-at It might be that you're also affected by this bug: #4367 In that case you won't be able to change the settings on your own without editing the NVM backup first. I can do that for you if you send me the NVM backup via e-mail.

Yes it definitely looks that way:

2022-03-15T15:12:04.787Z DRIVER « [RES] [SerialAPISetup]
                                    command: GetRFRegion
                                    region:  Unknown

I'll edit the backup myself, should not be a problem.

AlCalzone commented 2 years ago

Sure, see here for details: https://github.com/zwave-js/node-zwave-js/issues/3906#issuecomment-1068031298

mundschenk-at commented 2 years ago

I tried this: Made a new 700 series backup, converted to JSON and replaced "rfConfig": null with the object from https://github.com/zwave-js/node-zwave-js/issues/3906#issuecomment-1068031298. It worked, in that GetRFRegion returns Europe and the power levels are now 0.0 dBm and 3.3 dBm.

However, the whole network started to be reinterviewed immediately. As a result, several nodes had incomplete information and/or security classes. I manually re-interviewed them (using the "reset security classes" option) and was able to fix most of them (two are missing, one is a smoke detector that I would need a ladder to wake up, the other an Aeotec Water Sensor 6, which is quite buggy when using S0 security - I may have to reinclude that, maybe dropping security this time). Unfortunately, even after this, things are not really good.

I've tried starting a network heal, but that has stalled after just 4 nodes (i.e. not even all mains powered nodes have been healed).

AlCalzone commented 2 years ago

Unfortunately, even after this, things are not really good. I've tried starting a network heal, but that has stalled after just 4 nodes .

Can you post another log so I can see for myself? You don't have to encrypt them, unless you re-include a secure device and/or change user codes while you're logging.

mundschenk-at commented 2 years ago

Unfortunately, even after this, things are not really good. I've tried starting a network heal, but that has stalled after just 4 nodes .

Can you post another log so I can see for myself? You don't have to encrypt them, unless you re-include a secure device and/or change user codes while you're logging.

Driver restarts as well (home ID), no? Since these are pretty long logs (whole day), I feel more comfortable with encrypted uploads.

AlCalzone commented 2 years ago

The home ID is just a number that's transmitted in plaintext via radio anyways. If someone is standing near your network with a Zniffer, they will see it. Encrypted is fine for me though.

mundschenk-at commented 2 years ago

Here it is: zwavejslog_20220315.zip

(Sorry, a bit messy due to repeated re-interview/heal attempts.)

AlCalzone commented 2 years ago

From what I'm seeing in the logs, the heal is slowly progressing. You had a full interview of node 99 going on in parallel, which delays the process quite a bit.

The RSSI values didn't change, but that's expected, since you didn't change anything physical. I guess now you can try this and see if the nodes can "hear" the controller better then.

mundschenk-at commented 2 years ago

The RSSI values didn't change, but that's expected, since you didn't change anything physical. I guess now you can try this and see if the nodes can "hear" the controller better then.

Just to be clear, do you mean I should initially try 3, 0 as the parameters (as in the example), or rather 3, 3.3 (3 added to the previous "normal powerlevel" of 0.0, 0 added to the previous "output power at 0 dBm`)?

AlCalzone commented 2 years ago

I meant 3, 3.3.

mundschenk-at commented 2 years ago

Here's the new log. I think it is somewhat better, but node 10 still appears to be problematic (though I don't know why, it's a FGWP102, though you might consider it a legacy device - I didn't know how to read the specifications properly when ordering my first nodes).

Also, when doing a health check for node 11, node 80 started freaking out completely. This one is a bit remote (at the bottom of a cupboard at the other end of the flat), but there are plenty of repeaters in between if necessary.

zwavejslog_20220315a.zip

AlCalzone commented 2 years ago

Hm you're still running into the infinite healthcheck loop. Probably best if you wait for the next zwavejs2mqtt beta release, this will include a few fixes for this.

Not sure what's going on with Node 80 though. That is excessive and might warrant a bug report with the manufacturer. Such spamming can bring down the entire network.

mundschenk-at commented 2 years ago

Yes, but some health checks were successful (i.e. for nodes 18, 70, 100). Any improvements visible over the previous logs (signal strength-wise)? I have a hard time reading/comparing those.

Node 80 is an Aeotec smart plug set to report pretty frequently so that I know when a relative low-power device is on. I've never seen it spam the network this way, though. To me it appears to start with 2022-03-15T21:36:21.864Z CNTRLR « [Node 080] Beginning Transport Service RX session #14..., but I don't see what's triggering it.

AlCalzone commented 2 years ago

Yes, but some health checks were successful (i.e. for nodes 18, 70, 100). Any improvements visible over the previous logs (signal strength-wise)? I have a hard time reading/comparing those.

We're not going to see improvements in the signal strength. Making the stick "shout louder" doesn't make it "hear better". My hope was that the higher power leads to the acknowledgements reaching the nodes, so they stop re-transmitting the messages over and over.

The health checks are meant as a tool for you to identify nodes with problematic connectivity and potentially re-arrange a few nodes to improve on that. The next zwavejs2mqtt beta release will include fixes for the infinite loops and also show the actual routes used when you click on a node in the network map.

mundschenk-at commented 2 years ago

Yes, but some health checks were successful (i.e. for nodes 18, 70, 100). Any improvements visible over the previous logs (signal strength-wise)? I have a hard time reading/comparing those.

We're not going to see improvements in the signal strength. Making the stick "shout louder" doesn't make it "hear better". My hope was that the higher power leads to the acknowledgements reaching the nodes, so they stop re-transmitting the messages over and over.

Sorry, I worded that wrong. But did we see such an improvement? Otherwise, it appears that Z-Stick 7's antenna design (or something) is simply suboptimal compared to the Z-Stick Gen5+.

AlCalzone commented 2 years ago

I don't think its much better. The routing reports almost always indicate the same nodes though where the route fails: 18 -> 69, 69 -> 80 and 18 -> 80. Might be worth looking into.

mundschenk-at commented 2 years ago

That's weird. 18 and 69 are in completely different rooms with lots of stuff in between (walls and kitchen appliances) and no way the best routes to take. Same for 69 and 80.

18 -> 80 is somewhat legit (same room), although "80" is the leaf node, location-wise. The arrow indicates the direction, i.e. 18 hops to 80 which should hop to the controller? Or is it the other way around?

AlCalzone commented 2 years ago

Or is it the other way around?

It is. These reports indicate the routing from the controller to the nodes.

18 and 69 are in completely different rooms with lots of stuff in between (walls and kitchen appliances) and no way the best routes to take. Same for 69 and 80.

Try doing single node heals for those (and other potential nodes along the routes), step by step. Maybe this helps the controller figure out better routes.

mundschenk-at commented 2 years ago

I've tried healing individual nodes in between, but the routes have not become any better as far as I can see. Yesterday, I switched back to the 500 series stick, which seems to work much better overall. (I am helping Sensative debug the Strips Guard 700 firmware, and I don't want to change too many variables at the same time). I'll raise a ticket with Aeotec what's going on with their 700 series sticks.

Just so that I read the logs correctly, the ACK RSSI in this example is what the controller receives from the repeater node (70), and ACKK RSSI on repeaters is what the repeater node received from the target node?

2022-03-18T01:37:12.129Z DRIVER « [REQ] [SendData]
                                    callback id:            61
                                    transmit status:        OK, took 90 ms
                                    repeater node IDs:      70
                                    routing attempts:       1
                                    protocol & route speed: Z-Wave, 40 kbit/s
                                    ACK RSSI:               -68 dBm
                                    ACK RSSI on repeaters:  -91 dBm
                                    ACK channel no.:        1
                                    TX channel no.:         1
AlCalzone commented 2 years ago

That's almost correct. For the example you showed it is, but if there are multiple repeaters, you're going to have multiple measurements in the "on repeaters" line. E.g. if the route is (1 -> 70 -> 71 -> 68), then you'll see (for example)

repeater node IDs:      70, 71
ACK RSSI:               -68 dBm
ACK RSSI on repeaters:  -91 dBm, -88 dBm

where -88 is the RSSI of the ACK that 71 received from 68 and -91 is the RSSI of the ACK that 70 received from 71 and -68 is the RSSI of the ACK that the controller received from 70

AlCalzone commented 2 years ago

I'll raise a ticket with Aeotec what's going on with their 700 series sticks.

Let me know what they say.

mundschenk-at commented 2 years ago

Small follow-up question: What's a ”reasonable” RSSI? -91 dBm is pretty low, too low? (As in: one should do something about it, like adding a repeater at another location.)

AlCalzone commented 2 years ago

AFAIK this depends on the background RSSI. The higher that is, the higher the RSSI should be. The lifeline health check is modeled after what Silabs' PC Controller does and they recommend an RSSI margin (<RSSI> - <background RSSI>) of 17 dBm or more.

If you're interested in the background RSSI at the controller (it can't be measured at nodes), you can execute a driver function and watch the logs:

await driver.controller.getBackgroundRSSI()
mundschenk-at commented 2 years ago

Is there a driver command to dump the routing tables? Aeotec asked if I could provide those.

AlCalzone commented 2 years ago

The current state is included in the NVM backup. Make a fresh one, then convert it to json and copy all entries for lwr and nlwr.

zwave-js-assistant[bot] commented 2 years ago

This issue has not seen any recent activity and was marked as "stale 💤". Closing for housekeeping purposes... 🧹

Feel free to reopen if the issue persists.

mundschenk-at commented 2 years ago

I am still waiting on meaningful response from Aeotec. Their first level support has made some suggestions for a test setup, but it's not practical investment wise and I don't think they have looked at the provided logfiles at all.

mundschenk-at commented 2 years ago

@AlCalzone At Aeotec's advice I've created a test setup with just the 700 controller and a single node. Today I've run health checks with the node in increasing distance from the controller (in 1 m intervals). When I couldn't do straight line-of-sight anymore due to space constraints, I put the node about a further 2 m away, with a thin wall and some furniture (speakers) in between. (That's the measurement from 2022-04-23T09:19:45.952Z). Then I made an NVM backup to have a look at some of the internal settings after the factory reset.

Some time later I did a health check again without moving either the controller or the node. The results were pretty weird, as GetRoutingInfo now returned loads of neighbors (in a network that never had more than two nodes, and with node numbers not corresponding to any nodes of the ”real” network running on the gen5+ stick either). Is this the controller malfunctioning completely? Some weird interaction because of the backup?

zwavejs_2022-04-23_weird_routing.log

I'm running the test environment on an RPi4 with docker (node-zwave-js 9.0.4).

AlCalzone commented 2 years ago

The weird neighbors look like random data from the controller memory IMO. This information is stored as a bitmask and maybe the controller is reading the data from a weird location with uninitialized/random memory.

I'm not sure why a backup would cause this though, since it is just reading.

mundschenk-at commented 2 years ago

Maybe it was not the backup but an incomplete reset (since GetRFRegion returned unknown)?

AlCalzone commented 2 years ago

Could be... if you convert the JSON I sent you back to a binary NVM file and restore that, does GetRFRegion return something useful again?