switch turned ON by service switch.turn_OFF

knwpsk commented 2 years ago

Is your problem within Home Assistant (Core or Z-Wave JS Integration)?

NO, my problem is NOT within Home Assistant or the ZWave JS integration

Is your problem within ZWaveJS2MQTT?

NO, my problem is NOT within ZWaveJS2MQTT

Checklist

[X] I have checked the troubleshooting section and my problem is not described there.
[X] I have read the changelog and my problem was not mentioned there.

Describe the bug

When I issue a command to turn_off a switch, the switch (light) doesn't turn off. When I issue a command to turn_on, it doesn't turn on. In the logs, when it receives a turn_off command, it logs turning off and then ON again... and vice versa.

I opened a ticket in HA Core https://github.com/home-assistant/core/issues/77051 and they looked at the logs, and recommended opening it here instead.

This is happening specifically with three Minoston zwave outdoor plugs, and no other devices. All three plugs have been on my network for months and working just fine; and all three started misbehaving around the same time, about a week to 10 days ago(?). The bug doesn't happen every time, but anecdotally it seems like about 50% of the time.

I can easily reproduce it simply by clicking the on/off switch in the Device page in HA. When I click On, right there on the Device page it says: Switch1 was turned ON by service switch.turn_ON Switch1 was turned OFF by service switch.turn_ON and vice versa.

I tried switching from zwaveJS to zwaveJSMQTT (I wanted to do that anyway), and the problem persists.

I noticed that this same issue was reported a couple of years ago here: https://github.com/home-assistant/core/issues/42798 In the old ticket, it was another device/manufacturer. In my case, the "switches" are Minoston zwave outdoor smart plugs. https://minoston.com/product/z-wave-plus-smart-plug-outdoor-on-off-outlet-switch-mp22z/ I wrote to Minoston several days ago for any info, but no response yet. (But like I said, they worked fine for months, and I have not updated their firmware.)

In the attached log, Node 26 is the switch that I used to recreate the problem.

THANK YOU

Device information

Minoston zwave outdoor smart plugs. https://minoston.com/product/z-wave-plus-smart-plug-outdoor-on-off-outlet-switch-mp22z/

How are you using `node-zwave-js`?

[X] zwavejs2mqtt Docker image (latest)
[ ] zwavejs2mqtt Docker image (dev)
[ ] zwavejs2mqtt Docker manually built (please specify branches)
[ ] ioBroker.zwave2 adapter (please specify version)
[ ] HomeAssistant zwave_js integration (please specify version)
[ ] pkg
[ ] node-red-contrib-zwave-js (please specify version, double click node to find out)
[ ] Manually built from GitHub (please specify branch)
[ ] Other (please describe)

Which branches or versions?

zwavejs2mqtt: 6.15.2 zwave-js: 9.6.2

Did you change anything?

no

If yes, what did you change?

No response

Did this work before?

Yes (please describe)

If yes, where did it work?

Worked fine for months, started acting stupid just 7-10 days ago, and happens every day now.

Attach Driver Logfile

zwavejs_2022-08-19.log

libby2cm commented 2 years ago

I have been facing this same issue for awhile as well. Random devices, whether triggered via automations, scenes, manually through HA entity or via zwavejs2mqtt control panel, will frequently change state, then revert a second later. If you try again, the operation succeeds. I suspect it may have something to do with the optimistic update logic (maybe a race condition?), but I don't have any data to support that. No failures of operations shows in logs, it appears as if I had initiated a toggle on the entity.

I can't easily reproduce it, but a period of idle time seems to potentially be a factor as well. Happy to provide more information, I just hadn't opened an issue yet because I didn't have any good data. Interested in theories though.

AlCalzone commented 2 years ago

@knwpsk I'm not sure what exactly is going on with the device, but it seems to frequently forget the shared S2 encryption state:

node -> zwave-js: Here's my status, did you get it?
zwave-js -> node: I got it.
(radio silence, meaning node understood the response)

seconds later, no:
zwave-js -> node: Turn on
node -> zwave-js: I don't understand.

The radio silence could also mean that the response, or the node's "I don't understand", was lost.

I expect this to become better in the upcoming v10 release where I overhauled the S2&Supervision communication. For now you can try eliminating connectivity issues as the source of the issue: https://zwave-js.github.io/node-zwave-js/#/troubleshooting/connectivity-issues?id=general-troubleshooting and https://zwave-js.github.io/node-zwave-js/#/troubleshooting/network-health?id=testing-the-connection-strength

knwpsk commented 2 years ago

@AlCalzone thanks for this. Pls make sure you see the part where this is happening on three devices of same make/model, and that they all started doing this around the same time, and that there was no firmware change. It suggests that there was a change in HA (they say no) or the zwave driver (??) that triggered the problem. Based on nothing at all, just gut, I'll assume that the Minoston devices were doing something odd all along, but that some recent change on "our" side (HA or driver?) became less tolerant of it...??

Connectivity Issues page: I read the connectivity issues page, and none of that seems likely. Already have my stick on an extension cord, several feet away from my rack and other equipment; and none of my 20+ other devices are acting up. Two of the three Minoston devices are within a few feet of several more zwave nodes. The third is a bit farther away from the core of my network, but it's also still only a few feet from one other node.

Network health... This is interesting. I have hurricane-proof glass windows. I have wondered if they cause any kind of radio issues (but there are a number of other things going on that suggest not). Also have "lots of stuff" using radio in various ranges, so I wonder if any kind of signal interference is relevant. I just switched from native zwavejs to zwave-mqtt, so I will definitely try out the health check and see what that offers. If I find anything I'll post here. (Thanks!)

AlCalzone commented 2 years ago

The only thing that did change in the driver in this regard was on July 5th (9.6.0), where the driver stopped requesting an acknowledgement for supervision reports. So we no longer notice if the devices didn't get this report. We didn't do anything with this information before, but this could block the send queue for several seconds if there were issues reaching the devices. The change shouldn't have an though, unless the devices ignore these commands.

knwpsk commented 2 years ago

Making sure I understand what you described: Before the July 5 change, if the driver didn't get an ack from the device, that could block the send queue for a few seconds. After the change, the driver no longer stopped looking for an ack, so the block should no longer happen. So, logically, that change couldn't have been what started my problem; if anything, that change would have fixed my problem (if it was related at all). Is that your point?

(I have a pretty thorough understanding of messaging protocols in general, but very little experience with zwave, and I'm fairly new to HA.)

Also.... do "you guys" have any contacts at Minoston? Seems like it would be nice to get their input here. They didn't respond to my helpdesk email.

AlCalzone commented 2 years ago

Yeah that's my point. That change is the only change that could be somewhat relevant, but I wouldn't expect any effect like you describe. I wonder if it still reproduces when you downgrade to whatever the setup was before this started happening.

I don't have a contact at Minoston sadly.

knwpsk commented 2 years ago

I tried to run HealthCheck the way it's described here, https://zwave-js.github.io/zwavejs2mqtt/#/guide/healthcheck but I get Connection Refused. I'm groping here -- any tips?

I tried running from Console on the HA; and I tried running it from command prompt on a local PC. Same both times.

AlCalzone commented 2 years ago

Wrong health check :) I don't think there's user documentation for the one you should use though. Go on the network map in z2m, click the node you want to check. A gray panel opens with some routing information and a button "check health". Click that one.

knwpsk commented 2 years ago

OK would be great if the "improving network health" page actually gave those instructions! (Thanks again)

Ran health check on Plug2 (the one that most often exhibits the problem), tested it connecting to Node1, my control stick, because the map says it uses a direct connection to the stick. Got some iffy results, attached.

There are several other nodes that are closer to Plug2, in between it and the stick. Is there a way to tell it to try using those instead? I wonder if that would help it. (I've healed the network several times in the past.)

healthCheck_26-1.zip

knwpsk commented 2 years ago

Also tested Plug2 (Node26) to Node23 (a peer, closer than the controller stick), and it gets 10 out of 10 on every test.

knwpsk commented 2 years ago

Ran full network Heal. Now the map no longer shows Plug2 (node26) with a direct connection to control stick (node1). Ran Health Check from 26 to 1, and now it all comes up 9 out of 10. Hoping this helps. Will play with it and see.

[update] Uh, no. Healing didn't fix it. First time I tried to turn on plug2 from HA, it came on ok. Second step, clicked to turn off. It failed to turn off, bounced back to on, and then went "unavailable". Meanwhile plug1 also failed to turn off, stayed on, but stayed Available at least.

AlCalzone commented 2 years ago

It failed to turn off, bounced back to on, and then went "unavailable".

Got a driver log of this? This smells a lot like physical connectivity issues though.

knwpsk commented 2 years ago

Didn't log it first time, but just did it again and recreated it.

This time I used an Automation triggered by a wall switch, to turn on (or off) Plug1 and Plug2 at the same time. Both plugs acted ok for several on/off iterations, and then Plug2 failed the last time.

Log attached. Plug2 is Node26; Plug1 is Node25. zwavejs_2022-08-22.log

AlCalzone commented 2 years ago

Please disable any log filters and the the loglevel to "debug". Then try again. The log you posted is missing a lot of information.

knwpsk commented 2 years ago

Dangit. OK will re-do in a little while. Wife just yelled at me for turning the lights on/off from remote... :)

knwpsk commented 2 years ago

OK ran it again, with proper logging this time. Simply: Started with Plug 2 (Node 26) off, clicked the On switch on lovelace. The plug turned on, then off again. Log attached.

zwavejs_2022-08-23.log

(BTW I just updated HA to 2.6.)

AlCalzone commented 2 years ago

Ok, so part of this may see improvement in the v10 release, but something is really not right in your network:

2022-08-23T13:17:48.502Z SERIAL « 0x011c00a800011a139f038f0076e769c69070706d0ac69760b37bb800be66      (30 bytes)
2022-08-23T13:17:48.505Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:48.506Z DRIVER « [Node 026] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -66 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Security2CC_CannotDecode
2022-08-23T13:17:48.507Z CNTRLR » [Node 026] Message authentication failed, cannot decode command. Requesting a 
                                  nonce...
2022-08-23T13:17:48.585Z SERIAL « 0x011c00a800011a139f038f0076e769c69070706d0ac69760b37bb800be66      (30 bytes)
2022-08-23T13:17:48.587Z DRIVER   Dropping message with invalid payload
2022-08-23T13:17:48.587Z DRIVER « [Node 026] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -66 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command
2022-08-23T13:17:48.587Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:48.669Z SERIAL « 0x011c00a800011a139f038f0076e769c69070706d0ac69760b37bb800be66      (30 bytes)
2022-08-23T13:17:48.670Z DRIVER   Dropping message with invalid payload
2022-08-23T13:17:48.671Z DRIVER « [Node 026] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -66 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command
2022-08-23T13:17:48.671Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:49.069Z SERIAL « 0x011c00a800011a139f038f0076e769c69070706d0ac69760b37bb800bc64      (30 bytes)
2022-08-23T13:17:49.071Z DRIVER   Dropping message with invalid payload
2022-08-23T13:17:49.072Z DRIVER « [Node 026] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -68 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command
2022-08-23T13:17:49.072Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:49.205Z SERIAL « 0x011c00a800011a139f038f0076e769c69070706d0ac69760b37bb800ab73      (30 bytes)
2022-08-23T13:17:49.206Z DRIVER   Dropping message with invalid payload
2022-08-23T13:17:49.207Z DRIVER « [Node 026] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -85 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command
2022-08-23T13:17:49.207Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:49.464Z SERIAL « 0x011c00a800011a139f038f0076e769c69070706d0ac69760b37bb800ab73      (30 bytes)
2022-08-23T13:17:49.465Z DRIVER   Dropping message with invalid payload
2022-08-23T13:17:49.465Z DRIVER « [Node 026] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -85 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command
2022-08-23T13:17:49.466Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:49.512Z SERIAL « 0x011c00a800011a139f038f0076e769c69070706d0ac69760b37bb800bc64      (30 bytes)
2022-08-23T13:17:49.513Z DRIVER   Dropping message with invalid payload
2022-08-23T13:17:49.513Z DRIVER « [Node 026] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -68 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command
2022-08-23T13:17:49.513Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:49.619Z SERIAL « 0x011c00a800011a139f038f0076e769c69070706d0ac69760b37bb800ad75      (30 bytes)
2022-08-23T13:17:49.620Z DRIVER   Dropping message with invalid payload
2022-08-23T13:17:49.621Z DRIVER « [Node 026] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -83 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command
2022-08-23T13:17:49.621Z SERIAL » [ACK]                                                                   (0x06)

This flood of messages seems to happen almost every time you control the node. I usually see this when there is interference or some other connectivity issue - not necessarily with the node itself, this can also be caused by repeaters. This is the same command, re-transmitted over and over. Only the RSSI changes, but if you look closely, this wildly fluctuates between -66 and -85 dBm.

Interestingly, the same happens with other nodes too:

2022-08-23T13:17:51.420Z SERIAL « 0x010c00a80001080326030000b2c6                                      (14 bytes)
2022-08-23T13:17:51.422Z CNTRLR   [Node 008] [~] [Multilevel Switch] currentValue: 0 => 0           [Endpoint 0]
2022-08-23T13:17:51.425Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:51.429Z DRIVER « [Node 008] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -78 dBm
                                  └─[MultilevelSwitchCCReport]
                                      current value: 0
2022-08-23T13:17:51.480Z SERIAL « 0x010c00a80001080326030000b3c7                                      (14 bytes)
2022-08-23T13:17:51.481Z CNTRLR   [Node 008] [~] [Multilevel Switch] currentValue: 0 => 0           [Endpoint 0]
2022-08-23T13:17:51.485Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:51.487Z DRIVER « [Node 008] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -77 dBm
                                  └─[MultilevelSwitchCCReport]
                                      current value: 0
2022-08-23T13:17:51.576Z SERIAL « 0x010c00a80001080326030000b1c5                                      (14 bytes)
2022-08-23T13:17:51.578Z CNTRLR   [Node 008] [~] [Multilevel Switch] currentValue: 0 => 0           [Endpoint 0]
2022-08-23T13:17:51.580Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:51.581Z DRIVER « [Node 008] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -79 dBm
                                  └─[MultilevelSwitchCCReport]
                                      current value: 0
2022-08-23T13:17:51.605Z SERIAL « 0x010c00a80001080326030000b0c4                                      (14 bytes)
2022-08-23T13:17:51.606Z CNTRLR   [Node 008] [~] [Multilevel Switch] currentValue: 0 => 0           [Endpoint 0]
2022-08-23T13:17:51.608Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:51.609Z DRIVER « [Node 008] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -80 dBm
                                  └─[MultilevelSwitchCCReport]
                                      current value: 0
2022-08-23T13:17:51.692Z SERIAL « 0x010c00a80001080326030000b1c5                                      (14 bytes)
2022-08-23T13:17:51.693Z CNTRLR   [Node 008] [~] [Multilevel Switch] currentValue: 0 => 0           [Endpoint 0]
2022-08-23T13:17:51.695Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T13:17:51.696Z DRIVER « [Node 008] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -79 dBm
                                  └─[MultilevelSwitchCCReport]
                                      current value: 0

Sending messages to Node 26 frequently takes more than one attempt:

2022-08-23T13:17:52.661Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:            33
                                    transmit status:        OK, took 1290 ms
                                    repeater node IDs:      49
                                    routing attempts:       2
                                    protocol & route speed: Z-Wave, 100 kbit/s
                                    ACK RSSI:               -86 dBm
                                    ACK RSSI on repeaters:  -81 dBm

...

2022-08-23T13:17:53.936Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:            34
                                    transmit status:        OK, took 540 ms
                                    repeater node IDs:      43, 25
                                    routing attempts:       4
                                    protocol & route speed: Z-Wave, 40 kbit/s
                                    ACK RSSI:               -66 dBm
                                    ACK RSSI on repeaters:  -61 dBm, -66 dBm
                                    ACK channel no.:        1
                                    TX channel no.:         1

...

2022-08-23T13:17:54.655Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:            35
                                    transmit status:        OK, took 600 ms
                                    repeater node IDs:      43, 25
                                    routing attempts:       3
                                    protocol & route speed: Z-Wave, 40 kbit/s
                                    ACK RSSI:               -66 dBm
                                    ACK RSSI on repeaters:  -63 dBm, -66 dBm
                                    ACK channel no.:        1
                                    TX channel no.:         1

...
etc

so it is very likely that the node has similar troubles reaching the controller. I'd suggest taking a good look at the used repeaters nodes here.

Luckily your controller and nodes offer a lot of insight, as long as you are reading the driver logs. Here's an example trying to reach node 7:

transmit status:        OK, took 70 ms
repeater node IDs:      24, 27, 52
routing attempts:       1
protocol & route speed: Z-Wave, 40 kbit/s
ACK RSSI:               -68 dBm
ACK RSSI on repeaters:  -94 dBm, No signal detected, -66 dBm

70ms with 3 hops and 1 attempt is a good number, but if you correlate the repeaters node IDs and the ACK RSSI on repeaters, you'll see that the controller and node 52 have almost the same RSSI. Node 24's value is really low (probably very close to the background noise) and 27 didn't even detect signal.

I'd go through the logs, search for OK, took to get these reports and check if any have more than 1 routing attempt and which repeaters are taken with which RSSI. If any stand out, physically moving them or adding more repeaters around them may help.

You may also see route failed between: X -> Y which can give you an indication of bad links, but I don't see that right now.

knwpsk commented 2 years ago

OK super great help, thank you! On gut, I agree with you that this feels like some kind of interference or "network storm" kind of problem. I just needed this kind of help to learn to troubleshoot. Will be back when I have some time. FYI going on vacation for a couple of weeks starting Friday, so it may take me a bit to use all your good advice and progress this.

AlCalzone commented 2 years ago

Have a nice vacation then, and feel free to reopen this issue if the bot closes it in the meantime due to lack of activity.

knwpsk commented 2 years ago

Hey @AlCalzone, in the meantime -- what will I do when I further digest these logs?

For example: You pointed out that Node 24 has low signal strength. That's a keypad in my master bedroom. It's pretty far from the outdoor plug (Plug 2, Node 26) we've been talking about. I can't fathom why Plug 2 would try to use that keypad as a relay -- especially when there are several nodes much closer to it, and in a more direct path between it and my controller. So, is there a way I can tell Plug2/Node 26: "Please stop trying to use Node 24 as a relay" ? Or, "Node 26, you are only allowed to use these following nodes as relays" ?

(edit: I updated this a couple of times as I confirmed the names and locations of some of my devices! When I switched to mqtt I lost some of that...)

knwpsk commented 2 years ago

Making notes for later ref.

I did a whole bunch of cleanup this afternoon. I excluded and re-included 7-8 devices that were not secure, and got them all on S2 authentication. Then I cleaned up everything that was logging errors (broken automations etc, mostly resulting from either my convertion to mqtt or my exclude/include cleanup). Gave proper names to all the zwave devices so I can identify them easily for tshooting.

So far my zwave outdoor plugs "seem" to be acting better. Will give it some time and come back here with any updates.

AlCalzone commented 2 years ago

I can't fathom why Plug 2 would try to use that keypad as a relay

Z-Wave route finding is a bit weird. If a working route is found, the node will keep using it until it doesn't. It then tries its next-to-last-working route until it falls back to finding a new route, typically using explorer frames (which find a way to the destination, not necessarily a good one). Looking at the amount of re-routing required, the latter probably happened a couple of times and ended up with the route via 24.

In a strong mesh, you shouldn't end up with these weird routes, and the routing mechanisms should be able to find a decently good route to use automatically. It is possible to set a priority route which is always tried first, but this isn't implemented in Z-Wave JS yet.

knwpsk commented 2 years ago

If I "heal the network", that tells the controller and nodes to re-evaluate all of those mappings/routes, right? Or do I need to do something more fundamental than that to get my network to a state of 'strong mesh" ?

What you're saying makes a lot of sense. I'm new to HA and zwave, and have been learning as I go. I've installed and uninstalled and re-installed a bunch of stuff multiple times as I figure it all out -- not just learning the tech, but also learning how I want to use it, and then observing the weaknesses in my house geography. So... I'm sure there are a number of "bad routes" buried in my devices' knowledge.

knwpsk commented 2 years ago

More curious-ness. I just started a network heal. Already I see some strange stuff. Pls see the end of this log: there seem to be multiple nodes that don't reply to the controller's request for a list of neighbors after 5 attempts, and then report "healing failed."

But then, those same devices do show a list of nodes in "get routing info"-- but the list of node numbers looks crazy! I only have about 25 devices; and the highest node id is 75. So what are all those other numbers? zwavejs_2022-08-24.zip

AlCalzone commented 2 years ago

If I "heal the network", that tells the controller and nodes to re-evaluate all of those mappings/routes, right?

Correct.

Or do I need to do something more fundamental than that to get my network to a state of 'strong mesh" ?

That might require physical layout changes, to improve the physical connection strength between neighbors and reduce interference / wireless reflections near the nodes.

but the list of node numbers looks crazy!

That list comes from the controller memory. I've seen instances before where this was messed up. Refreshing the list of neighbors from nodes should be able to correct that, but as you noticed, those are failing. Something is definitely fishy there - disabling and enabling the radio before/after showing the network map didn't work either. Try doing this:

shut down z2m
pull the stick
wait 5 seconds
put it back in
start z2m again
retry the heal

Might also be worth to try out a different USB extension cable - not that yours is bad and causes a heap of issues.

knwpsk commented 2 years ago

Will do... and I'm gonna send this log to Zooz, too (it's their brand controller stick)

knwpsk commented 2 years ago

Wow... I did all that, and the results look way, way different! No more fails in the heal process. :)

Also, eyeballing the node map in z2m, the pathways look much more sensible now.

Here's hoping that the original issue is gone now! Thanks again for all your patience and help @AlCalzone

zwavejs_2022-08-24 1429pm.zip

zwave-js-assistant[bot] commented 2 years ago

This issue has not seen any recent activity and was marked as "stale 💤". Closing for housekeeping purposes... 🧹

Feel free to reopen if the issue persists.

zwave-js / node-zwave-js