zwave-js / node-zwave-js

Z-Wave driver written entirely in JavaScript/TypeScript
https://zwave-js.github.io/node-zwave-js/
MIT License

Large amounts of lag when turning on/off multiple devices at once #5331

Closed ColColonCleaner closed 1 year ago

ColColonCleaner commented 1 year ago

Is your problem within Home Assistant (Core or Z-Wave JS Integration)?

NO, my problem is NOT within Home Assistant or the ZWave JS integration

Is your problem within Z-Wave JS UI (formerly ZwaveJS2MQTT)?

NO, my problem is NOT within Z-Wave JS UI

Checklist

Describe the bug

What causes the bug? Turning on/off multiple devices at once, or communicating with many devices at once.

What do you observe? Around 5 of the devices respond instantly, then 5 more over a couple of seconds. Then there is a very large gap in time, sometimes upwards of 30 seconds, after which the rest of the devices (roughly 10 remaining) slowly trickle in and update as well. Toggling a single device is instant and very snappy, but performance falls off a cliff as the number of devices increases. Anything more than a couple of devices is horrible.

What did you expect to happen? I expected them all to toggle immediately. Z-Wave claims to support 200+ devices on a network, so I wouldn't expect toggling 1/10th of that at once to be an issue.

Steps to reproduce the behavior: Create a script in HA that uses entity_id 'all' for the light.turn_off service, which grabs all the 'light' entities and turns them off.

service: light.turn_off
data: {}
target:
  entity_id: all

Execute that when you have 20+ lights in the zwave network.

Device information

Manufacturer: Aeotec
Model name: Z-Stick 7
Node ID in your network: 1
Running the latest 7.18.1 firmware.

How are you using node-zwave-js?

Which branches or versions?

node-zwave-js version: 10.3.1
zwave-js-ui version: 8.6.2
Home Assistant: 2023.1.0
Z-Wave JS Integration: 1.24.1

Did you change anything?

yes (please describe)

If yes, what did you change?

Added more devices to the network, or toggled more devices at once. The lag also shows up when toggling smaller numbers of devices more often. For example, I have 3 devices in the kitchen associated together in Z-Wave. If I turn the main switch on, they all come on with minimal delay, but if I then toggle it off again shortly afterward, there are a few seconds of lag before the rest of the nodes respond, and subsequent toggles are worse still.

Did this work before?

No, it never worked anywhere

If yes, where did it work?

No response

Attach Driver Logfile

The incident to focus on occurs at 23:36:48 in the zniffer logs, 05:36:48 in the Z-Wave logs.
Z-Wave logs: zwavejs_2023-01-13.log
Zniffer logs: zniffer_lag.zip
Screenshot of Home Assistant logbook during the event: https://i.imgur.com/xGxodsC.png
Position of the hub in my house: https://i.imgur.com/LyFc8Rk.jpg
It's mounted in the top of a closet, with 12 powered nodes within 15ft of it, on a USB 2.0 (intentionally not 3.0+) extension connected to an Odyssey Blue mini PC running off PoE.

ColColonCleaner commented 1 year ago

I've gone through all the bullet points in the troubleshooting guide, and have already tried tuning according to those. Many network heals, individual heals, node health checks, reducing reporting. Things seem really solid on an individual node level. All the SNR margins are good, getting nice healthy reports from all the powered nodes. But still, interacting with them at anything faster than a snail's pace results in massive lag spikes, it's nuts. I really hope there's a way to make this local system very responsive because if it takes 30 seconds to turn all the lights off in the house I'm losing one of the major benefits of having a local controller.

ColColonCleaner commented 1 year ago

A hacky way I've avoided some of this is using Home Assistant's scripts to disperse commands over a longer period of time so Z-Wave can keep up. One example is someone saying 'all lights off' to one of the Google speakers. If I hijack that command and reroute it to a script that loops over each light with a delay between them, then this congestion and crashing doesn't happen, and the network can complete all the commands relatively quickly (10 seconds for all of them instead of 30-45).

What I'm wondering is, if z-wave as a whole can't handle more than a couple requests per second, wouldn't it be good to have a rate limiter available that we could configure to avoid this happening altogether? An option in the JS controller akin to 'queue incoming commands and increase the delay between commands by X milliseconds for each new one, up to Y milliseconds limit, and reset the delay to zero after Z seconds of no incoming commands' would be a really nice thing to have so we can avoid this congestion without hacking it together with automations.
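A minimal sketch of that backoff scheme, for illustration only (the class name and options are hypothetical; no such option exists in zwave-js today): each incoming command grows the inter-command delay by X milliseconds up to a cap of Y, and Z milliseconds of silence resets it to zero.

```typescript
// Hypothetical adaptive rate limiter as proposed above; not a real zwave-js option.
class AdaptiveRateLimiter {
  private delayMs = 0;
  private lastCommandAt = 0;

  constructor(
    private readonly stepMs: number, // X: delay added per incoming command
    private readonly maxMs: number, // Y: upper limit for the delay
    private readonly idleResetMs: number, // Z: idle time after which the delay resets
  ) {}

  /** Returns how long to wait before dispatching a command arriving at `now` (ms). */
  public nextDelay(now: number): number {
    if (now - this.lastCommandAt >= this.idleResetMs) {
      this.delayMs = 0; // the network has been quiet, start over
    }
    const wait = this.delayMs;
    this.delayMs = Math.min(this.delayMs + this.stepMs, this.maxMs);
    this.lastCommandAt = now;
    return wait;
  }
}

// Example with X = 250ms, Y = 1000ms, Z = 5000ms:
const limiter = new AdaptiveRateLimiter(250, 1000, 5000);
console.log(limiter.nextDelay(0)); // 0 - first command goes out immediately
console.log(limiter.nextDelay(10)); // 250
console.log(limiter.nextDelay(20)); // 500
console.log(limiter.nextDelay(6000)); // 0 - more than 5s idle, so the delay was reset
```

A single command after a quiet period still goes out immediately; only sustained bursts get spaced out.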

I really hope there is a way to avoid this altogether, because over the last few months of trying to implement this, my impression of Z-Wave has been that it's very fragile even in the best environment, and I've spent many frustrating evenings trying to debug things.

AlCalzone commented 1 year ago

Your logfile is on loglevel "silly", not "debug" as requested by the issue template. Any chance you can redo just the driver log with the correct loglevel? "silly" adds so much noise, this becomes almost impossible to read.

AlCalzone commented 1 year ago

Nevermind, I've truncated the log with some RegEx magic: zwavejs_2023-01-13.log

ColColonCleaner commented 1 year ago

Apologies. I'll change the log level to Debug for future posts. Thank you for taking a look at this!

If you want me to send a fresh reproduction of the issue with new logs and zniffer i can do that as well.

EDIT: I saw that Debug level was requested, with a note that other levels didn't provide enough information. I was already looking at the logs on Silly, and since that has more info than Debug, I thought it was good enough. My bad.

AlCalzone commented 1 year ago

So, here are some of my observations:

Most (if not all) of these commands are sent with Supervision, which means the communication with devices is supposed to look like this:

  1. Controller -> Node: Turn off and tell me if it worked
  2. Node -> Controller: it worked
2023-01-14T05:36:48.751Z DRIVER » [Node 074] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ callback id:      102
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 68
                                    └─[SupervisionCCGet]
                                      │ session id:      20
                                      │ request updates: true
                                      └─[MultilevelSwitchCCSet]
                                          target value: 0
                                          duration:     default
...
2023-01-14T05:36:48.894Z DRIVER « [Node 074] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -79 dBm
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 82
                                    └─[SupervisionCCReport]
                                        session id:          20
                                        more updates follow: false
                                        status:              Success
                                        duration:            0s

This response (SupervisionReport w/ status Success) makes all status updates unnecessary, since it implies that the target value has been reached.

However this already goes wrong after the 2nd command to node 88. It replies with success, but while Z-Wave JS sends the next command to the next node, this happens:

2023-01-14T05:36:49.140Z DRIVER « [Node 088] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -85 dBm
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 251
                                    └─[SupervisionCCGet]
                                      │ session id:      50
                                      │ request updates: false
                                      └─[MultilevelSwitchCCReport]
                                          current value: 0
                                          target value:  0
                                          duration:      0s

In other words, it tells us that it is at value 0 (which we already know) - but even worse, it asks for confirmation that we received that report. Suddenly each turn-off command no longer needs 2 messages, but 4.

As far as I can see, this happens for nodes 88, 150, 160, 74 (with a bit of delay), 154, 161 and on ~30 other occasions, not sure if all of those belong to the one switching attempt.

What's even worse, some devices become impatient and repeat this if they don't get a response fast enough, e.g. Node 154:

2023-01-14T05:36:50.013Z DRIVER « [Node 154] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -86 dBm
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 217
                                    └─[SupervisionCCGet]
                                      │ session id:      2
                                      │ request updates: false
                                      └─[MultilevelSwitchCCReport]
                                          current value: 0
                                          target value:  0
                                          duration:      0s

...didn't get the response after 400ms (!):
2023-01-14T05:36:50.411Z DRIVER « [Node 154] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -83 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command

... and once more after another 600ms:
2023-01-14T05:36:51.039Z DRIVER « [Node 154] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -83 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command

... and once more for good measure after another 500ms:
2023-01-14T05:36:51.569Z DRIVER « [Node 154] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -83 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command

... and hey, why not again after another 500ms:
2023-01-14T05:36:52.027Z DRIVER « [Node 154] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -83 dBm
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command

Then you've got Node 151, which reports a status change without even being turned off yet:

2023-01-14T05:36:49.429Z DRIVER « [Node 151] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -87 dBm
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 124
                                    └─[SupervisionCCGet]
                                      │ session id:      29
                                      │ request updates: false
                                      └─[BinarySwitchCCReport]
                                          current value: false
                                          target value:  false
                                          duration:      0s

Again, with a confirmation request.

And again, this time without confirmation:

2023-01-14T05:36:49.969Z DRIVER « [Node 151] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -88 dBm
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 125
                                    └─[BinarySwitchCCReport]
                                        current value: false
                                        target value:  false
                                        duration:      0s

And again a second later:

2023-01-14T05:36:50.904Z DRIVER « [Node 151] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -88 dBm
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 126
                                    └─[BinarySwitchCCReport]
                                        current value: false
                                        target value:  false
                                        duration:      0s

The actual turn off request only happens 30s later:

2023-01-14T05:37:26.880Z DRIVER » [Node 151] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ callback id:      131
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 45
                                    └─[SupervisionCCGet]
                                      │ session id:      44
                                      │ request updates: true
                                      └─[BinarySwitchCCSet]
                                          target value: false

This could mean that you have some commands forwarded through associations, which cause repeated unnecessary reports here.


Also, you seem to be triggering many of these commands twice, specifically Nodes 74, 87, 88, 112, 150, 154, 155, 156, 157. While the driver should be deduplicating those, it currently does not, so this causes roughly 9*4 = 36 unnecessary commands to be transmitted.


And on top of that, not all of your nodes seem to have a solid connection. For example, here it took 15 routing attempts over more than 10s to reach Node 154:

2023-01-14T05:37:00.706Z DRIVER » [Node 154] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ callback id:      116
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 5
                                    └─[SupervisionCCGet]
                                      │ session id:      31
                                      │ request updates: true
                                      └─[MultilevelSwitchCCSet]
                                          target value: 0
                                          duration:     default
...
2023-01-14T05:37:12.349Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:            116
                                    transmit status:        OK, took 11440 ms
                                    repeater node IDs:      87
                                    routing attempts:       15
                                    protocol & route speed: Z-Wave, 100 kbit/s
                                    ACK RSSI:               -68 dBm
                                    ACK RSSI on repeaters:  -86 dBm
                                    ACK channel no.:        0
                                    TX channel no.:         0
                                    route failed here:      80 -> 154

and then the follow-up nonce report took another 3s:

2023-01-14T05:37:12.355Z DRIVER » [Node 154] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x05
                                  │ callback id:      117
                                  └─[Security2CCNonceReport]
                                      sequence number:  6
                                      SOS:              true
                                      MOS:              false
                                      receiver entropy: 0x15e9dc0d22338e1fb643226a80dcdd3d
...
2023-01-14T05:37:15.404Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:            117
                                    transmit status:        OK, took 2980 ms
                                    repeater node IDs:      81
                                    routing attempts:       6
                                    protocol & route speed: Z-Wave, 100 kbit/s
                                    ACK RSSI:               -91 dBm
                                    ACK RSSI on repeaters:  -87 dBm
                                    ACK channel no.:        0
                                    TX channel no.:         0

Now this might be a side effect of the high traffic, but I'd look into that.


TL;DR: I think your main issue is too much unnecessary traffic. RF is a shared medium, so nodes (and the controller) need to take turns talking. If there is too much noise/traffic from other nodes, they will have problems reaching each other and will need to retry or change routes - all causing delay. Unfortunately, the 700 series controllers seem to be much more affected by this than the 500 series. Since all commands are queued and may only be sent after the previous cycle has completed, these problems will slow down the command throughput... up to the point where you have 10-15s delays because one node is not responsive.

Things I'd do:

Also, where are you located? This doesn't seem as bad as some results I've seen with the Z-Stick 7 - so probably US? The EU version has range issues, so that might affect throughput as well. Anyway, using a different stick (probably 500 series) may be worth a try if nothing else helps. You can migrate without having to re-include using Z-Wave JS UI and the NVM backup/restore feature.

AlCalzone commented 1 year ago

Just out of curiosity: You've turned off 23 nodes (if I didn't miss any), so this entire sequence should take 46 commands to be completed (23 outgoing, 23 success reports). In your log, I count 81 outgoing commands, and 81 incoming commands, so ~3.5x the expected amount.
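Spelled out, the arithmetic behind that estimate (numbers taken from the comment above) looks like this:

```typescript
// With Supervision, switching a node should cost exactly two frames:
// one supervised Set going out, and one SupervisionCCReport coming back.
const nodesSwitched = 23;
const expectedFrames = nodesSwitched * 2; // 46
const observedFrames = 81 + 81; // outgoing + incoming commands counted in the log
console.log((observedFrames / expectedFrames).toFixed(1)); // "3.5" times the expected traffic
```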

firstof9 commented 1 year ago

@ColColonCleaner We may need a list of all light entities to determine if duplicates are being thrown in. In HA, go to Dev-Tools -> Templates and enter:

{{ states.light | map(attribute='entity_id')|list }}

Please post the output from HA.

ColColonCleaner commented 1 year ago

Thank you so much for the detailed examination! I'll have a response soon, been busy.

ColColonCleaner commented 1 year ago

Still compiling a bunch of results and trying different things based on what you guys said, along with some other homegrown solutions. Doing that in tandem with taking videos for the manufacturer of these devices so it's taking some time.

ColColonCleaner commented 1 year ago

Thank you so much for the detailed breakdown and for helping me. It's been very frustrating working on this, trying to figure it out myself, it's lovely to have an expert take a look.

Responding to your points:

General Questions:

I don’t see a way of automating heals of the network in a reasonable fashion. There are options in JSUI to heal the network, or heal individual devices, but I don’t see a way for me to automate heals. There isn’t an option in the zwavejs integration to send a heal request. This would be nice so I could automate that process for problematic nodes. You mentioned that you’d start implementing application level routes, which would also solve this issue and be preferred to automated heals.

My Homegrown Solution:

I’ve re-created all of my lights (and soon all of my fans) as template entities in Home Assistant. This allowed me to override their turn_on/turn_off/set_level calls, so when they are called, their actions go into a global queue for execution with a dynamic delay. It uses a 5-second monitoring window: if 3 calls happen within that window, a 250ms delay is added between all subsequent calls; after no calls for 5 seconds, the delay resets to zero. This has completely removed the crashing problem, although it was quite the effort to set up.
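The core of that dispatch logic could be sketched like this (a hypothetical helper for illustration, not part of zwave-js or Home Assistant; the class and parameter names are made up):

```typescript
// Sketch of the described "homegrown" dispatch queue: within a 5s monitoring
// window, once a third call arrives, subsequent calls are spaced 250ms apart;
// after 5s of silence the window empties and the delay resets to zero.
class DispatchQueue {
  private timestamps: number[] = [];

  constructor(
    private readonly windowMs = 5000, // monitoring window / idle reset period
    private readonly threshold = 3, // calls within the window that trigger spacing
    private readonly spacingMs = 250, // delay applied once the threshold is hit
  ) {}

  /** Delay (ms) to apply before dispatching a call arriving at `now`. */
  public delayFor(now: number): number {
    // Drop calls that fell out of the window; this also handles the idle reset.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    this.timestamps.push(now);
    return this.timestamps.length >= this.threshold ? this.spacingMs : 0;
  }
}

const q = new DispatchQueue();
console.log(q.delayFor(0)); // 0   - first call
console.log(q.delayFor(100)); // 0   - second call
console.log(q.delayFor(200)); // 250 - third call within 5s triggers spacing
console.log(q.delayFor(9000)); // 0   - more than 5s idle, window is empty again
```

In the template entities, each overridden turn_on/turn_off/set_level call would consult this queue and wait the returned delay before forwarding the command to the real Z-Wave entity.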

Closing thoughts:

Could there be some way in the controller to queue up and rate limit outbound commands if we KNOW doing many of them at once will cause issues? It would allow people to tinker with settings, improve their network, figure out if firmware is the cause, etc, without an issue like this crippling things. I like to tinker, I like to improve things, but not when I'm stressed out about the whole thing not working properly as the buyer's remorse looms. Slightly slower but steady is much preferred to rocketing out the gate and then tripping over your own shoes because the shoes have terrible firmware and took 6 steps instead of 1.

Making things run smoothly is my goal, no crashing/dropouts, or gaps in commands because things went haywire. Speed and efficiency can come with tinkering and emails to Zooz, hopefully. But I’ve seen so many reports of congestion-related issues on 700 series that having the option to rate limit so the system doesn’t hang itself in a panic every time there is a burst of outbound commands would be really good. Like you said RF is a shared medium and providing a configurable rate limiter to make sure that sharing happens (even with the awful firmware on some devices) would be wonderful.

firstof9 commented 1 year ago

You should be using multicast.

ColColonCleaner commented 1 year ago

@firstof9 Just confirming: are you referring to multicast DNS, the networking feature? I have it enabled on my router, but that's completely unrelated to Z-Wave unless I've missed a setting somewhere. What are you referring to?

firstof9 commented 1 year ago

No, Z-Wave multicast, to turn off many devices of the same type at once: service: zwave_js.multicast_set_value

EDIT: caveat - it doesn't work with S2 security yet.
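For illustration, such a call from HA might look like the following sketch (the entity IDs are hypothetical; the data fields follow the Z-Wave JS integration's multicast_set_value service):

```yaml
# Hypothetical example: turn off several Binary Switch devices with a single
# multicast frame. Does not work with S2-included nodes, per the caveat above.
service: zwave_js.multicast_set_value
target:
  entity_id:
    - light.kitchen_main    # hypothetical entities
    - light.kitchen_island
data:
  command_class: "37"       # Binary Switch CC
  property: targetValue
  value: false
```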

ColColonCleaner commented 1 year ago

@firstof9 All of my devices are S2, so that's not an option for me. Also, I don't think that works with assistants, which send individual requests for each of the devices they are operating on, and that's out of my control. If I was running the automations myself then sure, but that doesn't apply to assistants. I could make some kind of catching mechanism to bundle them together after the fact, but the amount of effort required for that shouldn't be on individual implementers, since every person would need their own homegrown solution. If the controller receives a burst of requests of the same type, they should automatically be bundled as a multicast, assuming that feature was fixed to work with S2.

firstof9 commented 1 year ago

If the controller receives a burst of requests of the same type, they should automatically be bundled as a multicast

That would be @AlCalzone 's department, I'm not even sure the driver looks at things like that.

ndoggac commented 1 year ago

I'm seeing the same with the A7 stick. Whether a large number of entities are commanded from within a script or from a Node-RED flow (call-service with all queued messages), the network falls on its face. Devices actually become permanently unresponsive; a power cycle of the device itself, or restarting HA and unplugging the USB stick, does not help. The only remedy is exclusion/inclusion of the individual devices. Wondering if zwave-js has its own buffered queue where we could limit the message rate to the chipset / RF network? Thinking perhaps the chipset itself has some sort of message rate limit, but not sure how it could permanently make a device unresponsive to Z-Wave commands. Even when unresponsive, the device status shows as good in the UI. Health checks come up 0/10 for 5 rounds though.

44 devices, a mix of 24 Plus and 20 non-Plus, no security implemented though.

ColColonCleaner commented 1 year ago

@ndoggac Damn, you're having this issue even without S2? That's so bad. I thought the majority of my issues were because I had everything included securely and that has increased overhead, but I wasn't about to run without security.

When first starting down the Z-Wave path, I saw people mentioning it has less bandwidth than other systems, and my internal reaction to that was "oh, that's fine, I'm not trying to send video or images over that network so we're all good." Looking back now, that thought is hilarious, because trying to send a single image over Z-Wave would probably take a week and hang every device during the transfer.

Needing to slow down operations on a local install like this feels so wrong. One of the upsides of a local install is supposed to be massively improved response time, with no internet latency involved. But with Z-Wave, when actuating a bunch of devices, it's actually slower than cloud devices would be. And all of the devices cascade to their new state instead of all changing immediately. It's not at all what I expected when starting this setup, and the buyer's remorse is insane. So much time debugging things, so many late nights fighting with weird problems. Finding out that doing more than 5 commands in a second can cripple the network. Or that energy monitoring devices reporting every few seconds while the connected device is changing state can make the locks on my doors miss their updates. Buying 3 different controllers, and a 4th to destroy and convert to a zniffer (because apparently the main controller can't just, you know, output the raw packets it's sending/receiving, for some inexplicable reason), in a desperate attempt to improve how things worked. Zooz firmware that makes all the dimmers flash and flicker whenever they are healed.

I have tried so hard to make this work well. And I'm a software engineer by trade - I can't imagine what this would have been like for a regular Joe just wanting a locally run install. I usually love tinkering and making things work exactly how I want them, but trying to implement Z-Wave, let alone tinker on it, has been a living nightmare.

None of this is the fault of zwavejs/zui. Y'all have been wonderful and so helpful as I tried to navigate all this. I see it's a passion project and I respect that a lot. I'm glad for those who could get zwave as a whole working the way they wanted, but I don't think I can. I'm so exhausted, burned out.

neil1111 commented 1 year ago

@ColColonCleaner I’m incredibly impressed with the thoroughness of your approach and documentation of this problem. I have about 40 Z-Wave devices, all but 1 are Zooz, using a Zooz 700 series hub. I’ve had mixed results even getting many of them to connect via S2, even though that’s exactly why I went down this path.

I have a number of ZEN31s controlling ww/cw LEDs throughout the apartment, and those seem to be particularly problematic, either dropping commands or executing them 30 or more seconds late. I also regularly lose synchronization of their states with HA. All my automations are via NR.

Two questions:

  1. Are you seeing problems with ZEN 31s too?
  2. Would you mind explaining further your “home grown solution”, above?

I’m happy to try to help chase down this problem to drive better reliability. Thanks! Btw, I’m @ha_tinkerer on discord.

ColColonCleaner commented 1 year ago

@neil1111 Don't have any ZEN31 devices so can't help with that one. No house RGB to control so they weren't on my list. The best devices I've seen from Zooz are their scene controller, their basic ZEN71 toggle switch, and their tiny water leak sensor. Avoid ZSE18 like the plague.

I'm currently in the process of biting the bullet I hoped I wouldn't have to: taking almost all the powered devices out of my walls and returning them to Zooz for refunds. I'm certain I won't get full refunds, but I just can't stomach it anymore.

I got a bunch of Shelly relays and dimmers, their newer 'Plus' ones, and installed them in a few of the rooms so far. The experience is night and day. I can toggle every single light in the house 10 times a second if I wanted to - who knows, maybe fun for a burglar alarm when away. And they all respond instantly, like they are all on the same physical circuit. Barely does my finger leave the button and every single device responds immediately. I know that being WiFi based these require a good router and coverage, but I have a Ubiquiti long range AP (300 client limit), so I'll be good for a long while and have room to expand.

EDIT: It's important to have a separate VLAN for IoT devices in general, so if something is compromised it has less of a chance of affecting your main network. I already had that set up, so they are all isolated on their own network they can't call out of, but Home Assistant can open connections into it.

EDIT 2: The reason I found Shelly in the first place was looking for energy monitoring solutions. Their energy monitoring relays are a pittance more expensive than the base versions - amazing stuff. Now I can see the live power consumption of any controlled device. Energy monitoring is a real thorn in Z-Wave's side too, because doing it well requires a bit of chattiness.

Sticking with Z-Wave for scene controllers, battery powered sensors, and a few choice relays. Already had 3 Aeotec range extenders making sure everything had coverage, so we should be good on that front. Will need to heal everything to make sure they have updated routes, but that's just regular maintenance. It seems that Z-Wave is nicely suited for that kind of network: occasional sensor updates, temperature, humidity, etc, and taking user input and passing it to something else to act on. The cats are still being fed by Z-Wave relays too, lol. Both of their autofeeder clocks were always drifting away from real time, so I opened them up and soldered leads to the circuit boards for full control, haha. Works great. But once again, that's a sanitized, scheduled, occasional operation that seems really suited to Z-Wave.

Something that baffles me, though, is why the ubiquitous "put the controller on a USB extension" became a thing. I wonder why nobody has made a commercial Z-Wave hub with an antenna connector, so people could install their own high-gain antenna away from the controller - at least none that I've seen; there may be one for the RPi. We could run the antenna to the ceiling or higher and always get reports, even from the faintest of nodes, or respond amid congested traffic. I'm not versed in RF though, so there is probably something preventing a simple solution like that.

Once again, not the fault of zwavejs, just the chip in these controllers and the physical devices. ZJS/ZUI are doing great things and I 100% support the endeavors here.

As for the 'homegrown solution' I'll post the details soon once I can compile it into a nice format. Definitely want to make that available for people running into this problem. It works, stabilizes the network, it's just a compromise I wasn't happy living with given the investment.

P.S. Shelly makes an RGB controller.

AlCalzone commented 1 year ago

@ColColonCleaner I promised you some answers, so here we go:

First off, I assume we're talking about Zooz devices here? Even if you're getting rid of some devices already, I'd like to try and work with them to improve the situation - this isn't the first time I've seen problems like yours, but there's nothing to do on the controller side, except try to respond to the message flood in the best way possible.

Associations are cool and they seem faster than using automations, so I want to keep using them. It would be unfortunate if one of the features I was excited about (once I realized what it was) turned out to be unusable.

I get that and this should ideally be fine. Can you share a bit of information about this setup? Which devices control which using which association group? Typically there's a distinction between physically operating the devices and forwarding Z-Wave commands when it comes to associations. In your case you'd only want the devices to control the associated ones when you toggle the switches, and in that case it is absolutely expected that the controlled devices report their new status. When controlled via Z-Wave, this "forwarding" should ideally not happen, unless you set it up that way (or the device doesn't distinguish here).

Does the state of a device change in HA or JSUI when the command is acknowledged? Or does it happen when the state change event comes from the device? Both? That’s something I could argue to Zooz if they are doubling up.

Both. Commands can be sent unsupervised or supervised. When unsupervised, we only get the acknowledgement that the node received the command, not that it understood it. When supervised, we also get a response that the device understood the command and whether it executed it. In this case we know if the state change was done, and possibly how long it will take.

Z-Wave JS updates the state in the following situations:

  1. The command was sent unsupervised, the device acknowledged that it got the message, and the application has enabled "optimistic updates" in Z-Wave JS (which I think HA does). In this case, Z-Wave JS schedules a verification GET a couple of seconds later to verify that the change was done.
  2. The device reports its updated state. Any pending verification GET is canceled.
  3. The device responds to a verification GET.
  4. The command was sent supervised and the device responded that the change was successful.
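The four situations above can be condensed into a small state machine. The following is a hypothetical sketch, not the actual zwave-js implementation; the class and method names are made up for illustration:

```typescript
// Hypothetical illustration of the state-update flow described above;
// none of these names exist in zwave-js itself.
type OptimisticState = { value: boolean; verified: boolean };

class OptimisticValue {
  state: OptimisticState = { value: false, verified: true };
  private verifyTimer?: ReturnType<typeof setTimeout>;

  // Situation 1: command sent unsupervised and acknowledged. Update the
  // cached value optimistically and schedule a verification GET.
  onUnsupervisedAck(value: boolean, verifyDelayMs = 2000): void {
    this.state = { value, verified: false };
    this.verifyTimer = setTimeout(
      () => this.onVerificationGetResponse(value),
      verifyDelayMs
    );
  }

  // Situation 2: the device reports its state on its own.
  // Any pending verification GET is canceled.
  onUnsolicitedReport(value: boolean): void {
    if (this.verifyTimer !== undefined) clearTimeout(this.verifyTimer);
    this.verifyTimer = undefined;
    this.state = { value, verified: true };
  }

  // Situation 3: the device answered the verification GET.
  onVerificationGetResponse(value: boolean): void {
    this.state = { value, verified: true };
  }

  // Situation 4: a supervised command reported success; no verification
  // round-trip is needed.
  onSupervisedSuccess(value: boolean): void {
    this.state = { value, verified: true };
  }
}
```

Note how only situation 1 leaves the value in an unverified state, which is exactly why the verification GET exists.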

Disabling re-transmission of unacknowledged status reports seems like a bad thing to do. Maybe increasing the wait period before retransmission but not disabling it. If a device changed state/status, I want the controller to know that, right?

Theoretically I agree. But with S2, there's also a way to at least know whether the target node could decrypt the command without relying on Supervision: a node that fails to decrypt an S2 frame answers with a Nonce Report to re-synchronize. So after the acknowledgement, the sender can simply wait a short time, and if no Nonce Report arrives, the command was almost certainly decrypted and handled.

This wait-for-nonce approach has two downsides compared to Supervision, but I'd argue both are negligible when trying to send a status update in response to a controller command:

  1. It takes longer when sending multiple messages because of waiting for a possible Nonce Report. Using Supervision you can get around that. But how critical is it really that the controller receives the (unnecessary!) update 500 ms earlier?
  2. It is not 100% certain that the target understood the command (the frame was acknowledged, but a Nonce Report may have been lost). However, when responding to a controller command, the controller can just query the state if it is missing a state update.
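The wait-for-nonce check could be sketched roughly like this. The emitter, event name, and callback are stand-ins for illustration, not real zwave-js APIs:

```typescript
import { EventEmitter, once } from "node:events";

// Sketch of the "wait for a Nonce Report" idea: after the radio-level
// acknowledgement, listen briefly for a Nonce Report. A node that cannot
// decrypt an S2 frame answers with one, so silence during the window is
// a strong hint the command got through.
async function likelyDecrypted(
  node: EventEmitter,
  sendAndAwaitAck: () => Promise<void>,
  windowMs: number
): Promise<boolean> {
  await sendAndAwaitAck(); // frame was acknowledged on the radio level
  const gotNonceReport = once(node, "nonce report").then(() => true);
  const timedOut = new Promise<boolean>((resolve) =>
    setTimeout(() => resolve(false), windowMs)
  );
  // true => no Nonce Report arrived in time => decryption likely succeeded
  return !(await Promise.race([gotNonceReport, timedOut]));
}
```

This is exactly where downside 1 comes from: the happy path always pays the full window before the sender can move on to the next message.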

So IMO this strategy is better than what happens currently on the nodes:

  1. do not send state updates in response to supervised controller commands
  2. consider the wait-for-nonce strategy for less critical updates instead of using Supervision
  3. increase wait time between re-transmitting supervised commands
  4. lower amount of retries for non-critical updates (for example, it's really not that important that every single power meter reading every 15s reaches the controller. trying once or maybe twice is enough)

The upside of this is big though: The controller will be the one controlling the communication flow, not end devices which flood the network because they think their state update is super important now and needs to be responded to immediately.
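Items 3 and 4 of the list above could look roughly like this from the device's side. This is a sketch of the proposed policy, not firmware code; all names and the retry counts are illustrative assumptions:

```typescript
// Hypothetical sketch of the proposed retry policy: fewer attempts for
// non-critical reports (item 4) and an increasing wait between
// re-transmissions (item 3).
async function sendWithRetries(
  trySend: () => Promise<boolean>, // resolves true once the controller confirms
  critical: boolean,
  baseDelayMs = 250
): Promise<boolean> {
  // Item 4: non-critical updates (e.g. periodic meter readings) get at
  // most one retry; critical ones a few more.
  const maxAttempts = critical ? 4 : 2;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await trySend()) return true;
    if (attempt < maxAttempts) {
      // Item 3: back off before each re-transmission.
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1))
      );
    }
  }
  return false; // give up; the controller can query the state if it cares
}
```

The point of capping attempts low for non-critical traffic is exactly the flow-control argument above: a lost meter reading costs nothing, while endless re-transmissions congest the network for everyone.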

ColColonCleaner commented 1 year ago

@AlCalzone

  1. Yes, they are almost all Zooz devices. Overall the reception from their support has been great, but I've had so many issues implementing this that they offered me refunds (some partial) even though I'm outside the normal return window. I've logged a bunch of cases with them, along with videos and logs, for a variety of issues. The logged issues don't even include this one, because it seemed like something I should nail down with zwavejs first before bringing it up to them. But since it seems like I'd never get to the point with Z-Wave where every device in the house can actuate at once and respond instantly, I stopped pushing down that path. Because of the command queuing, even in the ideal case Z-Wave just doesn't seem to have the bandwidth for that: actuating 25+ devices is just not going to complete in less than a second no matter what I try, using security anyway. I've already pulled almost all the Z-Wave switches and dimmers out of my walls and replaced them with Shelly relays, so unfortunately I can't do any more large-scale testing. My mental state around home automation has improved so much now that things respond as quickly as I originally expected from a locally hosted system. I don't have long-term reliability stats on these yet, but they haven't missed a beat.

  2. Associations. I would expect the associations to fire regardless of why the device changed state. If a device is supposed to mirror its state to another device, I don't think there should be any distinction in how the state changed; otherwise things could become out of sync.

  3. Commands/acknowledgement. I really wish there was a standard for this that every device manufacturer followed. It seems that if the messaging formula isn't followed to the letter and the command count kept to an absolute minimum, things fall apart quickly. I noticed a couple of Aeotec devices I got include some tweaking parameters for command retransmission, which is neat, but seeing those there at all really made my stomach sink, because the reason they need to exist seems to be an inherent problem with the protocol itself.

Thank you for taking the time to respond with these details.

@neil1111 I'll hopefully have that setup documented soon. Getting this conversion done so I can send the devices back in a timely manner has taken all the spare time I have.

ColColonCleaner commented 1 year ago

Got refunds for almost all the hardwired zooz devices. With restocking fees they gave back about 75% of what I paid overall. I'm content with that, and very glad that stressful chapter in my life is behind me.

Pulling all of those out has made the remaining 40 Z-Wave devices a lot more reliable and snappy, since there is almost nothing happening on the network at any given point other than a couple of sensor updates, or a scene controller event that gets handled outside the network and so at most requires a single follow-up command.

I have a DND session to run tomorrow night, so I'll dedicate Wednesday to documenting the queued lighting.

ColColonCleaner commented 1 year ago

Also, I really look forward to utilizing the application priority routes Al has in the pipeline. There are a few battery-powered sensors that are draining their batteries really fast trying to communicate directly with the controller rather than via the repeater a few feet from them.

zwave-js-assistant[bot] commented 1 year ago

This issue has not seen any recent activity and was marked as "stale 💤". Closing for housekeeping purposes... 🧹

Feel free to reopen if the issue persists.