Hey @amilanov75!
It looks like you attached a logfile, but its filename doesn't look like a driver log that came from Z-Wave JS. Please make sure you upload the correct one.
I'm only seeing two control commands in the logs; after that it's only the Node-RED info:
2022-04-21T12:59:41.692Z NDERED « [Node: 12] [API: ValueAPI] [Method: getDefinedValueIDs]
2022-04-21T12:59:41.796Z NDERED » [Node: 12] [VALUE_ID_LIST] [DIRECT] Forwarding payload...
2022-04-21T12:59:41.798Z NDERED » [Node: 12] [VALUE_ID_LIST] [EVENT] Forwarding payload...
etc. etc.
so I'm not sure what's going on.
This seems to be the issue I talked about with @marcus-j-davies yesterday evening (optimistic value updates disabled, using Supervision). Could that be it?
Yeah it's the same one.
But since our convo, it seems another variable has been added: the rate of messages.
But also @amilanov75, you've contradicted yourself.
Here, you say lights do turn off.
But then on Discord:
OK. I'll raise it tomorrow. I'll also turn optimistic on and retest. I'll share the results, but remember that I sent a request for 0 and the light stayed on, optimistic or not.
So which is it?
Lights turn off but no update comes back, or lights don't turn off?
rate of messages.
Yeah I didn't mention this at first, because Supervision can help with managing a high steady flow of messages by avoiding the verification push or poll.
But these rates require the mesh to be in top shape. With a node directly next to the controller, I'm still getting round-trip times of ~25-30 ms for a ping, which doesn't leave any room for transmission errors or interference bursts. Anything that requires a response is more in the range of 40-50 ms, and if the messages are routed, this goes up.
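For context, a quick sketch of how such round-trip times could be measured with node-zwave-js's ping; the serial path and node ID are placeholders for your setup:

```ts
// Sketch: measure ping round-trip times with node-zwave-js.
// Serial path and node ID are placeholders.
import { Driver } from "zwave-js";

const driver = new Driver("/dev/serial/by-id/my-stick"); // placeholder path

driver.on("driver ready", async () => {
  const node = driver.controller.nodes.get(12)!; // placeholder node ID
  for (let i = 1; i <= 10; i++) {
    const start = Date.now();
    const ok = await node.ping(); // a NOP frame, no application-level payload
    console.log(`ping #${i}: ${ok ? "ok" : "failed"} after ${Date.now() - start} ms`);
  }
  await driver.destroy();
});

void driver.start();
```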
It might also be worth capturing a log with the throttled rate of commands and one without, i.e. full pelt.
We also have to be cautious that v9 (which is not used here) has changes to the internals of the message handling. Not sure if it makes a difference here, but still.
Hi, the issue happens both ways:
Ok, so I did some stress testing: turning 5x Dimmer 2s on and off multiple times a second.
The messages in zniffer do change from direct to routed and the routing errors appear, but in the end the messages will all be processed and the lights continue to turn on and off and report state as expected.
In the zniffer you can see single messages sent to each light individually, with a big break in between. File: 2022-04-22 5x Dimmers Direct.txt
You can also see what happens for non-throttled requests, rotating turning on and off at least once a second for more than 10 seconds. File: 2022-04-22 5x Dimmers Non-Filtered for more than 10 seconds.txt
Once a device changes its route from direct to routed, the only reliable way I can get it back to direct is to run the "Health Check" in the NR UI. Running "Heal Node" or starting a "Network Heal" does not reliably return the route to direct. I am not sure what happens in "Health Check", but it does seem to return the route to direct.
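For reference, the UI's Health Check appears to map to the driver's lifeline health check; here's a rough sketch of running it directly (serial path, node ID and number of rounds are placeholders, and the result fields are as I understand the 8.x API):

```ts
// Sketch: run a lifeline health check for one node directly via node-zwave-js.
// Serial path, node ID and number of rounds are placeholders.
import { Driver } from "zwave-js";

const driver = new Driver("/dev/serial/by-id/my-stick"); // placeholder path

driver.on("driver ready", async () => {
  const node = driver.controller.nodes.get(6)!; // placeholder node ID
  // One round of the lifeline health check (pings, latency measurement,
  // and a 0-10 health rating for the route to the controller).
  const summary = await node.checkLifelineHealth(1);
  console.log("overall rating:", summary.rating);
  console.log("rounds:", JSON.stringify(summary.results, null, 2));
  await driver.destroy();
});

void driver.start();
```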
I have had to rename the extension on the zniffer files from (zlf) to (txt) to allow me to upload them.
I'll do some tests on the Fibaro RGBW2 modules now and report back. Oh, I just realised: do you want the zwave logs associated with the above? If so, I'll re-run the tests and post the log.
So this is the same test for 3x RGBW2s
1) First, the log with routing on the RGBW2s: 2022-04-22 3x RGBW2 with routing.txt
2) Second, after the health check (1x round only) on each of the 3x devices, routing is back to direct: 2022-04-22 3x RGBW2 after health check to return to direct routes.txt
3) Finally, after repeated on/off cycles without throttling: 2022-04-22 3x RGBW2 repeated on off per second.txt
This is the zwave log from step 3: zwave-js.log
So the above two examples are actually producing a different issue and may warrant a separate ticket; however, I wanted to see what would happen with smaller volumes of devices.
Next I will complete the test with all 8x devices together. First I will run the health check to get them all back to direct. I'll save the zniffer log and zwave log when I am cycling non-throttled on and off commands.
PS: Zniffer logs are renamed to .txt
The health checks for each of the 8x devices always come back with 10 for all three results. The latency is always 20 ms. (EDIT: I saw a couple of 30 ms creep in on later Health Checks.)
There is one exception, node 6 (Fibaro Dimmer 2) goes into a perpetual health check each time I run it. I have to abort the health check and restart the flow to stop the constant zwave traffic, so I have removed this from the following test (in case it is causing a different issue) which will now be made up of 4x Dimmer 2s and 3x RGBW2s.
Ok, so the original issue has not occurred, however the test conditions are not exactly the same as when I saw the issue. I've done this on purpose to build up to the final test.
These files below represent unthrottled on/off for the 7x devices mentioned above, however ALL the devices are on the same network.
For the final test I will add in devices from a 2nd network.
2022-04-22 3x RGBW2 and 4x Dimmer2 non-throttled.txt zwave-js.log
Regarding the routing vs. non-routing: Z-Wave JS doesn't control that, since routing is handled by the Z-Wave protocol which is implemented in the controller firmware. The routing algorithm is pretty straightforward AFAIK, but can lead to suboptimal (yet working) routes:
As long as a route works, the protocol will stick to it, even if it includes unnecessary hops. You can get around that by providing an application priority route, which will always be tried first, but this isn't implemented yet. During healing, new routes will be determined, which often gets the original non-hop LWR (last working route) back, especially if it just failed temporarily. Again, the logic behind this lives in the controller firmware.
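For completeness, a small sketch of kicking off a heal for a single node from code, in case that is easier than the UI; the serial path and node ID are placeholders, and whether a better route is actually found is still up to the controller firmware:

```ts
// Sketch: ask the controller to re-determine the routes for a single node.
// Serial path and node ID are placeholders; route selection itself happens
// inside the controller firmware.
import { Driver } from "zwave-js";

const driver = new Driver("/dev/serial/by-id/my-stick"); // placeholder path

driver.on("driver ready", async () => {
  const success = await driver.controller.healNode(22); // placeholder node ID
  console.log(`heal of node 22 ${success ? "succeeded" : "failed"}`);
  await driver.destroy();
});

void driver.start();
```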
As for why the routes sometimes switch, I can only guess: temporary routing errors, e.g. interference bursts causing an ACK to get lost, etc. The routing errors in the Zniffer logs point in that direction. Here's the exact situation that introduced a hop in the route to node 22:
2022-04-22T12:31:04.893Z DRIVER » [Node 022] [REQ] [SendData]
│ transmit options: 0x25
│ callback id: 46
└─[MultiChannelCCCommandEncapsulation]
  │ source: 0
  │ destination: 1
  └─[SupervisionCCGet]
    │ session id: 59
    │ request updates: true
    └─[MultilevelSwitchCCSet]
        target value: 0
...
2022-04-22T12:31:04.924Z DRIVER « [REQ] [SendData]
callback id: 46
transmit status: OK, took 20 ms
routing attempts: 1
protocol & route speed: Z-Wave, 100 kbit/s
ACK RSSI: -54 dBm
ACK channel no.: 0
TX channel no.: 0
# All good until here
2022-04-22T12:31:05.028Z DRIVER » [Node 011] [REQ] [SendData]
│ transmit options: 0x25
│ callback id: 47
└─[MultiChannelCCCommandEncapsulation]
  │ source: 0
  │ destination: 1
  └─[SupervisionCCGet]
    │ session id: 60
    │ request updates: true
    └─[MultilevelSwitchCCSet]
        target value: 10
2022-04-22T12:31:05.216Z DRIVER « [REQ] [SendData]
callback id: 47
transmit status: OK, took 180 ms
routing attempts: 3
protocol & route speed: Z-Wave, 40 kbit/s
ACK RSSI: -57 dBm
ACK channel no.: 0
TX channel no.: 1
# The controller needed more attempts here and dropped to 40 kbps.
Comms with node 21 look fine, but:
2022-04-22T12:31:05.378Z DRIVER » [Node 022] [REQ] [SendData]
│ transmit options: 0x25
│ callback id: 49
└─[MultiChannelCCCommandEncapsulation]
  │ source: 0
  │ destination: 1
  └─[SupervisionCCGet]
    │ session id: 62
    │ request updates: true
    └─[MultilevelSwitchCCSet]
        target value: 10
2022-04-22T12:31:05.677Z DRIVER « [REQ] [SendData]
callback id: 49
transmit status: OK, took 290 ms
repeater node IDs: 19
routing attempts: 5
protocol & route speed: Z-Wave, 100 kbit/s
ACK RSSI: -55 dBm
ACK RSSI on repeaters: N/A
ACK channel no.: 0
TX channel no.: 0
# This time it took 5 attempts to reach node 22, and it is now routed. 300 ms also smells like an explorer frame. The next communication with node 22 uses that route immediately, but takes 50-60 ms:
2022-04-22T12:31:06.797Z DRIVER » [Node 022] [REQ] [SendData]
│ transmit options: 0x25
│ callback id: 52
└─[MultiChannelCCCommandEncapsulation]
  │ source: 0
  │ destination: 1
  └─[SupervisionCCGet]
    │ session id: 2
    │ request updates: true
    └─[MultilevelSwitchCCSet]
        target value: 0
...
2022-04-22T12:31:06.857Z DRIVER « [REQ] [SendData]
callback id: 52
transmit status: OK, took 50 ms
repeater node IDs: 19
routing attempts: 1
protocol & route speed: Z-Wave, 100 kbit/s
ACK RSSI: -55 dBm
ACK RSSI on repeaters: N/A
ACK channel no.: 0
TX channel no.: 0
Let me just stress again, this rate of communication isn't what Z-Wave is designed for. Even your dimmers start reporting that they can't handle your commands completely:
# First command sent
2022-04-22T12:31:06.592Z DRIVER » [Node 011] [REQ] [SendData]
│ transmit options: 0x25
│ callback id: 50
└─[MultiChannelCCCommandEncapsulation]
  │ source: 0
  │ destination: 1
  └─[SupervisionCCGet]
    │ session id: 63
    │ request updates: true
    └─[MultilevelSwitchCCSet]
        target value: 0
# First command executing
2022-04-22T12:31:06.713Z DRIVER « [Node 011] [REQ] [ApplicationCommand]
└─[MultiChannelCCCommandEncapsulation]
  │ source: 1
  │ destination: 0
  └─[SupervisionCCReport]
session id: 63
more updates follow: true
status: Working
duration: 1s
# Second command sent
2022-04-22T12:31:06.942Z DRIVER » [Node 011] [REQ] [SendData]
│ transmit options: 0x25
│ callback id: 53
└─[MultiChannelCCCommandEncapsulation]
  │ source: 0
  │ destination: 1
  └─[SupervisionCCGet]
    │ session id: 3
    │ request updates: true
    └─[MultilevelSwitchCCSet]
        target value: 10
# Second command executing
2022-04-22T12:31:07.063Z DRIVER « [Node 011] [REQ] [ApplicationCommand]
└─[MultiChannelCCCommandEncapsulation]
  │ source: 1
  │ destination: 0
  └─[SupervisionCCReport]
session id: 3
more updates follow: true
status: Working
duration: 1s
# First command aborted
2022-04-22T12:31:07.126Z DRIVER « [Node 011] [REQ] [ApplicationCommand]
└─[MultiChannelCCCommandEncapsulation]
  │ source: 1
  │ destination: 0
  └─[SupervisionCCReport]
session id: 63
more updates follow: false
status: Fail
duration: 0s
Basically the dimmers get the next command before they are done with the previous one. I'm not surprised that this causes problems with state tracking.
Honestly, I'm not sure what you're trying to achieve here. If you want to make sure devices turn on at the same time, just use Multicast.
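For reference, a rough sketch of what that could look like at the node-zwave-js level; node IDs, the value and the serial path are placeholders, and multicast can only span nodes that belong to the same controller:

```ts
// Sketch: switch several dimmers off with a single multicast command
// instead of one SendData per node. Node IDs and serial path are placeholders.
import { Driver, CommandClasses } from "zwave-js";

const driver = new Driver("/dev/serial/by-id/my-stick"); // placeholder path

driver.on("driver ready", async () => {
  // All member nodes must belong to this controller; a second network
  // needs its own multicast group on its own driver instance.
  const group = driver.controller.getMulticastGroup([8, 11, 19, 21, 22]);
  await group.setValue(
    { commandClass: CommandClasses["Multilevel Switch"], property: "targetValue" },
    0 // turn the whole group off with one frame
  );
});

void driver.start();
```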
Yesterday I thought the issue was that the assumed state in Z-Wave JS wasn't correct after a supervised Set with optimistic value updates disabled, but this seems to be something else entirely.
I am adding 3x RGBW2s into the test. These are on another network and have slightly worse Health Check stats; one had a 50 ms response time. All three report straight 7s.
So for this test there are:
Right, the test did not go to plan. The sync issue did not reappear, which makes me think that the node 6 device is the root cause of the issue.
Logs below: 2022-04-22 10x devices unthrottled.txt 2022-04-22 Kitchen-1881 zwave-js.log 2022-04-22 Cellar-1882 zwave-js.log
I will reset everything and retest. Maybe the network needs to have lots of routes before the issue happens....
EDIT: It's happening again. I'll set up zniffer and drop the zwave.log file in. Give me a minute.
@AlCalzone
"Let me just stress again, this rate of communication isn't what Z-Wave is designed for. Even your dimmers start reporting that they can't handle your commands completely:"
I think we are not on the same page. You asked me to test "unthrottled". That is what I am doing.
Of course it's not a real life test. Did I misunderstand what you are asking of me?
The issue happens when I simply turn the lights on or off when the rate of messages is around 10 per second. I thought Z-Wave was designed to handle messages at least 100 ms apart (i.e. a rate of 10 messages per second). In fact I would have thought it was designed for 50 ms or maybe faster.
I've spent several hours now running these tests.
Please tell me exactly what test you want me to run and I'll do it.
"Yesterday I thought the issue was that the assumed state in Z-Wave JS wasn't correct after a supervised Set with optimistic value updates disabled, but this seems to be something else entirely."
Yes, you are right, this is new information but I think it's related.
Please just tell me exactly what test you want me to run and I'll do it. I can paste in logs where the state is going out of sync again, if that's what you want.
EDIT: And yes, I can design the code to reduce traffic by sending multicast messages. I am not sure if multicast works across device types, as I've never used it before, so I could either reduce traffic from 11x messages (in my prod setup) to 2x messages across both networks (if multicast is device agnostic), or to 4x messages (2x for the RGBW2s and 2x for the Dimmer 2s). This will reduce the traffic and maybe the issue won't happen. I will look into this, but I also need to know if you want further details on the sync issue.
EDIT2: I have tested again at a rate of 10 messages per second with "optimistic enabled" and the syncing issue is happening again.
The throttled vs. unthrottled log request was because you mentioned in the first post that this is how to trigger the situation, but there is simply too much going on to identify what's wrong.
Please tell me exactly what test you want me to run and I'll do it.
The minimum required scenario to trigger the state going out of sync. The less information to parse, the better. If it happens when turning 5 devices on or off at the same time, that's fine with me. But I need to know exactly what I'm looking for.
FWIW, multicast should be supported by all nodes, there's nothing special to it.
Ok, cool. Thanks.
I was trying to give you more info rather than less, but let me post the zwave and zniffer logs with it happening. I have set messages to a max of 10 per second...
Results to follow.
PS: At the end of this testing I will implement multicast. Thank you for the reminder to use it. I can reduce 11x messages down to 2x using it, which can only be a good thing.
Do you care if optimistic is on or off, as I have it currently disabled?
I can run with it both on and off if you want?
So the test is now with 11 lights.
Interestingly, when turning the 11x lights off, 3x of them incorrectly reported still having a value of 15. I stopped the zniffer log and then the zwave logs, but when I came back to look at the current values, 2x of the devices had updated to 0. This was more than 10 seconds later.
I did capture an image of the one remaining light (node 4 of network C4705755), which at the time of writing still shows a value of 15 in the NR UI, despite it actually being off.
Logs are here:
2022-04-22 Zniffer 11x devices - issue occuring.txt
2022-04-22 Cellar-1882 zwave1-js.log
2022-04-22 Kitchen-1881 zwave1-js.log
Apologies for the confusion above. I was being overzealous with the testing :)
Please advise if you want the same test with optimistic reporting disabled.
Ok, this is a weird issue, but I think I know what's up. Both node 4 and node 21 are toggled with Supervision. Both report success; however, node 21 re-transmits its messages over and over. There seems to have been another burst of interference at this point, since the message to node 8, which was sent at the same time, took 1270 ms and 8 attempts.
Since supervision is used, no verification of the state is necessary, so none is done and there is no value update. However, since node 21 re-transmitted its message, the later messages are interpreted as a new (unknown) supervision session. Because it is unknown, the driver does refresh node 21's status, resulting in an update.
TL;DR: value updating with supervision isn't working as intended.
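To make the intended flow concrete, here's a rough conceptual sketch (my own illustration with made-up names, not the actual driver code) of how a supervised Set would feed the value cache so that neither a verification GET nor an optimistic update is needed:

```ts
// Conceptual sketch only (not actual driver code): how a supervised Set
// result is supposed to drive the cached value.
type SupervisionStatus = "Success" | "Working" | "Fail" | "NoSupport";

function onSupervisedSetResult(
  status: SupervisionStatus,
  applyTargetToCache: () => void, // e.g. currentValue := targetValue
  scheduleVerificationGet: () => void // fall back to polling the device
): void {
  if (status === "Success") {
    // The device confirmed execution: mirror the target value into the cache.
    applyTargetToCache();
  } else if (status === "Working") {
    // Still transitioning: re-check once the reported duration has elapsed.
    scheduleVerificationGet();
  } else {
    // Fail / NoSupport: poll the actual state instead.
    scheduleVerificationGet();
  }
}
```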
Hi @AlCalzone, I'm glad you figured out where the issue is coming from. Cheers. Of course, if you need anything more from me, just shout out.
Hi @AlCalzone
Can you please tell me if the attached log is the same issue as reported above?
I am changing the colour of a light and yet after x attempts the colour either does not change or there is a long delay (~10 seconds or more) before the change actually appears.
Background for reviewing the file:
All I can see is what I would expect; all seems to happen within 1 s. But to add a disclaimer: I'm not the expert.
Flow OR UI Control sending the request
2022-04-27T14:12:57.965Z NDERED « [CONTROLLER] Payload received.
2022-04-27T14:12:57.965Z NDERED « [Node: 11] [API: ValueAPI] [Method: setValue]
└─[params]
  0: {"commandClass":51,"property":"targetColor","endpoint":1}
  1: {"warmWhite":0,"red":255,"green":41,"blue":0}
Driver doing its thing:
2022-04-27T14:12:57.968Z SERIAL » 0x011700130b10600d0001330504000002ff032904003e252b5d (25 bytes)
2022-04-27T14:12:57.968Z DRIVER » [Node 011] [REQ] [SendData]
│ transmit options: 0x25
│ callback id: 43
└─[MultiChannelCCCommandEncapsulation]
  │ source: 0
  │ destination: 1
  └─[ColorSwitchCCSet]
      Warm White: 0
      Red: 255
      Green: 41
      Blue: 0
Driver receiving confirmation:
2022-04-27T14:12:57.999Z SERIAL « 0x011800132b00000200de7f7f7f7f000003000000000301000002 (26 bytes)
2022-04-27T14:12:57.999Z SERIAL » [ACK] (0x06)
2022-04-27T14:12:57.999Z DRIVER « [REQ] [SendData]
callback id: 43
transmit status: OK, took 20 ms
routing attempts: 1
protocol & route speed: Z-Wave, 100 kbit/s
ACK RSSI: -34 dBm
ACK channel no.: 0
TX channel no.: 0
Update received in userland (the colour components)
2022-04-27T14:12:58.001Z CNTRLR [Node 011] [~] [Color Switch] currentColor[0]: 0 => 0 [Endpoint 1]
2022-04-27T14:12:58.001Z NDERED » [Node: 11] [VALUE_UPDATED] [DIRECT] Forwarding payload...
2022-04-27T14:12:58.001Z NDERED » [Node: 11] [VALUE_UPDATED] [EVENT] Forwarding payload...
2022-04-27T14:12:58.002Z CNTRLR [Node 011] [~] [Color Switch] currentColor[2]: 0 => 255 [Endpoint 1]
2022-04-27T14:12:58.002Z NDERED » [Node: 11] [VALUE_UPDATED] [DIRECT] Forwarding payload...
2022-04-27T14:12:58.002Z NDERED » [Node: 11] [VALUE_UPDATED] [EVENT] Forwarding payload...
2022-04-27T14:12:58.002Z CNTRLR [Node 011] [~] [Color Switch] currentColor[3]: 4 => 41 [Endpoint 1]
2022-04-27T14:12:58.002Z NDERED » [Node: 11] [VALUE_UPDATED] [DIRECT] Forwarding payload...
2022-04-27T14:12:58.002Z NDERED » [Node: 11] [VALUE_UPDATED] [EVENT] Forwarding payload...
2022-04-27T14:12:58.003Z CNTRLR [Node 011] [~] [Color Switch] currentColor[4]: 255 => 0 [Endpoint 1]
2022-04-27T14:12:58.003Z NDERED » [Node: 11] [VALUE_UPDATED] [DIRECT] Forwarding payload...
2022-04-27T14:12:58.003Z NDERED » [Node: 11] [VALUE_UPDATED] [EVENT] Forwarding payload...
2022-04-27T14:12:58.003Z CNTRLR [Node 011] [~] [Color Switch] currentColor: {"warmWhite":0,"red":255,"green":41,"blue":0} => {"warmWhite":0,"red":255,"green":41,"blue":0} [Endpoint 1]
2022-04-27T14:12:58.004Z NDERED » [Node: 11] [VALUE_UPDATED] [DIRECT] Forwarding payload...
2022-04-27T14:12:58.004Z NDERED » [Node: 11] [VALUE_UPDATED] [EVENT] Forwarding payload...
2022-04-27T14:12:58.004Z CNTRLR [Node 011] [~] [Color Switch] targetColor: {"warmWhite":0,"red":255,"green":41,"blue":0} => {"warmWhite":0,"red":255,"green":41,"blue":0} [Endpoint 1]
2022-04-27T14:12:58.004Z NDERED » [Node: 11] [VALUE_UPDATED] [DIRECT] Forwarding payload...
2022-04-27T14:12:58.004Z NDERED » [Node: 11] [VALUE_UPDATED] [EVENT] Forwarding payload...
2022-04-27T14:12:58.005Z CNTRLR [Node 011] [~] [Color Switch] hexColor: "0004ff" => "ff2900" [Endpoint 1]
2022-04-27T14:12:58.005Z NDERED » [Node: 11] [VALUE_UPDATED] [DIRECT] Forwarding payload...
2022-04-27T14:12:58.005Z NDERED » [Node: 11] [VALUE_UPDATED] [EVENT] Forwarding payload...
You then move on to change the value to something different (and all the same things happen):
2022-04-27T14:13:01.675Z NDERED « [CONTROLLER] Payload received.
2022-04-27T14:13:01.675Z NDERED « [Node: 11] [API: ValueAPI] [Method: setValue]
└─[params]
0: {"commandClass":51,"property":"targetColor","endpoint":1}
1: {"warmWhite":0,"red":255,"green":142,"blue":0}
If the above is correctly described, then the actual change seems to be something happening locally on the device; the logs above show nothing being delayed driver-side.
But again, I'm no expert at the internals.
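For reference, a minimal sketch of the equivalent call at the node-zwave-js level, using the node, endpoint and colour values from the log above; the serial path is a placeholder, and this is not the exact Node-RED internals:

```ts
// Sketch: the setValue call corresponding to the logged ValueAPI request.
// Node ID, endpoint and colour values are taken from the log above;
// the serial path is a placeholder.
import { Driver, CommandClasses } from "zwave-js";

const driver = new Driver("/dev/serial/by-id/my-stick"); // placeholder path

driver.on("driver ready", async () => {
  const node = driver.controller.nodes.get(11)!;
  await node.setValue(
    { commandClass: CommandClasses["Color Switch"], endpoint: 1, property: "targetColor" },
    { warmWhite: 0, red: 255, green: 41, blue: 0 }
  );
});

void driver.start();
```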
Hi @AlCalzone, what is the likely turnaround time for this issue?
Can you please tell me if the attached log is the same issue as reported above?
It's not; these commands aren't using Supervision. Like @marcus-j-davies, I'm also not seeing any communication delays.
@AlCalzone Do you have a release date for this?
no
Hi @AlCalzone
Any thoughts yet on a fix for this bug?
I'll tackle this as part of the v10 release, which I've been working on for a while.
Good man! Thank you.
Is your problem within Home Assistant (Core or Z-Wave JS Integration)?
NO, my problem is NOT within Home Assistant or the ZWave JS integration
Is your problem within ZWaveJS2MQTT?
NO, my problem is NOT within ZWaveJS2MQTT
Checklist
[X] I have checked the troubleshooting section and my problem is not described there.
[X] I have read the changelog and my problem was not mentioned there.
Describe the bug
What causes the bug? When I increase the frequency of messages to more than ~10 per second (i.e. every 100 ms), and definitely at 20 per second (i.e. every 50 ms), syncing issues appear between the device's actual state and the state reported in the Node-RED UI.
What do you observe? When changing the brightness of, or turning off, a group of lights (e.g. around 5x Fibaro Dimmer 2s and 5x Fibaro RGBW2s), the reported state in the Node-RED UI will not update, but the device does change its state.
What did you expect to happen? For example, when turning all 10x lights off, the Node-RED UI should report a Current Value = 0 against all lights. For one (or more) of the lights, the value does not change from the current value, e.g. 10. The light (or lights) do turn off. Repeatedly sending a payload = 0 will not update the NR UI.
Steps to reproduce the behaviour: by increasing the rate of messages to approx. >10 per second, the issue occurs. By decreasing the rate to approx. <3 per second, the issue does not occur.
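For illustration, a rough sketch of the kind of unthrottled toggling described above; node IDs, the serial path and the 100 ms interval are placeholders, not the exact Node-RED flow used here:

```ts
// Repro sketch: unthrottled toggling of a group of multilevel switches.
// Node IDs, serial path and the 100 ms interval are placeholders.
import { Driver, CommandClasses } from "zwave-js";

const driver = new Driver("/dev/serial/by-id/my-stick"); // placeholder path
const nodeIds = [4, 6, 11, 21, 22]; // placeholder: dimmers/RGBW2s under test

driver.on("driver ready", () => {
  let on = false;
  // >10 commands per second across the group reproduces the issue; <3/s does not.
  setInterval(() => {
    on = !on;
    for (const id of nodeIds) {
      const node = driver.controller.nodes.get(id);
      void node?.setValue(
        { commandClass: CommandClasses["Multilevel Switch"], endpoint: 1, property: "targetValue" },
        on ? 10 : 0
      );
    }
  }, 100);
});

void driver.start();
```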
Device information
Node ID = 11 Device = Fibaro RGBW2
As per images below: Payload = 0 Reported Value in NR UI = 10
How are you using node-zwave-js?
- zwavejs2mqtt Docker image (latest)
- zwavejs2mqtt Docker image (dev)
- zwavejs2mqtt Docker manually built (please specify branches)
- ioBroker.zwave2 adapter (please specify version)
- HomeAssistant zwave_js integration (please specify version)
- pkg
- node-red-contrib-zwave-js (please specify version, double click node to find out)
Which branches or versions?
Version Information
Module Version: 6.5.3
Driver Version: 8.11.2
DB Version: 8.11.2
Did you change anything?
no
If yes, what did you change?
No response
Did this work before?
Don't know, this is a new device
If yes, where did it work?
No response
Attach Driver Logfile
zwave-js.log