Interview failing at ProtocolInfo for nodes that already have been successfully included

MaKin211 commented 1 year ago

Hi everyone,

I am struggling with an issue I can just not solve. I used to have a Fibaro double switch 2 in my garden that was routed via a roller shutter node. It used to work flawlessly but then stopped due to a hardware defect. I wanted to replace it with another one but did not manage to include it. No matter what I tried. Even tested 2 other switches. With no luck. I then tried to include the switches right next (in the same room) to the controller (Z-Wave.Me USB Stick). Worked like a charm. I included all 3 to have them ready whenever I need them. The next day I installed the switch in the garden again but it appears dead on ZWave2MQTT. Tried another of the ones I included before. Same result…

I have no idea how to make the switches connect to the closest node. Before the first switch stopped working, it has been 100% reliable and nothing else changed.

Also tried healing the network, re-interviewing the nodes…

Two switches are now powered and are just waiting for the controller to say “Hi”… but for some reason they get stuck at the "ProtocolInfo" when trying to re-interview. I attached the log in silly mode.

Would really appreciate any help!

z-ui_2023-05-07.log

It is noded 47 and 48.

MaKin211 commented 1 year ago

Node 47 and 48 after powering them on. Node 49 is a node that is not powered. I just triggered the re-interview to compare between "real" nodes and offline ones. This is so confusing. :-(

robertsLando commented 1 year ago

Hi @MaKin211 ! Please provide driver log, loglevel debug and attach it here as a file (drag & drop into the text field).

The one you provided is application log file.

Also if possible try using an extension USB cable for your stick, that could fix the issue as it improves connectivity a lot

MaKin211 commented 1 year ago

Hi @robertsLando Thanks for your reply,

today in the morning I turned the power off and on again the the switches were both able to get interviewed correctly. An hour ago I power-cycled again to see whether they get back online and now they appear dead again.

All data was received correctly with the interviews succeeding.

Attached are the logs.

Btw.: I already use an extension cable and all the other nodes around my house (on 3 levels) work flawlessly.

logs.zip

Again, thanks! I really appreciate your time.

MaKin211 commented 1 year ago

I have two actors that I am experimenting with here. I just reset one of them in order to re-include it (after an failed attempt to exclude). Well, to no avail.

The next functioning wired node is approx. 6 meters away.

For some reason, they were able to get interviewed in the morning but after the power cycle they never appeared online again. "Last activity: never". :-(

Short extract from the logs:

2023-05-08 06:57:03.917 INFO Z-WAVE: [Node 048] Ready: Fibargroup - FGS223 (Double Switch 2)
2023-05-08 06:57:03.931 INFO Z-WAVE: [Node 048] Interview COMPLETED, all values are updated
2023-05-08 06:57:04.776 INFO Z-WAVE: [Node 048] Metadata updated: 91-0-scene-001
2023-05-08 06:57:04.777 INFO Z-WAVE: [Node 048] Metadata updated: 91-0-scene-002
2023-05-08 06:57:04.780 INFO Z-WAVE: [Node 048] Interview stage COMMANDCLASSES completed
2023-05-08 06:57:04.780 INFO Z-WAVE: [Node 048] Interview stage OVERWRITECONFIG completed
2023-05-08 06:57:04.781 INFO Z-WAVE: [Node 048] Interview stage COMPLETE completed
2023-05-08 06:57:04.781 INFO Z-WAVE: [Node 048] Interview COMPLETED, all values are updated
2023-05-08 06:57:05.575 INFO Z-WAVE: [Node 048] Value updated: 114-0-manufacturerId 271 => 271
2023-05-08 06:57:05.576 INFO Z-WAVE: [Node 048] Value updated: 114-0-productType 515 => 515
2023-05-08 06:57:05.576 INFO Z-WAVE: [Node 048] Value updated: 114-0-productId 4096 => 4096
2023-05-08 06:57:06.168 INFO Z-WAVE: [Node 048] Metadata updated: 91-0-scene-001
2023-05-08 06:57:06.168 INFO Z-WAVE: [Node 048] Metadata updated: 91-0-scene-002
2023-05-08 06:57:06.171 INFO Z-WAVE: [Node 048] Interview stage COMMANDCLASSES completed
2023-05-08 06:57:06.171 INFO Z-WAVE: [Node 048] Interview stage OVERWRITECONFIG completed
2023-05-08 06:57:06.172 INFO Z-WAVE: [Node 048] Interview stage COMPLETE completed
2023-05-08 06:57:06.172 INFO Z-WAVE: [Node 048] Interview COMPLETED, all values are updated
2023-05-08 06:57:07.038 INFO Z-WAVE: [Node 048] Value updated: 134-0-libraryType 3 => 3
2023-05-08 06:57:07.038 INFO Z-WAVE: [Node 048] Value updated: 134-0-protocolVersion 4.5 => 4.5
2023-05-08 06:57:07.038 INFO Z-WAVE: [Node 048] Value updated: 134-0-firmwareVersions 3.2 => 3.2
2023-05-08 06:57:07.039 INFO Z-WAVE: [Node 048] Value updated: 134-0-hardwareVersion 3 => 3

Finally coming to live, yay!

And then...

2023-05-08 10:29:32.934 INFO Z-WAVE: [Node 048] Is dead
2023-05-08 10:29:32.934 INFO Z-WAVE: Controller status: Scan completed
2023-05-08 10:29:32.934 INFO Z-WAVE: Network scan complete. Found: 24 nodes
2023-05-08 10:29:33.292 INFO Z-WAVE: [Node 032] Metadata updated: 50-1-value-66049
2023-05-08 10:29:33.292 INFO Z-WAVE: [Node 032] Value updated: 50-1-value-66049 0 => 0
2023-05-08 10:29:33.754 INFO Z-WAVE: [Node 032] Metadata updated: 50-1-value-65537
2023-05-08 10:29:33.754 INFO Z-WAVE: [Node 032] Value updated: 50-1-value-65537 0.69 => 0.69
2023-05-08 10:29:37.110 ERROR Z-WAVE: [Node 047] Interview FAILED: The node is dead
2023-05-08 10:29:43.391 ERROR Z-WAVE: [Node 048] Interview FAILED: The node is dead

AlCalzone commented 1 year ago

You'll want to look at the driver logs, not the application logs to figure out what's wrong. Those show that you definitely have a connectivity problem:

One communication attempt fails after 10 re-routing attempts because the node cannot be reached:

transmit status:        NoAck, took 7750 ms
routing attempts:       10
protocol & route speed: Z-Wave, 40 kbit/s

The next couple of attempts look like this:

transmit status:        OK, took 220 ms
repeater node IDs:      22, 24
routing attempts:       1
protocol & route speed: Z-Wave, 9.6 kbit/s
ACK RSSI:               -74 dBm

Not sure if those devices support it, but I see several other nodes communicating with 100 kbps. The 40 kbps that fail here are already a fallback if 100k isn't successful. The 9.6k attempt that finally succeeds can be seen as a last resort to get the data to the node somehow, but this is not a reliable connection (those attempts take 200-1000ms each). On top of that, you're using Security S0 to communicate with that node, which means for every actual command that gets exchanged, a total of 3 commands have to be sent back and forth. This means the communication is 3x as likely to fail than without S0.

Later attempts give a hint where your connectivity problem lies:

route failed here:      22 -> 24

I suggest following this guide and making sure all nodes on the way to the "problematic" ones have a solid connection: https://zwave-js.github.io/node-zwave-js/#/troubleshooting/network-health?id=testing-the-connection-strength You'll want to make sure that all measurements are good (compare values with what the guide says).

Another thing that stands out is that your log contains several meter reports with unchanged consumption values:

2023-05-08T00:54:00.513Z CNTRLR   [Node 030] [~] [Meter] value[66049]: 0 => 0                       [Endpoint 1]
2023-05-08T00:54:00.649Z CNTRLR   [Node 030] [~] [Meter] value[65537]: 1.09 => 1.09                 [Endpoint 1]

just to name one example. Granted, this is not as bad as I've seen before, just know that having too many of these reports can have a negative effect on your network. If possible, change those (and all other power meters) to report only on changes -> https://zwave-js.github.io/node-zwave-js/#/troubleshooting/network-health?id=optimizing-the-reporting-configuration

MaKin211 commented 1 year ago

Thank you very much for the insight and effort you spent for checking the logs. Let me try to address your points regarding the routing. I ran the health checks and find it hard to understand, let me explain.

Nodes 47 and 48 are the ones I want to connect. Without measuring the signal, I thought that node 46 is the closest neighbor. Nodes 22 and 24 however are also quite close. For that reason, I guess, 22 and 24 were chosen? But what does 22 -> 24 mean? Because 22 and 24 are both directly connected to the controller so why routing over 22 to 24 and not just 22 and/or 24? Or am I reading it wrong?

22 and 24 both have excellent health.

However, 46 has a very bad health report. Although it reacts immediately to changes so I do not notice any odd behavior like delays. What is even weirder is that the node 19 has a horrendous health report although it is a very reliable shutter and has not posed any issues whatsoever.

Can you help me bring light into the situation? And how can I improve?

The report issue is something I would like to address next, but I assume this is another thing, right?

MaKin211 commented 1 year ago

I added another node that is even closer to the one I am trying to connect. And honestly, I don't get it. I'd attest that the connection is reliable as it is pretty much next to another node that has a 9/10 rating.

I also re-included the S0 node without security.

Node 46 is now 51 (rating increased from 2/10 to 5/10 without any changes):

New node 50:

With the lost pings, why does only the latter show the reverse route from 50 -> 1 whereas node 51 does only show 1 -> 51?

For over a year I had no issues with Z-Wave devices outside my house. Then I even added two in-wall switches on the outside of my house (in wall) and instead of getting better, I cannot connect anything anymore. I am at my wits' end...

MaKin211 commented 1 year ago

So I've done some more research and tried the following:

Added another wall plug (node 36) as a repeater and plugged it into the in-wall switch (node 50)
Conducted health checks on nodes 36 and node 50
Conducted health checks on the repeating nodes (of nodes 36 and 50)

Health Check Node 50:

Health Checks for 22, 24 and 41:

Health Check Node 36 (plugged into node 50):

node 36 over 23 and 24 - plugged into node 50

Health Checks for 23:

So the preceding nodes look absolutely fine in their ratings and the health checks of 36 and 50 show now lost pings to the node but every ping back to the controller is lost. These preceding nodes are only meters away from the problematic nodes (same wall but on the other side).

I assume that for this reason, the next node I am trying to have in my network (garden) has no chance. What else can I try here?

AlCalzone commented 1 year ago

I agree that the information can be overwhelming a bit and isn't totally clear, so let's go step by step:

neighbors on the network map

This map is going to be replaced at some point. It currently only shows a reduced view of which nodes can "see" each other, but it is missing some of that information, e.g. which nodes on a "level" are connected. It does not give any indication of the link strength, it can be out of date and in some cases (probably in those barely connected situations) the information has been found to be plain wrong. TLDR: ignore it

I thought that node 46 is the closest neighbor

While this seems logical, radio signals don't work this way. In some cases a further away node may have a better connection than a closer one (interference, noise, ...).

Nodes 22 and 24 however are also quite close. For that reason, I guess, 22 and 24 were chosen? But what does 22 -> 24 mean? Because 22 and 24 are both directly connected to the controller so why routing over 22 to 24 and not just 22 and/or 24?

That's right. The routing algorithm in Z-Wave can sometimes be unintuitive. Especially when there are connectivity problems, it sometimes "brute forces" its way through the network and then sticks with a route that works, even if it is a bad route (better than no connection though). If you want to learn more, here's a good video.

In your case, the chosen route was actually going through 22 and then 24 (why ever), and that link seems suboptimal. Both nodes have a strong connection to the controller it seems. Try a health check from 22 to 24 or vice versa, not with the controller as the target.

However, 46 has a very bad health report

It looks like the connection is okay in one direction and terrible in the other. 1 -> 46: 0/10 means that all pings reached the node and all acknowledgements were received by the controller. In the other direction either some pings got lost (did not reach the controller), or their acknowledgements did not reach the node. I'm guessing it's the latter, since acknowledgements get lost more easily (short RF frame).

You can see that apparently the route was switched during the first testing round, probably because it dropped off. Would be interesting to know if the SNR margin is that bad because of noise near the controller (you can watch the RSSI graph if you click on the controller and look for spikes up), or because the node's route has a low RSSI value.

This is definitely more prominent on the node 19 checks. The controller can reach the node just fine, but it seems the node has trouble reaching the controller. Probably also caused by noise/interference near the node or along the route - you can check the info panel which route is actually taken here.

Node 46 is now 51 (rating increased from 2/10 to 5/10 without any changes)

after re-inclusion it probably uses a different route which is better than the previous one. Latency seems a bit high though - check if that is on 40 kbps or maybe even 9.6k?

Health Check Node 50 Health Check Node 36 (plugged into node 50): Health Checks for 23

It seems like this isn't a good spot for those nodes. We probably shouldn't show a green rating for a node with a bad SNR margin (around 0 means the signal is indistinguishable from noise). Not sure if Node 23 supports pinging the controller. I guess not and that you'd get 10/10 failed pings in that direction too.

Can you move 23 and 36 a meter or two? Having a repeater at the exact same spot where 50 has issues seems suboptimal. After you do that, do a single-node heal on all 3 nodes so they get new and hopefully better routes. Then redo their health check.

zwave-js-assistant[bot] commented 1 year ago

This issue has not seen any recent activity and was marked as "stale 💤". Closing for housekeeping purposes... 🧹

Feel free to reopen if the issue persists.

zwave-js / node-zwave-js

Interview failing at ProtocolInfo for nodes that already have been successfully included #5766