zigpy / zigpy-znp

TI CC2531, CC13x2, CC26x2 radio support for Zigpy and ZHA
GNU General Public License v3.0

Devices become unavailable but come back after restarting Controller (or bulb) #76

Closed dumpfheimer closed 3 days ago

dumpfheimer commented 3 years ago

I am using ZHA in Home Assistant and "quite frequently" (every few days) have bulbs that seem to be unresponsive. They show up as unavailable in HA and do not seem to respond to any Cluster Commands.

I always thought it was the bulb's fault and simply restarted the bulb by power cycling it. But I noticed that restarting the controller often works too (I am using a Launch XL CC1352-P, I believe).

This makes me believe that the issue could actually be fixed in zigpy. Any help on how I can approach fixing this is appreciated.

puddly commented 3 years ago

Are you running the most recent build of Z-Stack on the device? There were a few lockup bugs with older firmware releases.

dumpfheimer commented 3 years ago

HA says "Texas Instruments CC1352/CC2652, Z-Stack 3.30+ (build 20210120)"

I believe this is the latest version (at least if you are talking about koenkk firmware builds)

puddly commented 3 years ago

Is it always a specific bulb or is it the entire network?

Give the most recent build from the develop branch a try: https://github.com/Koenkk/Z-Stack-firmware/tree/develop/coordinator/Z-Stack_3.x.0/bin

You should take a network backup before upgrading and restore it after upgrading, to ensure your network settings aren't erased. I don't recall the UNIFLASH defaults but it's better to be safe than sorry.

If that doesn't fix things, it would be most helpful if you recorded ZHA debug logs for a few days, until the issue appears.

dumpfheimer commented 3 years ago

Got it, thank you very much for helping so quickly! I will flash the dev image and turn on debugging.

The problem occurs with multiple lamps from multiple vendors (Hue, Gledopto, Tradfri), and I had some issues with Heimann smoke detectors too (they were changing their NWK address when changing their primary router), but that was tackled with your PR https://github.com/zigpy/zigpy-znp/pull/73 .

Will report back, thanks again.

dumpfheimer commented 3 years ago

During the network backup I got these messages:

2021-07-04 20:03:46.066 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=14:ff:fe:ed:71:cb:5d:08)
2021-07-04 20:03:46.074 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=14:ff:fe:ed:71:cb:78:04)
2021-07-04 20:03:46.076 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=4b:00:22:47:df:a2:c3:9d)
2021-07-04 20:03:46.083 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=e2:ff:fe:f0:d3:e7:69:81)
2021-07-04 20:03:46.085 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=4b:00:22:47:e2:70:2b:4b)
2021-07-04 20:03:46.086 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=4b:00:22:47:e2:70:8e:8d)
2021-07-04 20:03:46.089 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=4b:00:22:47:df:a2:88:26)
2021-07-04 20:03:46.091 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=4b:00:22:47:e2:70:e2:dc)
2021-07-04 20:03:46.092 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=14:ff:fe:ed:71:cb:aa:fe)
2021-07-04 20:03:46.094 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=4b:00:22:47:df:a2:8c:6e)
2021-07-04 20:03:46.096 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=4b:00:22:47:df:a2:34:c9)
2021-07-04 20:03:46.099 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=14:ff:fe:ed:71:cb:eb:30)
2021-07-04 20:03:46.100 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=14:ff:fe:ed:71:cb:65:3b)
2021-07-04 20:03:46.102 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=14:ff:fe:ed:77:6e:84:0a)
2021-07-04 20:03:46.107 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=14:ff:fe:ed:77:23:12:5f)
2021-07-04 20:03:48.787 server zigpy_znp.znp.security WARNING Skipping hashed link key ... (tx: 4851, rx: 0) for unknown device 08:6b:d7:ff:fe:5d:0b:9b
2021-07-04 20:03:51.027 server main INFO TCLK seed: ...

The interesting thing is that the IEEE addresses seem to be "shifted":

2021-07-04 20:03:46.100 server zigpy_znp.znp.security WARNING Ignoring invalid address manager entry: AddrMgrEntry(type=<AddrMgrUserType.Assoc: 1>, nwkAddr=0xFFFE, extAddr=14:ff:fe:ed:71:cb:65:3b)

I have no such IEEE address, but I do have 84:2e:14:ff:fe:ed:71:cb.

Might this be part of the problem?
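The "shifted" observation can be checked mechanically. Below is a minimal Python sketch, not part of zigpy-znp, that tests whether a suspect address from the log starts with the tail of a known device address. The helper names and the 4-byte minimum overlap are assumptions chosen for illustration.

```python
from typing import Optional

# Hypothetical threshold: require at least this many matching bytes so that
# a coincidental one-byte match is not reported as a shift.
MIN_OVERLAP = 4


def parse_eui64(text: str) -> bytes:
    """Convert 'aa:bb:...' notation into 8 raw bytes, in printed order."""
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 8:
        raise ValueError(f"not an EUI-64 address: {text!r}")
    return raw


def find_shift(known: str, suspect: str) -> Optional[int]:
    """Return how many bytes `suspect` appears shifted relative to `known`.

    A shift of n means the last (8 - n) bytes of `known` reappear as the
    first (8 - n) bytes of `suspect`. Returns None if no such overlap exists.
    """
    k = parse_eui64(known)
    s = parse_eui64(suspect)
    for shift in range(1, len(k) - MIN_OVERLAP + 1):
        if k[shift:] == s[: len(k) - shift]:
            return shift
    return None


# Addresses taken from the log and the comment above.
print(find_shift("84:2e:14:ff:fe:ed:71:cb", "14:ff:fe:ed:71:cb:65:3b"))
```

For the pair quoted above, the suspect entry lines up with the real address after dropping its first two bytes, which is consistent with the entry having been read from a misaligned offset.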

puddly commented 3 years ago

Yeah, possibly. This is a firmware bug though, not something I can fix in zigpy-znp: https://github.com/zigpy/zigpy-znp/issues/65

Are these devices still on your network? I've only noticed this happening when devices move parents from the coordinator and these are just "stale" entries.

dumpfheimer commented 3 years ago

Thanks for the link!

If I understand correctly, the cause of this issue has not quite been found yet. Is there a possible workaround? Could I, e.g., delete NVRAM and do a network restore?

puddly commented 3 years ago

The network restore overwrites all of those NVRAM entries but you can do a full erase and then a restore if you want, it shouldn't hurt.

Before you do that, if you're comfortable posting it here, could you perform an NVRAM backup?
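The backup, erase, and restore sequence discussed here could be sketched as follows. This Python snippet only builds and prints the command lines and does not touch the radio. The tool module names and the `-o` / `-i` flags are written from memory of the zigpy-znp command-line tools and should be treated as assumptions; confirm them with `--help` before running anything. The serial port path and file names are placeholders.

```python
import shlex
from typing import List


def maintenance_commands(
    port: str,
    network_backup: str = "network_backup.json",
    nvram_backup: str = "nvram_backup.json",
) -> List[List[str]]:
    """Build the command sequence: back up, erase NVRAM, restore.

    Module names and flags are assumed, not verified against a specific
    zigpy-znp release.
    """
    py = ["python", "-m"]
    return [
        # 1. Save the network settings (keys, PAN ID, channel, children).
        py + ["zigpy_znp.tools.network_backup", port, "-o", network_backup],
        # 2. Save a raw NVRAM dump for later analysis.
        py + ["zigpy_znp.tools.nvram_read", port, "-o", nvram_backup],
        # 3. Erase NVRAM so that stale entries are dropped.
        py + ["zigpy_znp.tools.nvram_reset", port],
        # 4. Write the network settings back to the clean radio.
        py + ["zigpy_znp.tools.network_restore", port, "-i", network_backup],
    ]


# Dry run: print the commands instead of executing them.
for cmd in maintenance_commands("/dev/ttyACM0"):
    print(shlex.join(cmd))
```

ZHA has to be stopped while these tools run, since only one process can hold the serial port at a time.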

MattWestb commented 3 years ago

Puddly also has an email address in his profile header, though I don't know whether he uses it. That is much better than posting network keys on forums. But if you are re-forming the network there is no danger, since you will be using a new network key :-))

dumpfheimer commented 3 years ago

I saw the key but my understanding was it is a key that is valid only for the one - unknown - device. Was that incorrect?

dumpfheimer commented 3 years ago

I have an NVRAM backup and will send it to you @puddly by email - tomorrow. If anything else is of interest let me know.

If this works it might be worth considering making a backup + immediate restore available in the UI as part of some maintenance/troubleshooting process.

puddly commented 3 years ago

I saw the key but my understanding was it is a key that is valid only for the one - unknown - device. Was that incorrect?

Every APS link key in Z-Stack is computed from the TCLK seed, a device's IEEE address, and a shift of 0-15, making it possible to determine every current and future APS link key for every device (if a device uses APS encryption). The computation is entirely reversible too (for whatever reason), so it's actually enough to leak one key and one IEEE address (which isn't really private) or two keys to compute the internal seed.

Now, whether or not that means anything is another matter, since doing anything with this information requires someone to physically stand outside your house to determine the IEEE addresses of the Zigbee 3.0 devices on your network potentially using APS link keys. If your threat model includes an attacker standing outside your home, they're in sniffing range and could have picked all these keys up anyways if you didn't join your devices with Zigbee 3.0 install codes.
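To illustrate why a reversible derivation leaks the seed, here is a toy Python model. It assumes the key is formed by rotating the 16-byte seed by the shift and XORing it with the 8-byte IEEE address repeated twice. That construction is an assumption made for illustration and may not match the actual Z-Stack computation; the seed value is made up.

```python
def rotate(data: bytes, n: int) -> bytes:
    """Rotate `data` left by n positions (negative n rotates right)."""
    n %= len(data)
    return data[n:] + data[:n]


def derive_key(seed: bytes, ieee: bytes, shift: int) -> bytes:
    """Toy derivation: rotate the seed, then XOR with the doubled IEEE address."""
    assert len(seed) == 16 and len(ieee) == 8 and 0 <= shift <= 15
    return bytes(a ^ b for a, b in zip(rotate(seed, shift), ieee * 2))


def recover_seed(key: bytes, ieee: bytes, shift: int) -> bytes:
    """Invert derive_key: undo the XOR, then undo the rotation."""
    rotated = bytes(a ^ b for a, b in zip(key, ieee * 2))
    return rotate(rotated, -shift)


# Made-up seed; the IEEE address is the "unknown device" from the log above.
seed = bytes(range(16))
ieee = bytes.fromhex("086bd7fffe5d0b9b")
leaked_key = derive_key(seed, ieee, shift=3)

# With one leaked key and its IEEE address, an attacker who does not know
# the shift is left with only 16 candidate seeds, one of which is correct.
candidates = [recover_seed(leaked_key, ieee, s) for s in range(16)]
print(seed in candidates)
```

In this model nothing secret is mixed in beyond the seed itself, so every step can be undone by anyone who holds a key and the matching address.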

If this works it might be worth considering making a backup + immediate restore available in the UI as part of some maintenance/troubleshooting process.

It's some sort of bug with Z-Stack. I've never personally had this glitch impact any of my Z-Stack networks using the same hardware as you so it'd probably be better to put in the effort to identify what can trigger this bug instead of making the backup/restore hack (if it works) more user-friendly.

dumpfheimer commented 3 years ago

Ok if lights randomly turn on and off while someone is standing outside my house I will consider forming a new network.

But seriously, thanks for the clarification. Always interesting to learn how things work. And next time I read "seed" I will hopefully be aware of its significance.

I totally agree that it would be best to identify the underlying root cause and making it more user friendly may also not be the right way to go. But if this actually works it might be worth mentioning it in FAQ/Troubleshooting/Docs. This has been haunting me for months now 🤪

dumpfheimer commented 3 years ago

By the way, I have successfully reset NVRAM and restored the network backup. Everything seems to have come back up as if nothing ever changed - beautiful. I will send you the NVRAM bin tomorrow, I cannot SCP it with my phone and I am "trying to fall asleep"

dumpfheimer commented 3 years ago

So, overnight I lost 4 Aqara Window Sensors - but no lights. I am having difficulties pairing them again, though. Will try to find out what's going on.

dumpfheimer commented 3 years ago

Just checking in: I struggled a bit with re-pairing those sensors. Not quite sure why they were lost either. I believe my expectation that "joining devices" would also use bulbs to detect new devices was wrong? When I started "joining over specific devices" that were near the sensor, things worked better.

Nevertheless, since then everything seems to be working fine! Will come back with more updates.

MattWestb commented 3 years ago

The "joining devices through this" option works great, but some end devices don't like some routers. I have many Aqara sensors and some of them don't like being forced to join through IKEA lights. Version I of the weather sensor is OK but version II is tricky. I normally use an old HA 1.2 router for joining them, then "kill" the router and press the test button so the sensor moves to a new parent near its final location. That normally works OK (but not always).

Adminiuga commented 3 years ago

So, over night I lost 4 Aqara Window Sensors

That's no news. Google for "aqara dropping off the network" and Aqara-compatible routers. Joining some Aqara devices, like water sensors and door sensors, was always tricky, as it needs more or less exact timing. I usually get them on the 4th or 5th try. Also, Aqara devices do not look for the strongest signal; they just try to pair to the first beacon they hear, which is usually not the best parent. I.e., don't open the entire network for joining; open specific compatible routers instead.

dumpfheimer commented 3 years ago

It was to me, I hadn't lost a window sensor in months. I can live with pairing being a pain but only if I don't have to do it every other day.

Overnight, HA did not show anything as unavailable, but another window sensor was not updating. I "fixed" it by briefly pressing the reset button, which I believe makes it choose another parent. I have not noticed anything malfunctioning. But I do have one remote control (an IKEA 5-button remote) where it says the power sensor is unavailable. Is this to be expected? The remote seems to be working fine.

MattWestb commented 3 years ago

For IKEA remotes, do a "reconfigure" from the device card, wake the remote up, and then send the commands.

IKEA remotes work OK but are lazy about reporting unless the reporting configuration is fixed. It is often not set up correctly after pairing or after making a new binding, so it's best to reconfigure them after creating a new binding to a group as well.

dumpfheimer commented 3 years ago

Well this week has passed pretty quickly.

I have had no mentionable problems with my ZigBee devices! This is absolutely genius.

Thanks for your help and thanks for all the work you put into this, greatly appreciated!

Would you be willing to accept a PR with a doc update mentioning the possibility of making a backup and restoring for troubleshooting?

puddly commented 3 years ago

That's good to hear! Maybe the corrupted NVRAM entries are negatively affecting Z-Stack somehow, as I do recall seeing the coordinator broadcasting the 0xFFFE devices in Wireshark in a "parent announce". No clue what repercussions that can have though, if anything they would appear like unreachable child devices.

Would you be willing to accept a PR with a doc update mentioning the possibility of making a backup and restoring for troubleshooting?

Sure, thank you.