zigpy / zigpy-deconz

A library which communicates with Deconz radios for zigpy
GNU General Public License v3.0
85 stars 20 forks source link

Zigpy/Conbee/ZHA not marking SOME devices without power as "Unavailable" #166

Open ahd71 opened 2 years ago

ahd71 commented 2 years ago

I'm running Home Assistant with ZHA and a Conbee II and honestly I'm not sure where the problem is. I'm starting here and hoping for a referral if needed :-)

I have 100+ lights, most works fine, sometimes though when family turns physical switches off I get slightly inconsistent behavior. Some bulbs shows as "Unavailable" after some time in HA (which I'm expect) and others don't (the "issue").

I narrow the issue down to a minimum to demonstrate. A fixture with three zigbee lights, all three physically turned off (no power).

DEVICE 1: 0x4617 - changed to "Unavailable" in HA after some time (as expected) DEVICE 2: 0x7fdf - still shows "On" after several hours - and of course flipping on/off won't make any difference (bounce back to on) DEVICE 3: 0x1c10 - for reference, another lightbulb which works as expected and is powered.

From the logs these messages keeps repeating - do note that in 0x4617 zigpy responds with "Delivery error" which I assume ZHA/HA picks up and mark it as unavailable - but for 0x7fdf it doesn't so HA/ZHA continues to believe it is alive.

(possibly ZHA would be able to asynchronously notice that there isn't any reply coming and also mark it unavailable after some time - or for zigpy/conbee to send a proper "failed to send" message back.


DEVICE 1: Correctly marking as Unavailable:

2021-09-20 13:47:56 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (17, 250, 0, <DeconzAddressEndpoint address_mode=2 address=0x4617 endpoint=0>, 0, <ZDOCmd.Mgmt_Lqi_req: 0x0031>, 0, b'\xf9\x00', 2, 0) 2021-09-20 13:48:06 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0xfa 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x4617 endpoint=0>, status: 0xd0 2021-09-20 13:48:06 DEBUG (MainThread) [zigpy.device] [0x4617] Delivery error for seq # 0xf9, on endpoint id 0 cluster 0x0031: message send failure 2021-09-20 13:48:07 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (17, 254, 0, <DeconzAddressEndpoint address_mode=2 address=0x4617 endpoint=0>, 0, <ZDOCmd.Mgmt_Lqi_req: 0x0031>, 0, b'\xfd\x00', 2, 0) 2021-09-20 13:48:17 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0xfe 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x4617 endpoint=0>, status: 0xd0 2021-09-20 13:48:17 DEBUG (MainThread) [zigpy.device] [0x4617] Delivery error for seq # 0xfd, on endpoint id 0 cluster 0x0031: message send failure 2021-09-20 13:48:18 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (17, 9, 0, <DeconzAddressEndpoint address_mode=2 address=0x4617 endpoint=0>, 0, <ZDOCmd.Mgmt_Lqi_req: 0x0031>, 0, b'\x08\x00', 2, 0) 2021-09-20 13:48:28 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0x09 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x4617 endpoint=0>, status: 0xd0 2021-09-20 13:48:28 DEBUG (MainThread) [zigpy.device] [0x4617] Delivery error for seq # 0x08, on endpoint id 0 cluster 0x0031: message send failure


DEVICE 2: Not marking it as unavailable even though it is having no power since 5+ hours:

2021-09-20 13:47:36 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (17, 244, 0, <DeconzAddressEndpoint address_mode=2 address=0x7FDF endpoint=0>, 0, <ZDOCmd.Mgmt_Lqi_req: 0x0031>, 0, b'\xf3\x00', 2, 0) 2021-09-20 13:47:36 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0xf4 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x7fdf endpoint=0>, status: 0x00 2021-09-20 13:47:42 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (17, 246, 0, <DeconzAddressEndpoint address_mode=2 address=0x7FDF endpoint=0>, 0, <ZDOCmd.Mgmt_Lqi_req: 0x0031>, 0, b'\xf5\x00', 2, 0) 2021-09-20 13:47:42 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0xf6 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x7fdf endpoint=0>, status: 0x00 2021-09-20 13:47:48 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (17, 248, 0, <DeconzAddressEndpoint address_mode=2 address=0x7FDF endpoint=0>, 0, <ZDOCmd.Mgmt_Lqi_req: 0x0031>, 0, b'\xf7\x00', 2, 0) 2021-09-20 13:47:48 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0xf8 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x7fdf endpoint=0>, status: 0x00 2021-09-20 14:17:53 DEBUG (MainThread) [homeassistant.components.zha.core.device] [0x7FDF](TRADFRI bulb GU10 W 400lm): Attempting to checkin with device - missed checkins: 1 2021-09-20 14:17:53 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (20, 244, 0, <DeconzAddressEndpoint address_mode=2 address=0x7FDF endpoint=1>, 260, 0, 1, b'\x00\xf3\x00\x04\x00', 2, 0) 2021-09-20 14:17:53 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0xf4 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x7fdf endpoint=1>, status: 0x00

DEVICE 3: And for reference another device in the same room which is online and working as expected:

2021-09-20 15:28:26 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (20, 253, 0, <DeconzAddressEndpoint address_mode=2 address=0x1C10 endpoint=1>, 260, 6, 1, b'\x00\xfc\x00\x00\x00', 2, 0) 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0xfd 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x1c10 endpoint=1>, status: 0x00 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy_deconz.api] APS data indication response: [31, <DeviceState.APSDE_DATA_REQUEST_SLOTS_AVAILABLE|2: 34>, <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x0000>, 1, <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x1c10>, 1, 260, 6, b'\x18\xfc\x01\x00\x00\x00\x10\x00', 0, 175, 255, 224, 104, 211, 2, -74] 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy.zcl] [0x1c10:1:0x0006] ZCL deserialize: <ZCLHeader frame_control=<FrameControl frame_type=GLOBAL_COMMAND manufacturer_specific=False is_reply=True disable_default_response=True> manufacturer=None tsn=252 command_id=Command.Read_Attributes_rsp> 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy_deconz.api] 'aps_data_indication' response from <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x1c10>, ep: 1, profile: 0x0104, cluster_id: 0x0006, data: b'18fc010000001000' 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (20, 255, 0, <DeconzAddressEndpoint address_mode=2 address=0x1C10 endpoint=1>, 260, 8, 1, b'\x00\xfe\x00\x00\x00', 2, 0) 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0xff 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x1c10 endpoint=1>, status: 0x00 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy_deconz.api] APS data indication response: [31, <DeviceState.APSDE_DATA_REQUEST_SLOTS_AVAILABLE|2: 34>, <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x0000>, 1, <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x1c10>, 1, 260, 8, b'\x18\xfe\x01\x00\x00\x00 \xfe', 0, 175, 255, 224, 104, 211, 2, -75] 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy.zcl] [0x1c10:1:0x0008] ZCL deserialize: <ZCLHeader frame_control=<FrameControl frame_type=GLOBAL_COMMAND manufacturer_specific=False is_reply=True disable_default_response=True> manufacturer=None tsn=254 command_id=Command.Read_Attributes_rsp> 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy_deconz.api] 'aps_data_indication' response from <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x1c10>, ep: 1, profile: 0x0104, cluster_id: 0x0008, data: b'18fe0100000020fe' 2021-09-20 15:28:26 DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0x1C10:1:0x0008]: received attribute: 0 update with value: 254 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (28, 1, 0, <DeconzAddressEndpoint address_mode=2 address=0x1C10 endpoint=1>, 260, 768, 1, b'\x00\x00\x00\x08\x00\x07\x00\x03\x00\x04\x00\x02@', 2, 0) 2021-09-20 15:28:26 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0x01 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x1c10 endpoint=1>, status: 0x00 2021-09-20 15:28:27 DEBUG (MainThread) [zigpy_deconz.api] APS data indication response: [49, <DeviceState.APSDE_DATA_REQUEST_SLOTS_AVAILABLE|2: 34>, <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x0000>, 1, <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x1c10>, 1, 260, 768, b'\x18\x00\x01\x08\x00\x000\x01\x07\x00\x86\x03\x00\x00!t\xb3\x04\x00\x00!\x8aL\x02@\x86', 0, 175, 255, 224, 104, 211, 2, -75] 2021-09-20 15:28:27 DEBUG (MainThread) [zigpy.zcl] [0x1c10:1:0x0300] ZCL deserialize: <ZCLHeader frame_control=<FrameControl frame_type=GLOBAL_COMMAND manufacturer_specific=False is_reply=True disable_default_response=True> manufacturer=None tsn=0 command_id=Command.Read_Attributes_rsp> 2021-09-20 15:28:27 DEBUG (MainThread) [zigpy_deconz.api] 'aps_data_indication' response from <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x1c10>, ep: 1, profile: 0x0104, cluster_id: 0x0300, data: b'18000108000030010700860300002174b3040000218a4c024086' 2021-09-20 15:30:45 DEBUG (MainThread) [zigpy.neighbor] [0xb7d7] request status: Status.SUCCESS. response: Neighbors(entries=15, start_index=0, neighbor_table_list=[Neighbor(extended_pan_id=00:21:2e:ff:ff:04:23:36, ieee=00:0b:57:ff:fe:8a:0d:e3, nwk=0x0567, device_type=<DeviceType.Router: 1>, rx_on_when_idle=<RxOnWhenIdle.On: 1>, relationship=<RelationShip.Sibling: 2>, reserved1=0, permit_joining=<PermitJoins.Unknown: 2>, reserved2=0, depth=15, lqi=163), Neighbor(extended_pan_id=00:21:2e:ff:ff:04:23:36, ieee=00:0d:6f:ff:fe:1d:7d:b1, nwk=0x1C10, device_type=<DeviceType.Router: 1>, rx_on_when_idle=<RxOnWhenIdle.On: 1>, relationship=<RelationShip.Sibling: 2>, reserved1=0, permit_joining=<PermitJoins.Unknown: 2>, reserved2=0, depth=15, lqi=112), Neighbor(extended_pan_id=00:21:2e:ff:ff:04:23:36, ieee=d0:cf:5e:ff:fe:c0:7e:78, nwk=0x207A, device_type=<DeviceType.Router: 1>, rx_on_when_idle=<RxOnWhenIdle.On: 1>, relationship=<RelationShip.Sibling: 2>, reserved1=0, permit_joining=<PermitJoins.Unknown: 2>, reserved2=0, depth=15, lqi=125)]) 2021-09-20 15:31:13 DEBUG (MainThread) [zigpy.neighbor] [0x207a] request status: Status.SUCCESS. response: Neighbors(entries=17, start_index=0, neighbor_table_list=[Neighbor(extended_pan_id=00:21:2e:ff:ff:04:23:36, ieee=00:0b:57:ff:fe:8a:0d:e3, nwk=0x0567, device_type=<DeviceType.Router: 1>, rx_on_when_idle=<RxOnWhenIdle.On: 1>, relationship=<RelationShip.Sibling: 2>, reserved1=0, permit_joining=<PermitJoins.Unknown: 2>, reserved2=0, depth=15, lqi=144), Neighbor(extended_pan_id=00:21:2e:ff:ff:04:23:36, ieee=00:0d:6f:ff:fe:1d:7d:b1, nwk=0x1C10, device_type=<DeviceType.Router: 1>, rx_on_when_idle=<RxOnWhenIdle.On: 1>, relationship=<RelationShip.Sibling: 2>, reserved1=0, permit_joining=<PermitJoins.Unknown: 2>, reserved2=0, depth=15, lqi=118), Neighbor(extended_pan_id=00:21:2e:ff:ff:04:23:36, ieee=68:0a:e2:ff:fe:55:57:3f, nwk=0x3AC7, device_type=<DeviceType.Router: 1>, rx_on_when_idle=<RxOnWhenIdle.On: 1>, relationship=<RelationShip.Sibling: 2>, reserved1=0, permit_joining=<PermitJoins.Unknown: 2>, reserved2=0, depth=15, lqi=68)]) 2021-09-20 15:34:15 DEBUG (MainThread) [zigpy_deconz.api] APS data indication response: [36, <DeviceState.APSDE_DATA_REQUEST_SLOTS_AVAILABLE|2: 34>, <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x0000>, 1, <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x1c10>, 1, 260, 768, b'\x08c\n\x03\x00!t\xb3\x04\x00!\x8aL', 0, 175, 255, 73, 106, 211, 2, -76] 2021-09-20 15:34:15 DEBUG (MainThread) [zigpy.zcl] [0x1c10:1:0x0300] ZCL deserialize: <ZCLHeader frame_control=<FrameControl frame_type=GLOBAL_COMMAND manufacturer_specific=False is_reply=True disable_default_response=False> manufacturer=None tsn=99 command_id=Command.Report_Attributes> 2021-09-20 15:34:15 DEBUG (MainThread) [zigpy.zcl] [0x1c10:1:0x0300] ZCL request 0x000a: [[Attribute(attrid=3, value=<TypeValue type=uint16_t, value=45940>), Attribute(attrid=4, value=<TypeValue type=uint16_t, value=19594>)]] 2021-09-20 15:34:15 DEBUG (MainThread) [zigpy.zcl] [0x1c10:1:0x0300] Attribute report received: current_x=45940, current_y=19594 2021-09-20 15:34:15 DEBUG (MainThread) [zigpy_deconz.api] 'aps_data_indication' response from <DeconzAddress address_mode=ADDRESS_MODE.NWK address=0x1c10>, ep: 1, profile: 0x0104, cluster_id: 0x0300, data: b'08630a03002174b30400218a4c' 2021-09-20 15:34:15 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_request (20, 143, 0, <DeconzAddressEndpoint address_mode=2 address=0x1C10 endpoint=1>, 260, 768, 1, b'\x18c\x0b\n\x00', 2, 0) 2021-09-20 15:34:15 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0x8f 'aps_data_confirm' for <DeconzAddressEndpoint address_mode=ADDRESS_MODE.NWK address=0x1c10 endpoint=1>, status: 0x00

Any advice, or things I could troubleshoot to narrow it down further?

Thanks in advance!

BR/Anders

ahd71 commented 2 years ago

Another observation is that in the neighbors table the two first devices ("Gustav tak 1 and 3") are still having a number of records in the table, all with depth 15 (which is another concern that I have routing issues in the network, e.g parent of a parent of itself...?)

INDEX() device_ieee name_by_user ieee name_by_user Relationship (text) 1 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 00:0b:57:ff:fe:8a:0d:e3 Fönsterlampa 1 MA sovrum neighbor is a sibling 2 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 00:0b:57:ff:fe:b9:a6:b9 Fönsterlampa 2 MA sovrum neighbor is a sibling 3 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 00:0b:57:ff:fe:db:1f:ef Gustavs fönster neighbor is a sibling 4 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 00:0d:6f:ff:fe:9c:1a:33 Blombelysning Skrivbord neighbor is a sibling 5 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 00:0d:6f:ff:fe:9c:4e:77 Outlet Studiolampor neighbor is a sibling 6 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 00:0d:6f:ff:fe:f3:23:69 Förstärkare neighbor is a sibling 7 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 08:6b:d7:ff:fe:1d:92:d8 Taklampsramp kök 3 neighbor is a sibling 8 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 08:6b:d7:ff:fe:2b:11:36 Taklampsramp kök 1 neighbor is a sibling 9 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 84:fd:27:ff:fe:9f:19:c2 Övre toalett neighbor is a sibling 10 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 84:fd:27:ff:fe:aa:c6:b0 Taklampa 1 (Gillestuga) neighbor is a sibling 11 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 neighbor is a sibling 12 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 90:fd:9f:ff:fe:d9:ca:41 Vardagsrum Fönster 1 neighbor is a sibling 13 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 d0:cf:5e:ff:fe:03:70:a0 Vardagsrum Fönster 5 neighbor is a sibling 14 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 d0:cf:5e:ff:fe:06:bb:83 Vardagsrum Fönster 3 neighbor is a sibling 15 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 d0:cf:5e:ff:fe:c0:81:b4 Vardagsrum Fönster 4 neighbor is a sibling 16 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 d0:cf:5e:ff:fe:c8:4c:a6 Taklampa (S) neighbor is a sibling 17 00:0b:57:ff:fe:db:1f:ef Gustavs fönster 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 neighbor is a sibling 18 90:fd:9f:ff:fe:13:c7:9d Gustav Tak 2 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 neighbor is a sibling 19 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 00:0b:57:ff:fe:8a:43:88 Pianolampa neighbor is a sibling 20 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 00:0b:57:ff:fe:a5:05:d7 biolp3 neighbor is a sibling 21 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 00:0b:57:ff:fe:b9:a6:b9 Fönsterlampa 2 MA sovrum neighbor is a sibling 22 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 00:0b:57:ff:fe:d5:32:d2 Gustav Tak 3 neighbor is a sibling 23 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 80:4b:50:ff:fe:5e:9a:21 Vägglampa Diskbänk Vänster neighbor is a sibling 24 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 80:4b:50:ff:fe:24:05:34 Bänkbelysning Kök 2 neighbor is a sibling 25 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 80:4b:50:ff:fe:51:6a:b0 Bänkbelysning Kök 1 neighbor is a sibling 26 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 84:fd:27:ff:fe:9f:19:c2 Övre toalett neighbor is a sibling 27 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 90:fd:9f:ff:fe:13:c7:9d Gustav Tak 2 neighbor is a sibling 28 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 90:fd:9f:ff:fe:32:03:d6 Taklampsramp kök 2 neighbor is a sibling 29 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 bc:33:ac:ff:fe:84:1d:04 Bänkbelysning Kök 3 neighbor is a sibling 30 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 d0:cf:5e:ff:fe:1d:3f:be Matbord vrum neighbor is a sibling 31 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 d0:cf:5e:ff:fe:71:99:1c Taklampsramp kök 4 neighbor is a sibling 32 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 d0:cf:5e:ff:fe:c0:7e:78 Marias Sänglampa neighbor is a sibling 33 90:fd:9f:ff:fe:13:ed:58 Gustav Tak 1 d0:cf:5e:ff:fe:c8:4c:a6 Taklampa (S) neighbor is a sibling

MattWestb commented 2 years ago

I have doing lockup your MAC addressees and all router is Silicon labs or older companys that have being bayed of them (d0:cf:5e:ff:fe:03:70:a0 Energy Micro AS) and Silicon lab have one nasty bug in the Zigbee stack that is fixed in version 6.7.7.0 but very likely your devices is not having that fix.

Most common is IKEA routers in large networks is getting the buffer locked and cant communicating until being re powered. The very bad thing the trigger is parent announcement that is triggering the bug if the router is getting enough of that messages (and the buffet is being locked and cant communicating in the network).

Its being known for around one year and Silabs have doing one fix for it but the device manufactures have not implanting it but some have saying they is working on it (IKEA) and all other with Silabs chips have not doing any thing wot i knowing.

I think the best is not to power of and on routers and end devices if its possible so not getting devices broadcasting the bad messages that is triggering the bug in the router devices.

One more thing its looks like deCONZ is more likely getting this problems but is only one feeling and then they is having many users with 100 + devices (that ZHA and Z2M also having bit i dont knowing haw large and many system that is very large with Silabs routers in them).

Adminiuga commented 2 years ago
zigpy_deconz.api] Request id: 0xff 'aps_data_confirm' for , status: 0x00

Means that request was delivered successfuly. You may be confusing it with another device.

ahd71 commented 2 years ago

Thanks @MattWestb for your detailed response. I'll send a query to IKEA and see if I get any response (I usually don't get a response for ny two previous questions/bug reports to them).

ahd71 commented 2 years ago

@Adminiuga Interesting. As both device 1 and 2 is physically off it cannot be delivered so wonder who acknowledge on their behalf. Secondly, why the difference in behavior them between. For the 1st device it always throws an "Delivery error" AFTER the "aps_data_confirm" but that doesn't happen with device 2. Strange ;-)

ahd71 commented 2 years ago

Oh, sorry you are right. The logs are indeed mixed up. Let me fix that (when I'm at a computer).

ahd71 commented 2 years ago

updated the first log files with code block (as some parts were missing otherwise) but I'm not too familiar with github and still the logs are not as readable as in plain text with line breaks in the right position.

Adminiuga commented 2 years ago

You can match aps_data_confirm to aps_data_request by the request id number. E.g.

Command.aps_data_request (17, 9, 0, , 0, <ZDOCmd.Mgmt_Lqi_req: 0x0031>, 0, b'\x08\x00', 2, 0)
2021-09-20 13:48:28 DEBUG (MainThread) [zigpy_deconz.api] Request id: 0x09 'aps_data_confirm' for , status: 0xd0

Second number in Command.aps_data_request is the request id of 9 -- in decimal format. Request id 0x09 aps data confirm -- is the request id in hexadecimal. Any status different from 0x00 indicates there was an error. So request number 9 did fail, but any all other seem to be successful.

ahd71 commented 2 years ago

Thankyou, will pay attention to it - still I find it very strange that lights that are physically turned off (yes, family...not me!) behaves differently (in this case a fixture with two lightbulbs where one of them not goes "Unavailable" but the other does - like someone responding on their behalf or assumes it is alive or similar. Will investigate a bit further myself now with some new knowledge about the acknowledgements.

ahd71 commented 2 years ago

For device #2 I get repeated messages like this every 20-30 minutes so definitely HA notices that something is wrong but still does not flag it as "Unavailable". 2021-09-20 10:53:32 DEBUG (MainThread) [homeassistant.components.zha.core.device] [0x7FDF](TRADFRI bulb GU10 W 400lm): Attempting to checkin with device - missed checkins: 1

ahd71 commented 2 years ago

It always say "missed checkins: 1" (not incrementing the number). I wonder if that can be a cause? During troubleshooting another routing issue I have tweaked the interval for scanning the network and there seems to be one scan in-between each checkin - will reset the interval and test during today.

Hypothesis to be tested: That topology_scan_period (I had it configured to 30 minutes for last week or so) needs to be larger than the checkin interval (which also was set to 30 minutes (1800 seconds in the ZHA configuration card). As the intervals according to the log isn't any exact science of timing precision there might be cases where it is sometimes above, sometimes below the threshold leading to this. Some go-back-to the logs potential evidences is that the device #1 (that shows unavailble) indeed has a missed checkin count to 2 as maximum. Let's see what the test gives me!

2021-09-20 09:31:45 DEBUG (MainThread) [zigpy.topology] Scanning neighbors of 00:0b:57:ff:fe:d5:32:d2/0x7fdf device

MattWestb commented 2 years ago

The missed checkin is very likely coming then the device have its buffer locked / cached and cant answer the request from the coordinator (the "Silabs bug").

From Zigbee EmberZNet SDK 6.7.9.0 GA

So without the fix its one time bomb if having one large network and many devices is being re powered then the routers buffer is being filed and not released = the device is being killed then it cant communicating in the network.

One ZHA user with prity large network was power off half of its lights with the circuit barker and 1/4 of the remaining lights was killed then power it back on then getting many parent accouterments and replaying to then.

Also our ZHA devs have bombing the network with Parent Announce command and can killing it with some minutes so its more then likely the problem you is getting with the bug and your setup.

I can recommending putting somthing the light power switch so it cant being toggled without removing the front and taking the blocking away (like one "radergummi") so not getting devices bring re powered by "accidents".

PS I have writing to IKEA devs but i think i have not finding the right person for this kind of technical problems then i have not getting any response also then writing in Swedish to then.

ahd71 commented 2 years ago

Thanks @MattWestb . I will also contact IKEA again to give it another push (but have the same experience as you, that it doesn't reach the right persons in the organization)

ahd71 commented 2 years ago

I've written to IKEA now, will let you know if I get any response!

ahd71 commented 2 years ago

I can also share that changing the topology_scan_period from 30 minutes to default (4-5h?) did not change the behavior. In a fixture with two bulbs one still reports Unavailable some few hours after I intentionally turned them off physically and the others is still "off" (or "on" whatever the last known state was)

In this case, is it another device (not this fixture, but it's parent or child) that has their buffers full that causes this issue?

(in that case, and unless IKEA fix the firmware, I need to power cycle and slowly turn on 96 bulbs/outlets/repeaters from IKEA to restore stable state until next massive buffer overflow attack - i think this was caused when the utility company changed the electricity meter some two months ago and obviously cut the power to the entire house - but at the same time I have also extended the installation with several new devices so hard to know the cause).

MattWestb commented 2 years ago

I was finding one interesting bug fix in Rasp/CornBee II:

Its looks like DE coordinators have making wrong response for parent accouterments and they have fixing it in the last release of Rasp/CornBee II (but not the first gen devices) so it still can being that there coordinators is / was making more problems.

I can confirming that IKEA GW is also having the same problems and normally one or 2 outlets is stop working after firmware update then the GW its rebooting and need re power the outlets for getting then working OK.

MattWestb commented 2 years ago

I think one complete shutdown is not so bad but re power some devices now and then is bad then the command is sent with broadcast so its spamming all routers in the network (I can being wrong about the broadcast but i think all "hallo" after going back in the network is broadcasts). DEs master dev was not having any problems in his home until hi moved one float panel (=power down moving and power it on) and was getting some zombie devices and re power them and some other was getting zombies and so on.

ahd71 commented 2 years ago

I did update to that conbee II firmware day after release but I have not powercycled all lights since that if that is relevant. I will definitely try that! Thanks for all valuable insights!

ahd71 commented 2 years ago

I've read up on the "Silabs bug" and indeed I think that it matches the symptoms I have. Apart from contacting IKEA via email I also wrote a comment in R/TRADFRI hoping that u/TRADFRI will pick it up.

MattWestb commented 2 years ago

The "Silabs bug" pinpointed in deCONZ forum by my and more ZHA users and verified by ZHA devs and no working work around is being found for it.

Its being posted in other forums also u/TRADFRI but i dont knowing if IKEA have reacting on the posting.

u/TRADFRI have accouterments that IKEA is on the hunt of the connection issue for some mounths ago but i dont knowing if its the same bug they is trying to fixing. IKEA have making and certifying all modern controllers but was having problems with them (group binding was not working) and they have deleting all in the test feed for on mounth ago. They have certified one new set but they is not released in the test feed but they is coming. IKEA need first updating the controllers if doing braking things so the Zigbee Light Link is working OK then they is using it for there GW and also if using there products "stand alone" and in the second place is the normal Zigbee 3 functionality for them (but i think its pity high on the list then very much devices is being used of the community and IKEA is knowing that).

It can being that Silabs is doing one patch to the SDK for IKEA so they dont need redoing all devices code then its being easier but it cam also being that they is rewriting the code for the devices in EZSP 6.7.7.0 (6.7.9.0 is the golden version for the old EFR32MG1 chips) or newer but the original Zigbee chip is little week for the newer SDKs so not knowing.

One interesting thing is that all old ZLL devices have getting updates to Zigbee 3 only the old motion sensor and the first gen CWS not (the last have getting one refresh for some mounth ago) and all devices have getting bug fixes without the old motion sensor and the old dimmer the last year and that is not so bad for one furniture builder then some other players dont fix anything after releasing there products and the customer have baying them.

ahd71 commented 2 years ago

After reading the release notes and comparing the dates of posts by u/TRADFRI they were hunting a "device unavailable issue", it was said to have a high priority, it got delayed, and eventually a release that have a line about that was released so possibly that they fixed ONE case but not this case. Transparency of what they IKEA are doing (or not doing) would benefit both customers and IKA them self. I understand it takes some time to be active in forums but believe it is well spent time to get input on things not working, beta testers, and keeping the customer happy (=sell more)

MattWestb commented 2 years ago

I think IKEA shall getting one channel to the community for test and feed back and advanced problems shutting. Its being one win win situation and is good for both parts. deCONZ is political not OK but ZHA or Z2M is one good part on the community side.

I was writing the one of the bosses for the development (finding the name and email on FCC appliances) with the problems with the scene switching with the 5 button remote cant changing the group then its not possible binding it then its not having and scene cluster and also the blinds / signal repeater also need one group to being configured for working OK in open systems (IKEA is setting the group with touch link but is not possible doing with "normal" Zigbee 3 commands) and we dont have support for setting up devices with touch link (for the moment).

IKEA like having control over the feed of internal information and if leaking it can being very bad for the company.

I have offer my being one contact person for ZHA community and can writing on some papers if needed and shall not being any problems then i being Swedish (but living in Vienna) and can easy being tracked and also law forces if being liking information that is not OK.

IKEA is not selling some thousand Zigbe devices / year. I think it was 5 milions the first year and its not being lesser and the community is being one larger part of the market and im very sure IKEA is very known of that.

They is so large that the Zigbee alliances have making them one board member and later also tuya have getting one place in the board the last year.

ahd71 commented 2 years ago

Interesting insights (ps. I'm also a Swede and have always been a fan of IKEA and also want to support where I can).

MattWestb commented 2 years ago

By the way im in holiday in Spain but was doing one "pit stop" with the car in Montpelier by IKEA for getting 2 https://www.ikea.com/fr/fr/p/styrbar-telecommande-acier-inoxydable-10435224/ for 10€ etch :-)) Cant getting the stainless steel version in Austria and its was one very good price then the old 5 button is 15€ in Vienna.

Edit: IKEA in Valencia is having it now but for 12€ and its some rainy days here :-))