zigpy / zigpy-deconz

A library which communicates with Deconz radios for zigpy
GNU General Public License v3.0
86 stars 20 forks source link

Detect when the firmware responds with the wrong response #241

Closed puddly closed 9 months ago

puddly commented 9 months ago

Detected with a Raspbee II (0x26780700) (thanks @Citizen-2CB8A24A):

# Watchdog TTL write (seq=101)
2023-12-16 00:31:24.247 DEBUG (MainThread) [zigpy_deconz.api] Sending CommandId.write_parameter{'parameter_id': <NetworkParameter.watchdog_ttl: 38>, 'parameter': b'<\x00\x00\x00'} (seq=101)
2023-12-16 00:31:24.248 DEBUG (MainThread) [zigpy_deconz.uart] Send: 0b65000c000500263c000000

# Response received, but the wrong one (device_state_changed??) (seq=101)
2023-12-16 00:31:25.470 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x0e65000700aa00
2023-12-16 00:31:25.471 DEBUG (MainThread) [zigpy_deconz.api] Received command CommandId.device_state_changed{'status': <Status.SUCCESS: 0>, 'frame_length': 7, 'device_state': DeviceState(network_state=<NetworkState2.CONNECTED: 2>, device_state=<DeviceStateFlags.APSDE_DATA_INDICATION|APSDE_DATA_REQUEST_FREE_SLOTS_AVAILABLE|32: 42>), 'reserved': 0} (seq 101)

# Another device_state_changed received (seq=102) right after
2023-12-16 00:31:25.648 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x0e66000700aa00
2023-12-16 00:31:25.650 DEBUG (MainThread) [zigpy_deconz.api] Received command CommandId.device_state_changed{'status': <Status.SUCCESS: 0>, 'frame_length': 7, 'device_state': DeviceState(network_state=<NetworkState2.CONNECTED: 2>, device_state=<DeviceStateFlags.APSDE_DATA_INDICATION|APSDE_DATA_REQUEST_FREE_SLOTS_AVAILABLE|32: 42>), 'reserved': 0} (seq 102)

From the log, you can see that the write_parameter request doesn't receive the correct response. It's almost as if the enqueued device_state_changed notification overwrites the expected write_parameter response.

A separate exception type is used to detect this rare problem and the watchdog TTL will be re-written until it succeeds.

codecov[bot] commented 9 months ago

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (cc1f018) 98.52% compared to head (23d5670) 98.68%.

Files Patch % Lines
zigpy_deconz/exception.py 88.88% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## dev #241 +/- ## ========================================== + Coverage 98.52% 98.68% +0.15% ========================================== Files 7 7 Lines 884 910 +26 ========================================== + Hits 871 898 +27 + Misses 13 12 -1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

puddly commented 9 months ago

It looks like this problem isn't limited to just device_state_changed. Here is another log, this time for a completely different command:

2023-12-17 22:04:06.794 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_confirm (0,)
2023-12-17 22:04:06.795 DEBUG (MainThread) [zigpy_deconz.uart] Send: 0x04c80007000000
2023-12-17 22:04:06.799 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x04c8050800010022
2023-12-17 22:04:06.801 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_indication (1, <DataIndicationFlags.Always_Use_NWK_Source_Addr: 1>)

A seq of 0xC8 is used for the aps_data_confirm request and the firmware erroneously responds with aps_data_indication.

To fix this, I'm globally enabling "invalid response" retries for all commands.

manup commented 9 months ago

Hi @puddly some background information, this is due the nature of a few async commands. While most commands act in a sync way there is the exception of ZGP and device state "notifications" which can be send by the firmware at any time. This is done so that the host application is notified immediately that new APS indications, APS confirm or other changes are available to poll.

In the deCONZ implementation all commands which are send to the firmware are handled in a queue which tracks the sequence number + command id. All incoming commands are checked against this queue. If above mentioned notifications like device state arrive, new respective commands to query APS indications/confirm are scheduled, to speed up everything compared to plain polling. This part is not open source yet but I can provide you the sources if that is of any help.

puddly commented 9 months ago

Thank you! I think I have this implemented in zigpy-deconz by triggering an event to instantly poll every time device_state is sent back either within any command response or by these notifications.

From the logs above, it seems like the sequence number desynchronizes, as the response sequence number matches the request sequence number from an unrelated command sent nearly 200ms in the past. Currently, I just endlessly retry if this is the case, as the chances of this occurring are very low.

In the future, it would be nice to reserve a sequence number for unsolicited notifications (e.g. 0x00 or 0xFF) and avoid using it during normal communications. Or maybe pick a random sequence number during a notification so that it does not collide with a pending one?

But it is good to know that aps_data_indication, device_state_changed, and zigbee_green_power are the only frames that can do this. I think this can help simplify the logic in a followup PR.