zigpy / bellows

A Python 3 project to implement EZSP for EmberZNet devices
GNU General Public License v3.0

:Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, #617

Closed pipiche38 closed 5 months ago

pipiche38 commented 5 months ago

Since the recent versions of zigpy & bellows we have been seeing this issue.

2024-04-12 20:00:14,109 DEBUG   :Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-04-12 20:00:14,112 ERROR   :NCP entered failed state. Requesting APP controller restart

If we fall back to the previous versions we do not get this error.

zigpy==0.59.0 bellows==0.36.8

We have put in place an automatic restart when the connection is lost, but it is frustrating not to understand the root cause.

A full log file is available here. This is a large network with more than 90 devices: https://github.com/zigpy/zigpy/discussions/1378#discussioncomment-9099362

puddly commented 5 months ago

ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT means the radio firmware did not receive an ACK in time and has reset itself. This is handled automatically by bellows so it usually means that the Python thread is locked up or otherwise slow. Make sure you have no blocking operations stalling the event loop.
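A minimal sketch of puddly's point, not part of bellows itself: if a blocking call runs directly on the event loop thread, every scheduled callback (including the ACK handling) is delayed for the full duration of the call, while offloading it with `loop.run_in_executor()` keeps the loop responsive. The `blocking_backup` and `heartbeat` names are illustrative only.

```python
import asyncio
import time


def blocking_backup() -> str:
    # Stand-in for a blocking operation (e.g. a database backup).  Calling
    # time.sleep() directly from a coroutine would freeze the whole loop.
    time.sleep(0.3)
    return "done"


async def heartbeat(delays: list) -> None:
    # Record how late each 50 ms tick fires; a large gap means the
    # event loop was stalled and could not service timers (or ACKs).
    for _ in range(10):
        start = time.monotonic()
        await asyncio.sleep(0.05)
        delays.append(time.monotonic() - start - 0.05)


async def main() -> float:
    delays: list = []
    hb = asyncio.create_task(heartbeat(delays))
    loop = asyncio.get_running_loop()
    # Correct pattern: run the blocking call in a worker thread so the
    # event loop keeps spinning while the backup runs.
    result = await loop.run_in_executor(None, blocking_backup)
    await hb
    assert result == "done"
    return max(delays)


worst = asyncio.run(main())
print(f"worst heartbeat delay: {worst * 1000:.0f} ms")
```

Had `blocking_backup()` been called inline instead, one heartbeat tick would have been delayed by roughly the full 300 ms of the block.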

pipiche38 commented 5 months ago

Can you clarify which thread? Is that the specific bellows thread? And when you say "the radio firmware did not receive an ACK in time", is that an ACK which has to be delivered by the zigpy stack (in this specific thread)?

Is the EventLoop thread a must for bellows, and what is the impact if we disable it?

[EDIT] I had a look at #598 and #604 and I do not fully understand. On our side, as part of the Zigbee4Domoticz plugin, we create a dedicated EventLoop thread, and I wonder what the benefits or disadvantages are of spawning a second thread for the bellows EventLoop. In addition, you left the possibility via the CONFIG parameter to disable the thread, so I wonder whether the thread is a must or not.

puddly commented 5 months ago

Can you clarify which thread? Is that the specific bellows thread?

Yes, it's a part of bellows.

Is that an ACK which has to be delivered by the zigpy stack (in this specific thread)?

It's delivered by bellows in its thread.

I wonder if then the thread is a must or not

If the thread is disabled, you'll get more errors like this.

pipiche38 commented 5 months ago

If I summarize your assumption: the issue is that the thread where the EventLoop runs does not get enough CPU time, and so all tasks in the EventLoop are delayed.
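The summary above can be checked empirically. Here is a hedged sketch (not part of bellows or zigpy) that measures how much later than requested `asyncio.sleep()` wakes up; sustained large values would confirm the EventLoop thread is being starved:

```python
import asyncio


async def measure_loop_lag(samples: int = 5, interval: float = 0.05) -> float:
    """Return the worst observed scheduling delay over `samples` sleeps."""
    loop = asyncio.get_running_loop()
    worst = 0.0
    for _ in range(samples):
        before = loop.time()
        await asyncio.sleep(interval)
        # How much later than requested did the loop resume us?
        worst = max(worst, loop.time() - before - interval)
    return worst


lag = asyncio.run(measure_loop_lag())
print(f"worst scheduling lag: {lag * 1000:.1f} ms")
```

Run periodically inside the bellows EventLoop, a monitor like this would show whether scheduling lag spikes correlate with the ACK timeout errors.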

MattWestb commented 5 months ago

I agree with you. My production system with many integrations was always having problems losing contact with 2 Chromecasts, and it started losing contact with the EZSP some months ago, while still running stable firmware (6.10.x) with 80 devices. My test systems (the largest has 25 devices on RCP) and the other small ones never have problems communicating with the EZSP / RCP. I think the change of the UART driver and changes in HA that lock the event loop are what is doing it, but I don't know which integration is responsible, or whether it is ZHA.

pipiche38 commented 5 months ago

@MattWestb we are not using HA, we are using zigpy with the Zigbee4Domoticz plugin. This is embedded Python in a C++ application. The user who is having these issues has more than 90 devices (70 of them routers), and there is quite some heavy Zigbee traffic.

@puddly may I suggest adding the following lines when creating the EventLoop? That way we could instrument it and get confirmation whenever a task is delayed by more than xx ms. I don't know what ACK delay the EZSP expects.

self.loop.set_debug(enabled=True)        # enable asyncio debug mode
self.loop.slow_callback_duration = 0.01  # warn when a callback runs longer than 10 ms

PS/ If the EZSP expects an ACK, why are we not processing this request synchronously?

[EDIT] I have investigated on the system where we have the issue, and I found a matching pattern: the errors coincide with a database backup started at the same time. So most likely the system is heavily loaded from an IO and CPU point of view, and less time is given to the threads/EventLoop to do their work. In addition, the fact that bellows creates its own thread with its own EventLoop probably doesn't help.

pipiche38 commented 5 months ago

I guess we are talking about:

2024-04-23 13:16:19,648 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 1.015 seconds
2024-04-23 13:16:27,239 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 1.003 seconds
2024-04-23 13:16:28,244 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 1.003 seconds
2024-04-23 13:16:29,247 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 1.002 seconds
2024-04-23 13:16:31,258 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 2.007 seconds
2024-04-23 16:50:28,791 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 0.153 seconds

Do we have any idea of what MAXIMUM_ACK_TIMEOUT_COUNT is and how much time it represents?

pipiche38 commented 5 months ago

The problem we are facing, after investigation, is that the GIL is not allocating time to the other threads due to an IO block during a backup. We are trying to fix that on the Domoticz application side.
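Since a thread-based offload still shares the GIL with the EventLoop thread, one mitigation for this kind of contention is to move the heavy work into a separate process entirely. A hedged sketch under that assumption, with a hypothetical `run_backup` standing in for the Domoticz database backup (the fork context is used so the example works without a `__main__` guard):

```python
import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def run_backup(path: str) -> str:
    # Hypothetical stand-in for the database backup.  Because it runs in a
    # separate process, its CPU and IO load cannot hold the GIL of the
    # process hosting the bellows EventLoop thread.
    return f"backed up {path}"


async def main() -> str:
    loop = asyncio.get_running_loop()
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as pool:
        # The event loop only awaits a future; the heavy work happens
        # in the child process, outside this interpreter's GIL.
        return await loop.run_in_executor(pool, run_backup, "domoticz.db")


result = asyncio.run(main())
print(result)
```

This does not change bellows' behaviour at all; it only removes the competing workload from the interpreter whose EventLoop must service the EZSP ACKs in time.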