Closed pipiche38 closed 5 months ago
ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT
means the radio firmware did not receive an ACK in time and has reset itself. This is handled automatically by bellows so it usually means that the Python thread is locked up or otherwise slow. Make sure you have no blocking operations stalling the event loop.
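To make the failure mode concrete: any blocking call running on the event loop's thread also delays the timers and serial callbacks that the ACK handling relies on. A minimal, self-contained sketch (plain asyncio, not bellows code) comparing a blocking call made on the loop thread against the same call offloaded with `run_in_executor`:

```python
import asyncio
import time

async def timer_lateness(block_loop: bool) -> float:
    """Return how late a 50 ms timer fires while 200 ms of blocking
    'I/O' runs either on the loop thread or in an executor."""
    loop = asyncio.get_running_loop()
    fired = loop.create_future()
    start = loop.time()
    # Schedule a 50 ms timer and record the moment it actually fires.
    loop.call_later(0.05, lambda: fired.set_result(loop.time()))
    if block_loop:
        time.sleep(0.2)  # blocks the loop thread: the timer cannot fire
    else:
        # Offloading keeps the loop free to service timers (and, in
        # bellows' case, the serial traffic that carries the ACKs).
        await loop.run_in_executor(None, time.sleep, 0.2)
    return (await fired) - start - 0.05

blocked = asyncio.run(timer_lateness(True))     # ~0.15 s late
offloaded = asyncio.run(timer_lateness(False))  # fires almost on time
```

The same mechanism is why a single slow callback anywhere in the process can push bellows past the firmware's ACK deadline.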
Can you clarify which thread? Is that the specific bellows thread? And when you say "the radio firmware did not receive an ACK in time", is that an ACK which has to be delivered by the zigpy stack (i.e. this specific thread)?
Is the EventLoop thread a must for bellows, and what is the impact if we disable it?
[EDIT] I had a look at #598 and #604 and I do not fully understand them. On our side, as part of the Zigbee4Domoticz plugin we create a dedicated EventLoop thread, and I wonder what the benefits or disadvantages are of spawning a second thread for the bellows EventLoop. In addition, you left the possibility of disabling the thread via a CONFIG parameter, so I wonder whether the thread is a must or not.
can you clarify which thread ? is that the specific bellows Thread ?
Yes, it's a part of bellows.
is that an ACK which has to be delivered by the zigpy stack ( this specific thread) ?
It's delivered by bellows in its thread.
I wonder if then the thread is a must or not
If the thread is disabled, you'll get more errors like this.
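For reference, a hedged sketch of the config switch being discussed; the key name (`use_thread`, i.e. `bellows.config.CONF_USE_THREAD`) and the device path here are assumptions based on the bellows versions discussed in this thread, so check `bellows.config` for your release:

```python
# Hypothetical minimal config dict for bellows' ControllerApplication;
# "use_thread" is the switch discussed above. Path/baudrate are
# illustrative only.
config = {
    "device": {
        "path": "/dev/ttyUSB0",   # adjust to your radio
        "baudrate": 115200,
    },
    "use_thread": True,  # False = run bellows on the caller's loop,
                         # at the cost of more ACK-timeout errors
}
```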
If I summarize your assumption: the issue is that the thread running the EventLoop does not get enough CPU time, so all tasks in the EventLoop are delayed.
I agree with you: my production system has many integrations and was always having problems losing contact with 2 Chromecasts, and it started losing contact with the EZSP some months ago, while still running stable firmware (6.10.x) with 80 devices. My test systems (the largest has 25 devices, normally solid, an RCP with 25 devices) and the other small ones never have problems communicating with the EZSP / RCP. I think the change of UART driver, plus changes in HA that lock the event loop, are what is doing it, but I don't know which integration is responsible, or whether it's ZHA.
@MattWestb we are not using HA, we are using zigpy with the Zigbee4Domoticz plugin; this is embedded Python in a C++ application. The user who is having these issues has more than 90 devices (70 of them routers), and there is quite some heavy Zigbee traffic.
@puddly may I suggest adding the following lines to the EventLoop setup? That way we could instrument it and get confirmation if a task is delayed for more than xx ms. I don't know what turnaround time the EZSP expects for the ACK.
self.loop.set_debug(enabled=True)
self.loop.slow_callback_duration = 0.01
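Both of those knobs are standard asyncio APIs (`loop.set_debug` and `loop.slow_callback_duration`), so the effect can be demonstrated outside bellows. A self-contained sketch that captures the `asyncio` logger and triggers the slow-callback warning with a deliberately slow callback:

```python
import asyncio
import logging
import time

# Capture asyncio's warnings so the slow-callback report is visible.
records = []

class _Capture(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        records.append(record.getMessage())

logging.getLogger("asyncio").addHandler(_Capture())
logging.getLogger("asyncio").setLevel(logging.WARNING)

async def main() -> None:
    loop = asyncio.get_running_loop()
    loop.set_debug(True)                # enable asyncio debug mode
    loop.slow_callback_duration = 0.01  # report callbacks over 10 ms
    loop.call_soon(time.sleep, 0.05)    # deliberately slow callback
    await asyncio.sleep(0.1)

asyncio.run(main())
# records now contains an "Executing <Handle ...> took ... seconds" line
```

This is exactly the kind of "Executing ... took N seconds" warning quoted later in this thread.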
PS: If the EZSP expects an ACK, why are we not processing this request synchronously?
[EDIT] I have investigated the system where we have the issue and found a matching pattern: the errors coincide with a database backup started at the same time. So most likely the system is heavily loaded from an I/O and CPU point of view, and less time is given to the threads/EventLoop to do their work. In addition, the fact that bellows creates its own thread with its own EventLoop probably doesn't help.
I guess we are talking about:
2024-04-23 13:16:19,648 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 1.015 seconds
2024-04-23 13:16:27,239 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 1.003 seconds
2024-04-23 13:16:28,244 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 1.003 seconds
2024-04-23 13:16:29,247 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 1.002 seconds
2024-04-23 13:16:31,258 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 2.007 seconds
2024-04-23 16:50:28,791 WARNING :Executing <Handle ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:110 created at /usr/local/lib/python3.9/dist-packages/bellows/thread.py:120> took 0.153 seconds
Do we have any idea what MAXIMUM_ACK_TIMEOUT_COUNT is and how much time it represents?
After investigation, the problem we are facing is that the GIL is not giving time to the other threads because of an I/O block during a backup. We are trying to fix that on the Domoticz application side.
Since the recent versions of zigpy & bellows we have been having this issue.
If we fall back to the previous versions we do not get this error.
We have put in place an automatic restart when the connection is lost, but it is frustrating not to understand the root cause.
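For what it's worth, the automatic restart we put in place is essentially a supervision loop with exponential backoff. A simplified, self-contained sketch; `flaky_connect` is a hypothetical stand-in for the real radio startup code:

```python
import asyncio

async def supervise(connect, attempts: int,
                    base: float = 0.01, cap: float = 0.08):
    """Retry `connect` with exponential backoff; return delays used.

    `connect` stands in for the real startup (in our plugin,
    re-creating the bellows application after a lost connection).
    """
    delays, delay = [], base
    for _ in range(attempts):
        try:
            await connect()
            return delays            # connected: stop retrying
        except ConnectionError:
            delays.append(delay)
            await asyncio.sleep(delay)
            delay = min(delay * 2, cap)
    return delays

# Demo: fail twice, then succeed on the third attempt.
state = {"calls": 0}

async def flaky_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("radio not ready")

delays = asyncio.run(supervise(flaky_connect, attempts=5))
```

In production the base delay would be on the order of seconds, not milliseconds; the short values here just keep the demo fast.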
A full log file is available here; this is a large network with more than 90 devices: https://github.com/zigpy/zigpy/discussions/1378#discussioncomment-9099362