Closed dumpfheimer closed 3 days ago
Could you email me (or upload here) a complete log of it crashing? I've been running that firmware for a few days and have experienced no problems with it. zigpy-znp should recover even from complete coordinator lock-ups (i.e. where the serial port is still alive) and try to reconnect.
The serial port is not alive anymore when this happens. The only thing that helps is power cycling the coordinator. Do you need debug turned on? Or should warnings be enough?
I see. So the lockup is detected, but the coordinator itself will never reset and zigpy-znp endlessly tries to reconnect? Are you using a TCP coordinator or is this a USB one?
It's a Cc1352P2 launchpad connected via USB.
Yes if I remember correctly it seemed like zigpy recognized the connection issue and tried to reconnect but everything seemed to result in TimeoutError s (from the top of my head)
Turned on debug logging
Thanks. I feel like this is still a firmware problem: reading too slowly from a serial port shouldn't cause an unrecoverable crash, even if the other side of the serial connection is doing something to cause it.
That being said, moving pyserial-asyncio to another thread like you did may be necessary. We already do this with bellows (https://github.com/zigpy/bellows/blob/3c3ee0296d35eb43d0493eb9b2160bc4484e892c/bellows/uart.py#L386).
I'm with you. No matter what the controller should stay up IMHO. It just got me thinking that z2m users seem to not have the same issues.
I am having a good experience with the dedicated thread. But I still get the seemingly random lock ups (eg at night).
Thanks for the link to bellows, maybe I can get some ideas for my znp PoC.
@dumpfheimer Can you replicate the crash with https://github.com/puddly/serialpy?
You'll have to slightly patch zigpy-znp to use it but I'm wondering if pyserial itself may be the cause:
diff --git a/zigpy_znp/uart.py b/zigpy_znp/uart.py
index 5571e60..eb662e1 100644
--- a/zigpy_znp/uart.py
+++ b/zigpy_znp/uart.py
@@ -21,6 +21,10 @@ with warnings.catch_warnings():
import serial_asyncio # noqa: E402
+import serialpy as serial
+import serialpy as serial_asyncio
+
+
LOGGER = logging.getLogger(__name__)
I think I have it running. Can I somehow verify that it's using your serialpy impl?
If you add serialpy.serial: debug
to your HA logging config, you will see lines like this on startup:
2022-08-25 15:38:47.774 ubuntu zigpy_znp.api DEBUG Toggling RTS/DTR pins to skip bootloader or reset chip
2022-08-25 15:38:47.775 ubuntu zigpy_znp.uart DEBUG Setting serial pin states: DTR=False, RTS=False
2022-08-25 15:38:47.775 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000002
2022-08-25 15:38:47.775 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000004
2022-08-25 15:38:47.926 ubuntu zigpy_znp.uart DEBUG Setting serial pin states: DTR=False, RTS=True
2022-08-25 15:38:47.926 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000002
2022-08-25 15:38:47.927 ubuntu serialpy.serial DEBUG Setting modem bits: 0x00000004
2022-08-25 15:38:48.079 ubuntu zigpy_znp.uart DEBUG Setting serial pin states: DTR=False, RTS=False
2022-08-25 15:38:48.079 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000002
2022-08-25 15:38:48.079 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000004
Can you make sense of this error?
2022-08-25` 22:07:00.143 DEBUG (MainThread) [zigpy_znp.uart] Connecting to /dev/serial/by-id/usb-Texas_Instruments_XDS110__03.00.00.20__Embed_with_CMSIS-DAP_L4300230-if00 at 115200 baud
2022-08-25 22:07:00.147 DEBUG (MainThread) [zigpy_znp.api] Connection to /dev/serial/by-id/usb-Texas_Instruments_XDS110__03.00.00.20__Embed_with_CMSIS-DAP_L4300230-if00 failed, cleaning up
2022-08-25 22:07:00.147 ERROR (MainThread) [zigpy.application] Couldn't start application
Traceback (most recent call last):
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
await self.connect()
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
await znp.connect()
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
_, protocol = await serial_asyncio.create_serial_connection(
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
transport = transport_factory(
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 30, in __init__
self._serial = Serial(
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/serial.py", line 111, in __init__ self._file = os.fdopen(self._fileno, "rb+")
File "/usr/lib/python3.10/os.py", line 1029, in fdopen
return io.open(fd, mode, buffering, encoding, *args, **kwargs)
io.UnsupportedOperation: File or stream is not seekable.
2022-08-25 22:07:00.153 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback ZnpMtProtocol.connection_made(<serialpy.Ser...x7eff2c21ef00>): File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 126, in new
await app.startup(auto_form=auto_form)
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
await self.connect()
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
await znp.connect()
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
_, protocol = await serial_asyncio.create_serial_connection(
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
transport = transport_factory(
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 29, in __init__
super().__init__(loop, protocol, path, waiter, extra)
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/descriptor_transport.py", line 50, in __init__
self._loop.call_soon(self._protocol.connection_made, self)
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 66, in connection_made
LOGGER.debug("Opened %s serial port", transport.serial.name)
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 44, in serial
return self._serial
AttributeError: 'SerialTransport' object has no attribute '_serial'
2022-08-25 22:07:00.155 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback BaseSelectorEventLoop.add_reader(44, <bound method...7eff2c21ef00>>): File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 126, in new
await app.startup(auto_form=auto_form)
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
await self.connect()
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
await znp.connect()
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
_, protocol = await serial_asyncio.create_serial_connection(
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
transport = transport_factory(
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 29, in __init__
super().__init__(loop, protocol, path, waiter, extra)
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/descriptor_transport.py", line 51, in __init__
self._loop.call_soon(self._loop.add_reader, self._fileno, self._read_ready)
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/selector_events.py", line 256, in _add_reader
key = self._selector.get_key(fd)
File "/usr/lib/python3.10/selectors.py", line 193, in get_key
raise KeyError("{!r} is not registered".format(fileobj)) from None
KeyError: '44 is not registered'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/lib/python3.10/asyncio/selector_events.py", line 331, in add_reader
self._add_reader(fd, callback, *args)
File "/usr/lib/python3.10/asyncio/selector_events.py", line 258, in _add_reader
self._selector.register(fd, selectors.EVENT_READ,
File "/usr/lib/python3.10/selectors.py", line 360, in register
self._selector.register(key.fd, poller_events)
OSError: [Errno 9] Bad file descriptor
2022-08-25 22:07:00.178 ERROR (MainThread) [homeassistant.components.zha.core.gateway] Couldn't start ZNP = Texas Instruments Z-Stack ZNP protocol: CC253x, CC26x2, CC13x2 coordinator
Traceback (most recent call last):
File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/core/gateway.py", line 170, in async_initialize
self.application_controller = await app_controller_cls.new(
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 126, in new
await app.startup(auto_form=auto_form)
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
await self.connect()
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
await znp.connect()
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
_, protocol = await serial_asyncio.create_serial_connection(
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
transport = transport_factory(
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 30, in __init__
self._serial = Serial(
File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/serial.py", line 111, in __init__ self._file = os.fdopen(self._fileno, "rb+")
File "/usr/lib/python3.10/os.py", line 1029, in fdopen
return io.open(fd, mode, buffering, encoding, *args, **kwargs)
io.UnsupportedOperation: File or stream is not seekable.```
Maybe buffering=0? open("/dev/tty", "r+b", buffering=0)
Seems like that did the trick
Nice find, thanks!
Okay, got your latest serialpy (starts fine now, thanks) and flashed the 20220726 firmware again. Also, I disabled my "threaded zigpy-znp stuff" just to be sure.
Just curious, was there a reason you started your own serial implementation?
Pyserial has an entirely synchronous API and pyserial-asyncio as a shim doesn't actually delegate file descriptor reads to the event loop: it still uses select()
internally, after the event loop notifies it of new data.
This is more of a test than anything else. Maintaining an entire serial library that is more than just POSIX-compliant is a little out of scope for zigpy but if this alternative implementation doesn't have the same bugs, it may help narrow down where to look.
Very interesting! And a bit over my head :-D
My controller locked up again over night. These are the last lines:
2022-08-26 03:15:05.741 DEBUG (MainThread) [zigpy_znp.api] Sending request: SYS.Ping.Req()
2022-08-26 03:15:05.742 DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0xE087:1:0x0b04]: failed to get attributes '['active_power', 'rms_current', 'rms_voltage']' on 'electrical_measurement' cluster:
2022-08-26 03:15:05.812 DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0xE087:1:0x0702]: failed to get attributes '['current_summ_delivered', 'status']' on 'smartenergy_metering' cluster:
2022-08-26 03:15:10.742 ERROR (MainThread) [zigpy_znp.zigbee.application] Watchdog check failed
Traceback (most recent call last):
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 984, in request
response = await response_future
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 709, in _watchdog_loop
await self._znp.request(c.SYS.Ping.Req())
File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 982, in request
async with async_timeout.timeout(self._znp_config[conf.CONF_SREQ_TIMEOUT]):
File "/srv/homeassistant/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
self._do_exit(exc_type)
File "/srv/homeassistant/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2022-08-26 03:15:10.743 DEBUG (MainThread) [zigpy_znp.zigbee.application] Connection lost:
2022-08-26 03:15:10.744 DEBUG (MainThread) [zigpy_znp.uart] Closing serial port
2022-08-26 03:15:10.745 DEBUG (MainThread) [zigpy_znp.zigbee.application] Restarting background reconnection task
I will now try again with the dedicated thread in zigpy znp
I am not sure if restarting HA couple of times would make things better or worse, but up until now I have had no issue with serialpy + dedicated ZNP thread
Still no crash π
Still no crash, I think you're on to something with the serialpy π
Was it crashing before, using your serial thread but with pyserial instead?
Yes, this is the first setup that seems stable with a SimpleLink SDK 6 firmware
My controller froze over night π will try to find something in the logs
One thing that just came to my mind: I have quite a few smart plugs witch energy measurement. (TS011F)
When I started using them in December they did not Report some Attributes on their own - and I think they still don't. I believe someone built a timer in zigpy / quirks that does that now (if I remember correctly every 30s).
One thing that MIGHT make a difference is if one of these plugs is unavailable - which was not the case the night before, but was the case tonight.
Maybe route discovery has a memory leak or something similar? Maybe the timer is messing with IO?
This might be a clue! I went to the device page of an unavailable device (that plug) and went to the electrical measurement cluster and requested the current power few dozen time. My controller crashed within a few hours. Until then the log was full with route discovery requests and errors
Does someone know if/how the attribute polling could affect the controller/serial loop?
Hmmm. Maybe try dropping the request concurrency? It should be 16 for you. Maybe try 4 or 8? https://github.com/zigpy/zigpy-znp#configuration
I think we should just try to get a proper debug report from one of the TI dev kits of the state of the microcontroller when it crashes and submit a bug report to Texas Instruments.
I'll try that!
How could I do the dump? I have a headless server. Is there a open source tool in the Ubuntu repositories that could do that?
Is it possible to initiate a route discovery manually? I would like to try to automatically issue a bunch of route discovery requests to try and see if that might be the issue.
You can:
Remove the @combine_concurrent_calls
decorator to make it send one every time you call the function.
Correct me if I'm wrong, but should the function/function name not be part of the key? https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/utils.py#L184
If I read the code correctly the key would be the same for both functions using @combine_concurrent_calls: https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/zigbee/application.py#L1072 https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/zigbee/application.py#L722
tasks
is created every time combine_concurrent_calls
is called (once for every decorated function) and the returned replacement function is a closure that references that same object:
def decorator(func):
cache = {}
def replacement(*args, **kwargs):
print("Current cache is", cache)
cache[func.__name__] = True
return func(*args, **kwargs), cache
return replacement
@decorator
def func1():
return 1
@decorator
def func2():
return 2
if __name__ == "__main__":
assert func1() == (1, {"func1": True})
assert func2() == (2, {"func2": True})
assert func2() == (2, {"func2": True})
assert func1() == (1, {"func1": True})
Ah, thanks, got it =)
Another question: I don't know if its even possible, but IF the task in combine_concurrent_calls returns early (for whatever reason) and the assertion fails.. Would that mean that the task is never removed from the array? https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/utils.py#L194-L195
had no success with excessive route discovery requests btw =/
It would but I don't think it's possible to not be await
ing the task and for it to not either have an exception, a result, or be cancelled. Any of the three are "done". You'd see a logged error if that happened, which I've not seen so far.
Thanks!
I'm a bit out of ideas in the mean time.. Can only wait for crahes and see if I can find a pattern..
I could try to intercept ther serial communication and log that..?
Could the logger be causing a delay? It's the only thing (IO) left in the thread that is reading from the serial port
Not if you disable it. The loggers intentionally use string formatting syntax to avoid unnecessary work if the log level is disabled.
I removed all logs from the zigpy thread.. I'll see.. Is your serial port also configured with "UART unknown"?
# setserial /dev/serial/by-id/usb-Texas_Instruments_XDS110__03.00.00.20__Embed_with_CMSIS-DAP_L4300230-if03
/dev/serial/by-id/usb-Texas_Instruments_XDS110__03.00.00.20__Embed_with_CMSIS-DAP_L4300230-if03, UART: unknown, Port: 0x0000, IRQ: 0
There seems to be a Low latency setting for serial ports in Linux.. maybe that could help
I did not retry this but I before set the max concurrent requests to 64 and it crashed and got unresponsive. Maybe you can reproduce the lock up this way?
does the zigpy backup honor the maximum concurrent requests? I think I just had a freeze triggered by a backup
Backups don't send requests so that limit doesn't apply. Z-Stack also supports only a single in-flight request at a time, which zigpy-znp honors:
Thanks again for your fast reply. Could it be that in the background there are incoming messages (eg attribute push mesages) piling up and consuming all the memory?
I think I'll give up for now. I flashed a V5 firmware again. Maybe I can use a spare controller to make a test setup.
Hey @puddly I just thought my coordinator crashed, but it seems like it was a backup that blocked all requests and stalled them. Is the complete backup done in one sequence without time for requests inbetween?
It's done asynchronously, both happen at the same time and whichever one enqueues a request first goes first.
That being said, only a partial backup is taken at runtime after the complete one is taken on radio startup, and only once every 24 hours. It takes 0.02s for one to run for me.
Is the serial port supposed to be closed just before the backup?
No, backups just send commands and receive responses, the serial port is never closed at runtime (unless the watchdog closes it after 30s of unresponsiveness).
Would a backup be made immediately after a reconnect? I seem to have a "closing serial port" message within a few seconds before every Backup
When using koenkk`s development firmware, that is built on SimpleLink SDK Version 6.10 or 6.20, the coordinator seems to freeze under certain conditions.
It seems like memory issues that causes the lock-ups, which might be triggered at high loads or simply after some time by chance.
I believe this might be an issue only since a month or two and, while I do believe the root cause is somewhere in the coordinator firmware, I think it was fairly recent changes in zigpy that started triggering the bug. This is why:
I have had the development firmware from Feb running since Feb without issues. Some time a month ago the issues started (I was very likely on zigpy dev) when I upgraded the firmware to the latest dev build from koenkk. Downgrading to Feb firmware did not fix the issue. I had to downgrade all the way to a SDK 5.x.x Version to have a stable environment again.
Have there been noticable changes in July (+/-)?
@puddly you commented on an issue I created here and mentioned RAM usage here .
Is there something that could be done within zigpy to reduce memory usage on the controller (without loss of function mentioned here )? Or do you, zigpy devs, believe this must be fixed in SimpleLink?
Thanks for your work, it's very much appreciated.