zigpy / zigpy-znp

TI CC2531, CC13x2, CC26x2 radio support for Zigpy and ZHA
GNU General Public License v3.0
144 stars 40 forks source link

Coordinator freezes with SimpleLink SDK >= 6.x.x #165

Open dumpfheimer opened 1 year ago

dumpfheimer commented 1 year ago

When using koenkk`s development firmware, that is built on SimpleLink SDK Version 6.10 or 6.20, the coordinator seems to freeze under certain conditions.

It seems like memory issues that causes the lock-ups, which might be triggered at high loads or simply after some time by chance.

I believe this might be an issue only since a month or two and, while I do believe the root cause is somewhere in the coordinator firmware, I think it was fairly recent changes in zigpy that started triggering the bug. This is why:

I have had the development firmware from Feb running since Feb without issues. Some time a month ago the issues started (I was very likely on zigpy dev) when I upgraded the firmware to the latest dev build from koenkk. Downgrading to Feb firmware did not fix the issue. I had to downgrade all the way to a SDK 5.x.x Version to have a stable environment again.

Have there been noticable changes in July (+/-)?

@puddly you commented on an issue I created here and mentioned RAM usage here .

Is there something that could be done within zigpy to reduce memory usage on the controller (without loss of function mentioned here )? Or do you, zigpy devs, believe this must be fixed in SimpleLink?

Thanks for your work, it's very much appreciated.

puddly commented 1 year ago

Could you email me (or upload here) a complete log of it crashing? I've been running that firmware for a few days and have experienced no problems with it. zigpy-znp should recover even from complete coordinator lock-ups (i.e. where the serial port is still alive) and try to reconnect.

dumpfheimer commented 1 year ago

The serial port is not alive anymore when this happens. The only thing that helps is power cycling the coordinator. Do you need debug turned on? Or should warnings be enough?

puddly commented 1 year ago

I see. So the lockup is detected, but the coordinator itself will never reset and zigpy-znp endlessly tries to reconnect? Are you using a TCP coordinator or is this a USB one?

dumpfheimer commented 1 year ago

It's a Cc1352P2 launchpad connected via USB.

Yes if I remember correctly it seemed like zigpy recognized the connection issue and tried to reconnect but everything seemed to result in TimeoutError s (from the top of my head)

dumpfheimer commented 1 year ago

Turned on debug logging

puddly commented 1 year ago

Thanks. I feel like this is still a firmware problem: reading too slowly from a serial port shouldn't cause an unrecoverable crash, even if the other side of the serial connection is doing something to cause it.

That being said, moving pyserial-asyncio to another thread like you did may be necessary. We already do this with bellows (https://github.com/zigpy/bellows/blob/3c3ee0296d35eb43d0493eb9b2160bc4484e892c/bellows/uart.py#L386).

dumpfheimer commented 1 year ago

I'm with you. No matter what the controller should stay up IMHO. It just got me thinking that z2m users seem to not have the same issues.

I am having a good experience with the dedicated thread. But I still get the seemingly random lock ups (eg at night).

Thanks for the link to bellows, maybe I can get some ideas for my znp PoC.

puddly commented 1 year ago

@dumpfheimer Can you replicate the crash with https://github.com/puddly/serialpy?

You'll have to slightly patch zigpy-znp to use it but I'm wondering if pyserial itself may be the cause:

diff --git a/zigpy_znp/uart.py b/zigpy_znp/uart.py
index 5571e60..eb662e1 100644
--- a/zigpy_znp/uart.py
+++ b/zigpy_znp/uart.py
@@ -21,6 +21,10 @@ with warnings.catch_warnings():
     import serial_asyncio  # noqa: E402

+import serialpy as serial
+import serialpy as serial_asyncio
+
+
 LOGGER = logging.getLogger(__name__)
dumpfheimer commented 1 year ago

I think I have it running. Can I somehow verify that it's using your serialpy impl?

puddly commented 1 year ago

If you add serialpy.serial: debug to your HA logging config, you will see lines like this on startup:

2022-08-25 15:38:47.774 ubuntu zigpy_znp.api DEBUG Toggling RTS/DTR pins to skip bootloader or reset chip
2022-08-25 15:38:47.775 ubuntu zigpy_znp.uart DEBUG Setting serial pin states: DTR=False, RTS=False
2022-08-25 15:38:47.775 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000002
2022-08-25 15:38:47.775 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000004
2022-08-25 15:38:47.926 ubuntu zigpy_znp.uart DEBUG Setting serial pin states: DTR=False, RTS=True
2022-08-25 15:38:47.926 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000002
2022-08-25 15:38:47.927 ubuntu serialpy.serial DEBUG Setting modem bits: 0x00000004
2022-08-25 15:38:48.079 ubuntu zigpy_znp.uart DEBUG Setting serial pin states: DTR=False, RTS=False
2022-08-25 15:38:48.079 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000002
2022-08-25 15:38:48.079 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000004
dumpfheimer commented 1 year ago

Can you make sense of this error?


2022-08-25` 22:07:00.143 DEBUG (MainThread) [zigpy_znp.uart] Connecting to /dev/serial/by-id/usb-Texas_Instruments_XDS110__03.00.00.20__Embed_with_CMSIS-DAP_L4300230-if00 at 115200 baud
2022-08-25 22:07:00.147 DEBUG (MainThread) [zigpy_znp.api] Connection to /dev/serial/by-id/usb-Texas_Instruments_XDS110__03.00.00.20__Embed_with_CMSIS-DAP_L4300230-if00 failed, cleaning up
2022-08-25 22:07:00.147 ERROR (MainThread) [zigpy.application] Couldn't start application
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
    await self.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
    await znp.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
    self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
    _, protocol = await serial_asyncio.create_serial_connection(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
    transport = transport_factory(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 30, in __init__
    self._serial = Serial(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/serial.py", line 111, in __init__                                         self._file = os.fdopen(self._fileno, "rb+")
  File "/usr/lib/python3.10/os.py", line 1029, in fdopen
    return io.open(fd, mode, buffering, encoding, *args, **kwargs)
io.UnsupportedOperation: File or stream is not seekable.
2022-08-25 22:07:00.153 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback ZnpMtProtocol.connection_made(<serialpy.Ser...x7eff2c21ef00>):   File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 126, in new
    await app.startup(auto_form=auto_form)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
    await self.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
    await znp.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
    self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
    _, protocol = await serial_asyncio.create_serial_connection(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
    transport = transport_factory(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 29, in __init__
    super().__init__(loop, protocol, path, waiter, extra)
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/descriptor_transport.py", line 50, in __init__
    self._loop.call_soon(self._protocol.connection_made, self)
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 66, in connection_made
    LOGGER.debug("Opened %s serial port", transport.serial.name)
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 44, in serial
    return self._serial
AttributeError: 'SerialTransport' object has no attribute '_serial'
2022-08-25 22:07:00.155 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback BaseSelectorEventLoop.add_reader(44, <bound method...7eff2c21ef00>>):   File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 126, in new
    await app.startup(auto_form=auto_form)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
    await self.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
    await znp.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
    self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
    _, protocol = await serial_asyncio.create_serial_connection(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
    transport = transport_factory(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 29, in __init__
    super().__init__(loop, protocol, path, waiter, extra)
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/descriptor_transport.py", line 51, in __init__
    self._loop.call_soon(self._loop.add_reader, self._fileno, self._read_ready)
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 256, in _add_reader
    key = self._selector.get_key(fd)
  File "/usr/lib/python3.10/selectors.py", line 193, in get_key
    raise KeyError("{!r} is not registered".format(fileobj)) from None
KeyError: '44 is not registered'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 331, in add_reader
    self._add_reader(fd, callback, *args)
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 258, in _add_reader
    self._selector.register(fd, selectors.EVENT_READ,
  File "/usr/lib/python3.10/selectors.py", line 360, in register
    self._selector.register(key.fd, poller_events)
OSError: [Errno 9] Bad file descriptor
2022-08-25 22:07:00.178 ERROR (MainThread) [homeassistant.components.zha.core.gateway] Couldn't start ZNP = Texas Instruments Z-Stack ZNP protocol: CC253x, CC26x2, CC13x2 coordinator
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/core/gateway.py", line 170, in async_initialize
    self.application_controller = await app_controller_cls.new(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 126, in new
    await app.startup(auto_form=auto_form)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
    await self.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
    await znp.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
    self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
    _, protocol = await serial_asyncio.create_serial_connection(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
    transport = transport_factory(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 30, in __init__
    self._serial = Serial(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/serial.py", line 111, in __init__                                         self._file = os.fdopen(self._fileno, "rb+")
  File "/usr/lib/python3.10/os.py", line 1029, in fdopen
    return io.open(fd, mode, buffering, encoding, *args, **kwargs)
io.UnsupportedOperation: File or stream is not seekable.```
dumpfheimer commented 1 year ago

Maybe buffering=0? open("/dev/tty", "r+b", buffering=0)

From https://bugs.python.org/issue20074

dumpfheimer commented 1 year ago

Seems like that did the trick

puddly commented 1 year ago

Nice find, thanks!

dumpfheimer commented 1 year ago

Okay, got your latest serialpy (starts fine now, thanks) and flashed the 20220726 firmware again. Also, I disabled my "threaded zigpy-znp stuff" just to be sure.

dumpfheimer commented 1 year ago

Just curious, was there a reason you started your own serial implementation?

puddly commented 1 year ago

Pyserial has an entirely synchronous API and pyserial-asyncio as a shim doesn't actually delegate file descriptor reads to the event loop: it still uses select() internally, after the event loop notifies it of new data.

This is more of a test than anything else. Maintaining an entire serial library that is more than just POSIX-compliant is a little out of scope for zigpy but if this alternative implementation doesn't have the same bugs, it may help narrow down where to look.

dumpfheimer commented 1 year ago

Very interesting! And a bit over my head :-D

My controller locked up again over night. These are the last lines:

2022-08-26 03:15:05.741 DEBUG (MainThread) [zigpy_znp.api] Sending request: SYS.Ping.Req()
2022-08-26 03:15:05.742 DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0xE087:1:0x0b04]: failed to get attributes '['active_power', 'rms_current', 'rms_voltage']' on 'electrical_measurement' cluster:
2022-08-26 03:15:05.812 DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0xE087:1:0x0702]: failed to get attributes '['current_summ_delivered', 'status']' on 'smartenergy_metering' cluster:
2022-08-26 03:15:10.742 ERROR (MainThread) [zigpy_znp.zigbee.application] Watchdog check failed
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 984, in request
    response = await response_future
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 709, in _watchdog_loop
    await self._znp.request(c.SYS.Ping.Req())
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 982, in request
    async with async_timeout.timeout(self._znp_config[conf.CONF_SREQ_TIMEOUT]):
  File "/srv/homeassistant/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/srv/homeassistant/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2022-08-26 03:15:10.743 DEBUG (MainThread) [zigpy_znp.zigbee.application] Connection lost:
2022-08-26 03:15:10.744 DEBUG (MainThread) [zigpy_znp.uart] Closing serial port
2022-08-26 03:15:10.745 DEBUG (MainThread) [zigpy_znp.zigbee.application] Restarting background reconnection task

I will now try again with the dedicated thread in zigpy znp

dumpfheimer commented 1 year ago

I am not sure if restarting HA couple of times would make things better or worse, but up until now I have had no issue with serialpy + dedicated ZNP thread

dumpfheimer commented 1 year ago

Still no crash πŸ™‚

dumpfheimer commented 1 year ago

Still no crash, I think you're on to something with the serialpy πŸ‘

puddly commented 1 year ago

Was it crashing before, using your serial thread but with pyserial instead?

dumpfheimer commented 1 year ago

Yes, this is the first setup that seems stable with a SimpleLink SDK 6 firmware

dumpfheimer commented 1 year ago

My controller froze over night 😟 will try to find something in the logs

dumpfheimer commented 1 year ago

One thing that just came to my mind: I have quite a few smart plugs witch energy measurement. (TS011F)

When I started using them in December they did not Report some Attributes on their own - and I think they still don't. I believe someone built a timer in zigpy / quirks that does that now (if I remember correctly every 30s).

One thing that MIGHT make a difference is if one of these plugs is unavailable - which was not the case the night before, but was the case tonight.

Maybe route discovery has a memory leak or something similar? Maybe the timer is messing with IO?

dumpfheimer commented 1 year ago

This might be a clue! I went to the device page of an unavailable device (that plug) and went to the electrical measurement cluster and requested the current power few dozen time. My controller crashed within a few hours. Until then the log was full with route discovery requests and errors

dumpfheimer commented 1 year ago

Does someone know if/how the attribute polling could affect the controller/serial loop?

puddly commented 1 year ago

Hmmm. Maybe try dropping the request concurrency? It should be 16 for you. Maybe try 4 or 8? https://github.com/zigpy/zigpy-znp#configuration

I think we should just try to get a proper debug report from one of the TI dev kits of the state of the microcontroller when it crashes and submit a bug report to Texas Instruments.

dumpfheimer commented 1 year ago

I'll try that!

How could I do the dump? I have a headless server. Is there a open source tool in the Ubuntu repositories that could do that?

dumpfheimer commented 1 year ago

Is it possible to initiate a route discovery manually? I would like to try to automatically issue a bunch of route discovery requests to try and see if that might be the issue.

puddly commented 1 year ago

You can:

https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/zigbee/application.py#L1071-L1091

Remove the @combine_concurrent_calls decorator to make it send one every time you call the function.

dumpfheimer commented 1 year ago

Correct me if I'm wrong, but should the function/function name not be part of the key? https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/utils.py#L184

If I read the code correctly the key would be the same for both functions using @combine_concurrent_calls: https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/zigbee/application.py#L1072 https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/zigbee/application.py#L722

puddly commented 1 year ago

tasks is created every time combine_concurrent_calls is called (once for every decorated function) and the returned replacement function is a closure that references that same object:

def decorator(func):
    cache = {}

    def replacement(*args, **kwargs):
        print("Current cache is", cache)
        cache[func.__name__] = True

        return func(*args, **kwargs), cache

    return replacement

@decorator
def func1():
    return 1

@decorator
def func2():
    return 2

if __name__ == "__main__":
    assert func1() == (1, {"func1": True})
    assert func2() == (2, {"func2": True})
    assert func2() == (2, {"func2": True})
    assert func1() == (1, {"func1": True})
dumpfheimer commented 1 year ago

Ah, thanks, got it =)

Another question: I don't know if its even possible, but IF the task in combine_concurrent_calls returns early (for whatever reason) and the assertion fails.. Would that mean that the task is never removed from the array? https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/utils.py#L194-L195

dumpfheimer commented 1 year ago

had no success with excessive route discovery requests btw =/

puddly commented 1 year ago

It would but I don't think it's possible to not be awaiting the task and for it to not either have an exception, a result, or be cancelled. Any of the three are "done". You'd see a logged error if that happened, which I've not seen so far.

dumpfheimer commented 1 year ago

Thanks!

I'm a bit out of ideas in the mean time.. Can only wait for crahes and see if I can find a pattern..

dumpfheimer commented 1 year ago

I could try to intercept ther serial communication and log that..?

dumpfheimer commented 1 year ago

Could the logger be causing a delay? It's the only thing (IO) left in the thread that is reading from the serial port

puddly commented 1 year ago

Not if you disable it. The loggers intentionally use string formatting syntax to avoid unnecessary work if the log level is disabled.

dumpfheimer commented 1 year ago

I removed all logs from the zigpy thread.. I'll see.. Is your serial port also configured with "UART unknown"?

# setserial /dev/serial/by-id/usb-Texas_Instruments_XDS110__03.00.00.20__Embed_with_CMSIS-DAP_L4300230-if03
/dev/serial/by-id/usb-Texas_Instruments_XDS110__03.00.00.20__Embed_with_CMSIS-DAP_L4300230-if03, UART: unknown, Port: 0x0000, IRQ: 0

There seems to be a Low latency setting for serial ports in Linux.. maybe that could help

dumpfheimer commented 1 year ago

I did not retry this but I before set the max concurrent requests to 64 and it crashed and got unresponsive. Maybe you can reproduce the lock up this way?

dumpfheimer commented 1 year ago

does the zigpy backup honor the maximum concurrent requests? I think I just had a freeze triggered by a backup

puddly commented 1 year ago

Backups don't send requests so that limit doesn't apply. Z-Stack also supports only a single in-flight request at a time, which zigpy-znp honors:

https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/api.py#L959-L961

dumpfheimer commented 1 year ago

Thanks again for your fast reply. Could it be that in the background there are incoming messages (eg attribute push mesages) piling up and consuming all the memory?

I think I'll give up for now. I flashed a V5 firmware again. Maybe I can use a spare controller to make a test setup.

dumpfheimer commented 1 year ago

Hey @puddly I just thought my coordinator crashed, but it seems like it was a backup that blocked all requests and stalled them. Is the complete backup done in one sequence without time for requests inbetween?

puddly commented 1 year ago

It's done asynchronously, both happen at the same time and whichever one enqueues a request first goes first.

That being said, only a partial backup is taken at runtime after the complete one is taken on radio startup, and only once every 24 hours. It takes 0.02s for one to run for me.

dumpfheimer commented 1 year ago

Is the serial port supposed to be closed just before the backup?

puddly commented 1 year ago

No, backups just send commands and receive responses, the serial port is never closed at runtime (unless the watchdog closes it after 30s of unresponsiveness).

dumpfheimer commented 1 year ago

Would a backup be made immediately after a reconnect? I seem to have a "closing serial port" message within a few seconds before every Backup