zigpy / zigpy-znp

TI CC2531, CC13x2, CC26x2 radio support for Zigpy and ZHA
GNU General Public License v3.0

Coordinator freezes with SimpleLink SDK >= 6.x.x #165

Open dumpfheimer opened 2 years ago

dumpfheimer commented 2 years ago

When using koenkk's development firmware, which is built on SimpleLink SDK version 6.10 or 6.20, the coordinator seems to freeze under certain conditions.

The lock-ups look like they are caused by memory issues, which might be triggered under high load or simply by chance after some time.

I believe this has only been an issue for a month or two. While I do believe the root cause is somewhere in the coordinator firmware, I think fairly recent changes in zigpy started triggering the bug. This is why:

I had the development firmware from Feb running since Feb without issues. About a month ago the issues started (I was very likely on zigpy dev) when I upgraded the firmware to the latest dev build from koenkk. Downgrading to the Feb firmware did not fix the issue. I had to downgrade all the way to an SDK 5.x.x version to have a stable environment again.

Have there been noticeable changes in July (+/-)?

@puddly you commented on an issue I created here and mentioned RAM usage here.

Is there something that could be done within zigpy to reduce memory usage on the controller (without the loss of function mentioned here)? Or do you, zigpy devs, believe this must be fixed in SimpleLink?

Thanks for your work, it's very much appreciated.

puddly commented 2 years ago

A backup is taken the moment the radio starts up so if the serial port loses connection and zigpy-znp reconnects, a new backup will be taken every time.

I've modified my local setup to take a complete backup over and over in the background, with a 0 second delay between each one. I experience only a tiny delay sending requests but otherwise no noticeable impact so far in the past 10 minutes. This is with the same beta firmware, on the same TI CC1352p dev kit, with no flow control enabled.
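For anyone who wants to try a similar stress test, the pattern is roughly a background task that takes backups back to back while normal traffic continues. This is only a minimal sketch: `take_backup` is a hypothetical stand-in that sleeps instead of reading from the radio, and none of these names are zigpy-znp APIs.

```python
import asyncio


async def take_backup() -> dict:
    """Hypothetical stand-in for a full radio backup (not a zigpy-znp API)."""
    await asyncio.sleep(0.01)  # pretend to read NVRAM over the serial port
    return {"ok": True}


async def backup_forever(delay: float, counter: list) -> None:
    """Take backups in a loop, sleeping `delay` seconds between each one."""
    while True:
        await take_backup()
        counter[0] += 1
        await asyncio.sleep(delay)


async def stress(duration: float = 0.2) -> int:
    """Run the backup loop in the background for `duration` seconds."""
    counter = [0]
    task = asyncio.create_task(backup_forever(0, counter))
    await asyncio.sleep(duration)  # normal requests would be sent here
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return counter[0]  # number of completed backups
```

Calling `asyncio.run(stress())` returns how many simulated backups completed; in a real setup the interesting measurement would be request latency while the loop runs.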

dumpfheimer commented 2 years ago

Then I need to find out why my device is seemingly randomly disconnecting

dumpfheimer commented 2 years ago

Any idea what could cause this? Last log lines before close (Did not shut down HA)

2022-08-30 20:06:34.467 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Shutting down ZHA ControllerApplication
2022-08-30 20:06:34.472 DEBUG (MainThread) [homeassistant.components.zha.core.device] [0x2462](LCT003): last_seen is 11571.180652618408 seconds ago and ping attempts have been exhausted, marking the device unavailable
2022-08-30 20:06:34.472 DEBUG (MainThread) [homeassistant.components.zha.core.device] [0x2462](LCT003): Update device availability -  device available: False - new availability: False - changed: False
2022-08-30 20:06:34.489 DEBUG (MainThread) [zigpy_znp.api] Sending request: SYS.ResetReq.Req(Type=<ResetType.Soft: 1>)
2022-08-30 20:06:34.490 DEBUG (MainThread) [zigpy_znp.api] Request has no response, not waiting for one.
2022-08-30 20:06:34.491 DEBUG (MainThread) [zigpy_znp.uart] Closing serial port

dumpfheimer commented 2 years ago

Aah, it's the "Update configuration" button in the UI (Integration page). Is this to be expected?

puddly commented 2 years ago

Yeah. It fully reloads ZHA when adjusting configuration to be safe but we can probably make it less intrusive, eventually.

dumpfheimer commented 2 years ago

Is the pyserial .write method thread/concurrency safe?

puddly commented 2 years ago

Almost nothing in asyncio is threadsafe so I wouldn't rely on it.
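If a write ever does have to originate from another thread, the usual asyncio approach is to hand it to the event loop with `loop.call_soon_threadsafe` rather than calling the transport directly. A minimal sketch, assuming a hypothetical `FakeTransport` in place of the real serial transport (it only records what was written):

```python
import asyncio
import threading


class FakeTransport:
    """Hypothetical stand-in for a serial transport; records written bytes."""

    def __init__(self) -> None:
        self.written = []

    def write(self, data: bytes) -> None:
        self.written.append(data)


async def main() -> list:
    loop = asyncio.get_running_loop()
    transport = FakeTransport()
    done = asyncio.Event()

    def worker() -> None:
        # Runs in another thread: never call transport.write() directly here.
        for i in range(3):
            loop.call_soon_threadsafe(transport.write, bytes([i]))
        # Scheduled last, so it runs after all of the writes above.
        loop.call_soon_threadsafe(done.set)

    thread = threading.Thread(target=worker)
    thread.start()
    await done.wait()
    thread.join()
    return transport.written
```

All writes then execute on the event loop thread, in the order they were scheduled.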

dumpfheimer commented 2 years ago

I ordered a CC1352P7. It seems to be pretty much the same as the P2 except that it has more memory.

On a side note: yesterday I tried and succeeded in creating a backup from ZNP and restoring it on my Conbee II. This is ridiculously genius! (Reverted back because ZNP seemed to work much more reliably)

dumpfheimer commented 1 year ago

Got the board, made a P7 firmware with some changes from koenkk's. Is up and running :-) for now..

dumpfheimer commented 1 year ago

@puddly is zigpy-znp purposely suppressing route discovery mechanisms? https://github.com/zigpy/zigpy-znp/blob/824c2b2ade1e2ecfeb55087b9375a1df33eebb34/zigpy_znp/zigbee/application.py#L292

If my quick search was correct the other libraries seem to not use a similar flag?

dumpfheimer commented 1 year ago

I now seem to have an extremely well performing network in comparison to before with:

puddly commented 1 year ago

> If my quick search was correct the other libraries seem to not use a similar flag?

If I remember correctly, it was used by other libraries in the past, though incorrectly named: https://github.com/Koenkk/zigbee-herdsman/search?q=DISCV_ROUTE

MTORR are broadcast periodically by the coordinator (check with a Zigbee sniffer), in addition to being explicitly requested by zigpy-znp when a device is unreachable. I believe the original reasoning was to reduce unnecessary runtime network traffic.

dumpfheimer commented 1 year ago

Just for completeness, here are the definitions from zigpy, z2m and z-stack:

zigpy:

    SUPPRESS_ROUTE_DISC_NETWORK = 0x20      # dec 32
    SKIP_ROUTING = 0x80             # dec 128

z2m:

    DISCV_ROUTE: 32,
    SKIP_ROUTING: 128

Z-Stack Stack/af/af.h:

    #define AF_SUPRESS_ROUTE_DISC_NETWORK      0x20   // Supress Route Discovery for intermediate routes
                                                      // (route discovery preformed for initiating device)
    #define AF_SKIP_ROUTING                    0x80   // dec 128
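To make the bit values above concrete, here is a small sketch that models the two options as a Python `IntFlag` and decodes a raw options byte. The class name `TxOptions` and the helper `describe` are made up for illustration; only the two values come from the definitions above.

```python
from enum import IntFlag


class TxOptions(IntFlag):
    """Illustrative subset; values taken from the definitions quoted above."""

    NONE = 0x00
    SUPPRESS_ROUTE_DISC_NETWORK = 0x20  # dec 32, DISCV_ROUTE in z2m
    SKIP_ROUTING = 0x80  # dec 128


def describe(options: int) -> list:
    """Return the names of the known flags set in a raw options byte."""
    known = (TxOptions.SUPPRESS_ROUTE_DISC_NETWORK, TxOptions.SKIP_ROUTING)
    return [flag.name for flag in known if options & flag]
```

For example, `describe(0x20)` reports only the route discovery flag, while `describe(0xA0)` reports both.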

It seems like the search does not find any usages of the option. Could of course be in another project, though.

It seems to me from the comment in af.h that AF_SUPRESS_ROUTE_DISC_NETWORK should be used during joining only?

puddly commented 1 year ago

> It seems to me from the comment in af.h that AF_SUPRESS_ROUTE_DISC_NETWORK should be used during joining only?

The only documentation is that single comment and from what I recall, these flags are processed by the closed-source portions of Z-Stack. My understanding is that it disables unnecessary unicast route discovery requests, since Z-Stack will be doing its own route discovery broadcasts.

There are discussions about the different approaches to routing and their use cases within the Z-Stack developer guide: Z-Stack 3.0 Developer's Guide.pdf

dumpfheimer commented 1 year ago

from Stack/af/af.c:

  if ( options & AF_SUPRESS_ROUTE_DISC_NETWORK )
  {
    req.discoverRoute = DISC_ROUTE_INITIATE;
  }
  else
  {
    req.discoverRoute = AF_DataRequestDiscoverRoute;
  }

from Stack/nwk/nl_mede.h:

// Route Discovery Options
#define DISC_ROUTE_NONE     0x00  // Don't discover route
#define DISC_ROUTE_NETWORK  0x01  // If a route is needed, the device (also
                                  // intermediate router) will issue  a route
                                  // disc request.
#define DISC_ROUTE_INITIATE 0x04  // Only the source router initiates route req.

Also: AF_DataRequestDiscoverRoute seems to always be DISC_ROUTE_NETWORK

So, I would read it this way: If the flag is SET: only the source router initiates a route request. If the flag is NOT SET: if a route is needed, the device (also an intermediate router) will issue a route discovery request.
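The same reading, written out as a small Python model of the af.c branch quoted above. This only mirrors the quoted snippet and constants; treating `AF_DataRequestDiscoverRoute` as always equal to `DISC_ROUTE_NETWORK` is the assumption from this thread, not something confirmed by TI documentation, and the function name `discover_route` is made up for illustration.

```python
# Constants copied from the nl_mede.h and af.h excerpts quoted in this thread.
DISC_ROUTE_NONE = 0x00
DISC_ROUTE_NETWORK = 0x01
DISC_ROUTE_INITIATE = 0x04
AF_SUPRESS_ROUTE_DISC_NETWORK = 0x20

# Assumption from the thread: this global seems to always hold DISC_ROUTE_NETWORK.
AF_DATA_REQUEST_DISCOVER_ROUTE = DISC_ROUTE_NETWORK


def discover_route(options: int) -> int:
    """Mirror the af.c branch: pick the route discovery mode from the options."""
    if options & AF_SUPRESS_ROUTE_DISC_NETWORK:
        return DISC_ROUTE_INITIATE  # only the source router initiates a route request
    return AF_DATA_REQUEST_DISCOVER_ROUTE  # intermediate routers may also discover
```

With the flag set, `discover_route(0x20)` yields `DISC_ROUTE_INITIATE`; without it, the result is `DISC_ROUTE_NETWORK`. Under this model neither path ever selects `DISC_ROUTE_NONE`.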

Not sure what to do with this information, though 😂