Node discovery takes 3 minutes for a pairing

triller-telekom commented 3 years ago

I am currently investigating a problem where it takes very long to discover a zigbee device (3 minutes and 9 seconds), i.e. reading out its descriptors, endpoints, etc.

What I found out so far is, that there is a huge gap from scheduling a task, until it actually starts:

2020-10-07T09:09:48.771+0200   476  DEBUG          [symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.app.discovery.ZigBeeNodeServiceDiscoverer] - 14B457FFFEC4E8CB: Node SVC Discovery: scheduled [NWK_ADDRESS, POWER_DESCRIPTOR, NODE_DESCRIPTOR, ACTIVE_ENDPOINTS]

2020-10-07T09:10:30.289+0200   476  DEBUG          [symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.app.discovery.ZigBeeNodeServiceDiscoverer] - 14B457FFFEC4E8CB: Node SVC Discovery: running NWK_ADDRESS

This continues for basically all of the above tasks, so it sums up to the 3 minutes.

The only explanation I have so far is:

/**
* Executor service to execute update threads for discovery or mesh updates etc.
* We use a {@link ZigBeeExecutors.newScheduledThreadPool} to provide a fixed number of threads as otherwise this
* could result in a large number of simultaneous threads in large networks.
*/
private final ScheduledExecutorService executorService = ZigBeeExecutors.newScheduledThreadPool(6,
            "NetworkManager");

So there are only 6 threads available for such tasks to be run in parallel.

On the startup of the ZigbeeDiscoveryExtension, we create a ZigBeeNetworkDiscoverer and tell it to start a node discovery from network node "0". We collect all associated devices and add those to the network manager. I am assuming that the network state must be ONLINE at this point, because I assume the listener for nodeAdded will be triggered and thus we start a discovery (with tasks occupying threads from the pool, mentioned above) for all "associated nodes". That is because ZigbeeDiscoveryExtension.startDiscoveryIfNecessary() has no discoverer for each node and thus it creates them.

The other scenario where we could start a discovery for all nodes, is when we load the nodes from the storage, however, I think the network state is not ONLINE at this point in time and thus no discovery will be triggered.

Also: I have identified 4 "broken" ZigBeeNodes in the particular system, which are nodes that exist, but only have a IEEE address and no endpoints, descriptors, etc. So they are left overs from a broken pairing/deletion of a device, whatever. Those 4 devices would take 4 threads (continuously failing because they are not reachable) and I am wondering why the 2 other threads are also occupied. The only explanation I have is what I wrote above: That we start a discovery for ALL nodes on startup.

So, i think we might run into a problem if there are 6 devices in the network, that not reachable at startup. Because if they occupy all threads and run into timeouts -> retries -> timeouts, it will take a long time until we are able to start a discoverer for a pairing of a new device.

triller-telekom commented 3 years ago

Some further investigations on this, as I think I found the reason why discoveries are started on the startup of ZSS:

Once we set the network to ONLINE in ZigBeeNetworkManager#setNetworkStateOnline(), we notify all nodes:

https://github.com/zsmartsystems/com.zsmartsystems.zigbee/blob/84042f7bde4dc1f9476e36ebfb1bd2ae4d18b812/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/ZigBeeNetworkManager.java#L1330-L1337

Since the nodeDiscoverer is null here:

https://github.com/zsmartsystems/com.zsmartsystems.zigbee/blob/84042f7bde4dc1f9476e36ebfb1bd2ae4d18b812/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/app/discovery/ZigBeeDiscoveryExtension.java#L252-L254

We start a new discovery for this node. And we add the meshUpdateTasks, defined as:

https://github.com/zsmartsystems/com.zsmartsystems.zigbee/blob/84042f7bde4dc1f9476e36ebfb1bd2ae4d18b812/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/app/discovery/ZigBeeDiscoveryExtension.java#L82-L89

here:

https://github.com/zsmartsystems/com.zsmartsystems.zigbee/blob/84042f7bde4dc1f9476e36ebfb1bd2ae4d18b812/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/app/discovery/ZigBeeDiscoveryExtension.java#L295

Now I am wondering what we should do about this comment:

We also default to requesting the network address to ensure that it hasn't changed.

If we always request the network address per node, we will add one task to the thread pool per device on startup. If there are multiple nodes that re currently turned off (no power for example), we will will try to reach them multiple times, due to our retries until we eventually give up. During these retries, there are some chances for other tasks to be run, however, they might take some time, see my comment above.

If we have 6 or more unreachable devices, the situation becomes even worse, because then all threads from the pool might be occupied and if the user wants to start a pairing in that time-span, chances are high that this will take a very long time (and the application that initiated the pairing might have even timed out then).

So the questions would be:

Do we really need to request the network address for mesh tasks? (if we do not have it, we request it anyway, see https://github.com/zsmartsystems/com.zsmartsystems.zigbee/blob/84042f7bde4dc1f9476e36ebfb1bd2ae4d18b812/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/app/discovery/ZigBeeNodeServiceDiscoverer.java#L743-L746)
If we do need to request the network address for mesh to ensure we have the up2date one, in case it has changed, can we skip this on startup and only do it on regular mesh update runs?

cdjackson commented 3 years ago

Do we really need to request the network address for mesh tasks?

The theory was that the NWK address can change. This is especially true for battery devices where they can change parent, or could leave/rejoin for a number of reasons, and this will result in them having a different address.

All that said, I'm happy to discuss any suggestions here @triller-telekom as I do think we're probably being too conservative. Probably we need a more optimistic approach - eg not to rediscover the network so often, and not to assume that the NWK address has changed, and only perform these sort of checks on exception (eg if we have a transaction failure or something).

These concepts came originally from the ZigBee4Java project, and possibly the ZigBee4Osgi before that, but I do think there is a better way to manage the discovery of devices than crawling the neighbough tables etc and reducing some of this "noise" would likely improve things...

Ok... So...

I guess there are 2 issues here?

the number of threads is potentially an issue. We could either just increase the number of thread s (probably it's ok) or we could look to recode this to run in a single thread for all devices (which would be a lot more work). I think that increasing the number of threads here is probably ok.
the discovery needs further work. I've done some improvements on this recently with removal of some of the broadcasts, and there is also #1108 that is also linked here, and I think we can probably reduce the crawling of the neighbour tables to find devices - this is really only needed if you've completely lost the network and need to find all devices again, although in time they are likely to be rediscovered anyway.

I'm open to suggestions / PRs here :)

triller-telekom commented 3 years ago

All that said, I'm happy to discuss any suggestions here @triller-telekom as I do think we're probably being too conservative.

I agree with that. But I also think that there is nothing wrong with being conservative, because this way we have a "working" system, even if something changes in the network.

not to rediscover the network so often, and not to assume that the NWK address has changed, and only perform these sort of checks on exception (eg if we have a transaction failure or something).

I had something like this in mind while analyzing it, sounds plausible to me.

To your 2 points:

I am fine with increasing the number of threads, although we know that this is only weakening the problem and not "solving" it.
Re-working the discovery, by not crawling the neighbour tables is certainly a bigger task and change. However, I agree that we do not have to traverse all neighbour tables on every startup. Maybe we need a mechanism that can be triggered by the application to rediscover a node (I think we have that already, as mentioned in the linked openHAB ticket). So every time the application wants to communicate with an IEEE address where we do not have a corresponding ZigBeeNode for, the apllication has to trigger such a rediscovery.

Back to my questions above, as I think they might be a first (and easier) step towards a slimmer discovery:

Do you think its feasible to remove the network address request task from the mesh discovery completely, as we do add that task later anyway, in the case we do not know the nwk address? If not, then we certainly should build in a mechanism to skip this on startup, so not flood the network or our transaction manager with too many requests.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

zsmartsystems / com.zsmartsystems.zigbee

Node discovery takes 3 minutes for a pairing #1153