Ember: send ezsp request for extended APS timeouts to sleepy end devices, and adjust application level timeouts

sesuncedu commented 5 years ago

[note: this is observed using the Nortek dongle]

Issue

Messages to rxoff devices may take ~8 seconds before being retrieved from the parent (5 seconds seems to be a common long-poll interval).

The ncp does not seem to automatically populate the address table with these devices, and does not automatically use extended timeouts.

As a result, the ncp retransmits messages too quickly, often resulting in multiple copies of the message being delivered to the end device, multiple acknowdgements coming back, and spurious delivery failures being reported.

If extended timeouts are used, an APS transaction will take be given ~9 seconds before being timed out. If three failed attempts are made, a delivery failure may not be reported for ~27 seconds.

fix

The host can send a ezsp command to set the extended timeout flag for the destination address before sending a unicast to a sleepy node. This approach is the simplest and is used by the python bellows library. Until ash windowing is done, it does add an extra dongle rtt, but this should still be cheaper than extra retries.

A more sophisticated approach would be to cache this information in the ncp address table. It is possible that the ezsp request adds an entry to the table - the reference manual is not clear on this. The setExtendedTimeout request is documented together with other address table requests, but does not accept or return an address table index. The only other way of setting the extended Timeout flag for an address is via replaceAddressTableEntry. This method returns the previous contents of the table entry at an index. There is a method to fetch the extended timeout setting by EUI64, but not for a given address table id.

cdjackson commented 5 years ago

If you have a suggestion on how to set this, please let me know. This is something I've looked at every now and then, but it's not so simple unfortunately since the upper layers need to provide this to the stack, and since it's an Ember specific issue it's not super nice to add a specific function in the transport interface.

If it's needed though, then we can look at doing so.

sesuncedu commented 5 years ago

I'm working this problem now.

The approach I'm taking is to add a sendUnicast method to ZigBeeTransportTransmit that takes a ZigBeeNode as argument. This supports the immediate use case, is quite general (if a node is available, the destination must be unicast), and allows for a trivial default implementation that simply delegates to the existing sendCommand.

cdjackson commented 5 years ago

So are you proposing to stop using the current sendCommand method and use the new unicast method? That doesn't seem like a good option, but maybe I'm not sure how you plean to use this.

I would prefer to keep this separate - eg if needed, then we add a method to allow configuring of a node at the transport layer, and not to make major changes to the API that will break all implementations.

sesuncedu commented 5 years ago

This is non breaking (MINOR) because the method is in addition to the old api, and providing a default implementation of the method means that other implementations don't have to be changed unless they want to.

The bigger issues you bring up are big issues. The layering is a bit twisted - the level 3/level 2 network/mac information only being available above the level 4 aps interface feels odd (ZDO/P seems conceptually all over the shop - it's like having ARP being layered over UDP).

An alternative design could make all the network related information available to the transport level, leaving the calls in place. That would avoid having to change callers that want to use the advanced functionality, but there's only one call site that would be changed, so it's not a huge win.

I'll get the first version done, then maybe will try the second one.

There's a third approach, which would be to have the dongle driver collect and manage its own node info base. It's easy enough to do, but would be a bit wasteful.

The aps timeout in the spec is useless if there are sleepy devices, and it looks like many stacks do the extended timeout thing. There's a note in one of the PICS for stacks that says something about having nodes that might get ack-required messages poll faster than the aps timeout but that would make rx_off pretty useless. I can't wait to get a fix in. Watching triplicate messages getting send from the parent, and triplicate acks come back in response (with a route record escort if low-ram concentrator is on) makes the network look a mess).

cdjackson commented 5 years ago

I’d probably still prefer to stick with the existing method rather than to split out different methods for sending unicast and sending other frames which I guess is what you will have? I think it would be better to have a way to manage the nodes at the transport level directly rather than in the sendCommand method. I think this is what I’d prefer rather than interfering with the sendCommand method.

Maybe I should wait to see your implementation and then comment further though...

I agree that ZDO layering is a bit messy, but that’s the way ZigBee define it - it does span multiple layers unfortunately.

In some ways, keeping this issue contained in the dongle implementation is cleaner (for now anyway), although one day having this information available higher up might be nice. I’m just not sure if the dongle layer has access to the information required to make this decision without decoding descriptors or performing more complex activities itself?

On 11 Oct 2019, at 05:12, Simon Spero notifications@github.com wrote:

This is non breaking (MINOR) because the method is in addition to the old api, and providing a default implementation of the method means that other implementations don't have to be changed unless they want to.

The bigger issues you bring up are big issues. The layering is a bit twisted - the level 3/level 2 network/mac information only being available above the level 4 aps interface feels odd (ZDO/P seems conceptually all over the shop - it's like having ARP being layered over UDP).

An alternative design could make all the network related information available to the transport level, leaving the calls in place. That would avoid having to change callers that want to use the advanced functionality, but there's only one call site that would be changed, so it's not a huge win.

I'll get the first version done, then maybe will try the second one.

There's a third approach, which would be to have the dongle driver collect and manage its own node info base. It's easy enough to do, but would be a bit wasteful.

The aps timeout in the spec is useless if there are sleepy devices, and it looks like many stacks do the extended timeout thing. There's a note in one of the PICS for stacks that says something about having nodes that might get ack-required messages poll faster than the aps timeout but that would make rx_off pretty useless. I can't wait to get a fix in. Watching triplicate messages getting send from the parent, and triplicate acks come back in response (with a route record escort if low-ram concentrator is on) makes the network look a mess).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zsmartsystems/com.zsmartsystems.zigbee/issues/787?email_source=notifications&email_token=AAH6IQ7FVZMXNFIL4K4WXITQN74JXA5CNFSM4I6FRCAKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEA6WH2A#issuecomment-540894184, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAH6IQ4GGLTRHWHHNYKLSI3QN74JXANCNFSM4I6FRCAA.

sesuncedu commented 5 years ago

I have a version adds a method to transport transmit to accept a ZigBeeNetworkManager, and have modified the ember code to store it, and use the info to set the extended timeout flag. Now the only retransmits are to a motion sensor whose battery died too young.

cdjackson commented 5 years ago

I’m not so sure that’s such a good idea (although maybe I can be convinced). In the first instance I would have preferred to have maintained separation between the dongle and stack with a clean interface rather than just allowing full access from the transport layer back into the main upper layers.

My feeling is that the upper layers should explicitly control the transport layer as much as possible so that we avoid autonomous “stuff” happening that isn’t expected by the higher layers. Maybe you can convince me that’s unnecessary, although I would still like to keep a clean interface between layers, and just providing the network manager violates that.

On 12 Oct 2019, at 04:06, Simon Spero notifications@github.com wrote:

I have a version adds a method to transport transmit to accept a ZigBeeNetworkManager, and have modified the ember code to store it, and use the info to set the extended timeout flag. Now the only retransmits are to a motion sensor whose battery died too young.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zsmartsystems/com.zsmartsystems.zigbee/issues/787?email_source=notifications&email_token=AAH6IQ5ZPZMMLBH3AL3LBJTQOE5KFA5CNFSM4I6FRCAKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBBT54A#issuecomment-541277936, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAH6IQ62BRVTJPU4BGLJEX3QOE5KFANCNFSM4I6FRCAA.

zsmartsystems / com.zsmartsystems.zigbee

Ember: send ezsp request for extended APS timeouts to sleepy end devices, and adjust application level timeouts #787

Issue

fix