End Nodes: Guidelines for communication with S2+Supervision

AlCalzone commented 1 year ago

A lot of issues in users' networks come down to too much parallel communication. While we have a few guidelines on how to configure devices optimally to prevent this, that is often not possible due to how the devices are implemented. Especially when S2 and Supervision are involved, things can easily go sideways.

Z-Wave Communication Basics

To better understand the issue, let's take a look at how Z-Wave communication works at a high level.

Basic communication flow:

sequenceDiagram
    participant Z as Z-Wave JS
    participant C as Controller
    participant N as End Node

    Z->>C: Send this command to Node
    activate Z
    C->>Z: I've started sending the command
    activate C
    note over C: Tries to reach N
    C->>N: Here's a command
    N->>C: ACK
    C->>Z: Node got the command
    deactivate Z
    deactivate C

    note over Z,C: ready for the next command

This process is typically very fast (~10ms), but can take several seconds when the controller has trouble reaching the node.

The important part to remember here is that this entire flow needs to be completed before another command can be sent.
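
To make the effect of this one-at-a-time rule concrete, here is a minimal sketch. It is a toy model, not Z-Wave JS code: `Command`, `duration_ms` and `completion_times` are names made up for illustration, and the durations are simply the rough figures mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Command:
    node_id: int
    # Hypothetical: time from handing the command to the controller
    # until the ACK (or a failure) comes back
    duration_ms: int


def completion_times(commands):
    """Time (ms) at which each command completes when they are sent strictly one after another."""
    finished = []
    now = 0
    for cmd in commands:
        # The next command cannot start before the previous flow is done
        now += cmd.duration_ms
        finished.append(now)
    return finished


# Node 3 is hard to reach, so Node 4 has to wait although it would respond quickly
queue = [Command(2, 10), Command(3, 3000), Command(4, 10)]
print(completion_times(queue))  # [10, 3010, 3020]
```

A single slow node delays everything queued behind it, which is why the rest of this thread cares so much about what occupies the controller.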

Basic communication flow with status updates:

Even if the node got the command, this does not mean it understood it, let alone executed it. However, applications usually want to know if a command was executed, e.g. if a light was turned on or a door was unlocked. To guarantee that, Z-Wave JS waits for the node to report its new status. If that doesn't happen within a few seconds, it queries the current status. For simplicity, the controller and protocol-level ACKs are omitted from the following flow:

sequenceDiagram
    participant Z as Z-Wave JS
    participant N as End Node

    Z->>N: Send SET command

    note over N: processes command

    opt If node does not report status
        Z->>N: Send GET command
    end

    N->>Z: REPORT status

That status update does not require a response outside of the protocol-level ACK, which is sent automatically by the controller.

When the node does not automatically send status reports (or does not understand the command), this can lead to a couple of seconds of uncertainty until the status has been queried.
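
The SET / optional GET / REPORT flow can be sketched roughly like this. This is not the Z-Wave JS implementation: `FakeNode`, `set_and_verify` and `timeout_s` are hypothetical names, and the fake node answers instantly instead of really waiting.

```python
class FakeNode:
    """Stand-in for an end node that counts application-level frames (assumption, not a real API)."""

    def __init__(self, reports_unsolicited):
        self.reports_unsolicited = reports_unsolicited
        self.value = 0
        self.pending = []  # reports that have not been received yet
        self.frames = 0

    def set(self, value):
        self.frames += 1  # the SET command
        self.value = value
        if self.reports_unsolicited:
            self.frames += 1  # the unsolicited REPORT
            self.pending.append(value)

    def get(self):
        self.frames += 2  # the GET command and the REPORT it triggers
        self.pending.append(self.value)

    def wait_for_report(self, timeout_s):
        # A real implementation would block for up to timeout_s here
        return self.pending.pop(0) if self.pending else None


def set_and_verify(node, value, timeout_s=2.0):
    """Send a SET, wait for a status report and fall back to a GET if none arrives."""
    node.set(value)
    report = node.wait_for_report(timeout_s)
    if report is None:
        node.get()
        report = node.wait_for_report(timeout_s)
    return report


chatty, silent = FakeNode(True), FakeNode(False)
print(set_and_verify(chatty, 99), chatty.frames)  # 99 2
print(set_and_verify(silent, 99), silent.frames)  # 99 3
```

The silent node costs one extra frame, plus the full timeout before the GET is even sent.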

Supervised commands:

By using Supervision CC, the node is required to respond whether it understood and executed the command:

sequenceDiagram
    participant Z as Z-Wave JS
    participant N as End Node

    Z->>N: Send SET command, with Supervision

    note over N: processes command

    N->>Z: Supervision REPORT, including command status

This eliminates the uncertainty and reduces the number of commands to exactly 2 (instead of 2 or 3). Since Z-Wave has very limited bandwidth shared by up to 232 nodes, reducing the number of commands needed for each action is beneficial.

Encrypted commands:

When encryption is involved, things become a little more complicated. The older standard Security S0 is notorious for adding up to 2 commands overhead for each exchanged command, because it requests a nonce from the target:

sequenceDiagram
    participant Z as Z-Wave JS
    participant N as End Node

    Z->>N: GET Nonce

    N->>Z: Nonce

    note over Z: encrypts command

    Z->>N: encrypted command

    note over N: tries to decrypt command

Like before, it is unclear if the target node understood the command unless it sends an update, so this exchange may be followed by a GET and a REPORT, each preceded by a fresh nonce exchange.
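
As a back-of-the-envelope check of that overhead, the following sketch counts application-level frames, assuming every S0 command is preceded by exactly one nonce request and one nonce response. The function name is made up for illustration, and protocol-level ACKs are ignored as in the diagrams.

```python
S0_NONCE_OVERHEAD = 2  # GET Nonce + Nonce, before each encrypted command


def frames_needed(commands, s0=False):
    """Application-level frames for a number of exchanged commands."""
    per_command = 1 + (S0_NONCE_OVERHEAD if s0 else 0)
    return commands * per_command


# SET + unsolicited REPORT
print(frames_needed(2), frames_needed(2, s0=True))  # 2 6
# SET + GET + REPORT
print(frames_needed(3), frames_needed(3, s0=True))  # 3 9
```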

Security S2 does this better by establishing a shared encryption state which does not need any nonce exchange unless there are communication failures involved and one party gets out of sync.

sequenceDiagram
    participant Z as Z-Wave JS
    participant N as End Node

    note over Z,N: establish shared state

    Z->>N: encrypted command
    note over N: decrypt command

    N->>Z: encrypted response
    note over Z: decrypt response

In case of a decryption failure, the target responds with a nonce report, which will cause the sender to re-transmit its command including its nonce to re-sync the shared state:

sequenceDiagram
    participant Z as Z-Wave JS
    participant N as End Node

    note over Z,N: encryption out of sync

    Z->>N: encrypted command
    note over N: fails to decrypt
    N->>Z: nonce report

    Z->>N: re-transmit encrypted command, with nonce

    note over N: decrypt command
    note over Z,N: shared encryption state in sync

So in order to handle cases where the target cannot decrypt the command, the sender would have to wait for a potential nonce report, so it can re-transmit the command:

sequenceDiagram
    participant Z as Z-Wave JS
    participant N as End Node

    Z->>N: encrypted command
    note over Z: wait for nonce

    alt node could decrypt
        note over N: decrypt command
        note over Z: timeout
    else node failed to decrypt
        note over N: fails to decrypt
        N->>Z: nonce report
        Z->>N: re-transmit encrypted command, with nonce
        note over N: decrypt command
    end

    note over Z,N: ...next commands...

While this does work, it introduces unnecessary delays. The nonce report can easily take 0.5 to 1s to be delivered, so the sender should wait at least this long, even if the command was processed within 10ms. This is fine when only a few messages need to be delivered (e.g. 2-3 reports from a node to the controller), but very noticeable when trying to control many devices (e.g. when a user wants to turn on 10+ devices).
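
A quick worked example with the numbers above shows how the waiting adds up. This is a simplified model that assumes commands are sent strictly one after another and that every command waits the full time for a potential nonce report; the function and parameter names are made up.

```python
def total_time_ms(commands, send_ms, nonce_wait_ms=0):
    """Total time to send `commands` sequentially, each followed by a fixed wait."""
    return commands * (send_ms + nonce_wait_ms)


# 10 devices, 10 ms per command, no waiting for nonce reports
print(total_time_ms(10, 10))  # 100
# Same, but waiting 1 s after each command for a potential nonce report
print(total_time_ms(10, 10, nonce_wait_ms=1000))  # 10100
```

Ten devices go from about a tenth of a second to about ten seconds, which is the delay users notice.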

Supervision to the rescue?

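Again, Supervision CC can help with this. It requires the target to respond, so the sender can move on as soon as the Supervision Report arrives instead of waiting out the Nonce Report timeout. This increases the throughput for successful transmissions: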

sequenceDiagram
    participant Z as Z-Wave JS
    participant N as End Node

    Z->>N: encrypted command, with Supervision
    note over N: processes command
    N->>Z: encrypted Supervision REPORT

    note over Z,N: ...next commands...

AlCalzone commented 1 year ago

The Problem, Part 1

This all seems fine, but a problem arises when end nodes use S2+Supervision for their reporting. Remember that Supervision requires the target to respond, so a typical status update will look like this:

sequenceDiagram
    participant Z as Z-Wave JS
    participant N as End Node

    N->>Z: encrypted command, with Supervision
    note over Z: processes command
    Z->>N: encrypted Supervision REPORT

Compared to end nodes, Z-Wave JS / the controller often sends many more commands, e.g. when controlling multiple devices at once. Incoming supervised commands are answered ASAP, but only once the send->ACK flow currently in progress has completed.

We'll assume that all following commands are using S2 encryption.

sequenceDiagram
    participant Z as Z-Wave JS
    participant N2 as Node 2
    participant N3 as Node 3
    participant N4 as Node 4

    Z->>N2: command
    activate Z
    note left of Z: busy / waiting for ACK
    N4-->>Z: supervised command
    activate N4
    note right of N4: waiting for Supervision REPORT
    N2->>Z: ACK
    deactivate Z

    Z-->>N4: Supervision REPORT
    deactivate N4

    Z->>N3: command
    activate Z
    note left of Z: busy / waiting for ACK
    N3->>Z: ACK
    deactivate Z

Impatient End Nodes vs. sensitive 700 series controllers:

It can happen that sending a command to a node takes longer than expected. Multiple seconds are not unheard of, especially in imperfect networks. Often, end nodes time out very quickly while waiting for the Supervision Report and re-transmit their command multiple times:

sequenceDiagram
    participant Z as Z-Wave JS
    participant N2 as Node 2
    participant N4 as Node 4

    note over N2: flaky / slow<br>connection

    Z->>N2: command
    activate Z
    note left of Z: busy / waiting for ACK

    N4-->>Z: supervised command
    activate N4
    note right of N4: waiting
    N4-->>Z: supervised command (re-transmit)
    note right of N4: 500ms later
    N4-->>Z: supervised command (re-transmit)
    note right of N4: 500ms later
    N4-->>Z: supervised command (re-transmit)
    note right of N4: 500ms later
    N4-->>Z: supervised command (re-transmit)

    N2->>Z: ACK
    deactivate Z

    Z-->>N4: Supervision REPORT
    deactivate N4

I've seen this be repeated 10x or more.
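
To get a feeling for the numbers, here is a small sketch that counts how often an impatient node re-transmits while the controller is stuck in another transmission. It assumes the supervised command arrives right when the controller becomes busy and that the node re-transmits at a fixed interval with no limit. Both are simplifications, and the names are hypothetical.

```python
def count_retransmits(busy_ms, retry_interval_ms=500):
    """Re-transmits sent by a node before the controller is free to answer."""
    count = 0
    t = retry_interval_ms
    while t < busy_ms:
        count += 1
        t += retry_interval_ms
    return count


print(count_retransmits(10))    # 0
print(count_retransmits(2100))  # 4
print(count_retransmits(5200))  # 10
```

With a 500ms interval, a controller that is busy for about five seconds already collects ten duplicates from a single node.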

In contrast to 500 series controllers, 700 series controllers seem to have more trouble transmitting when there's lots of traffic on the network. In situations like the one above, the incoming traffic itself often causes the outgoing message to take much longer to be transmitted, which causes more messages to be re-transmitted, which causes the outgoing message to take even longer, and so on, up to the point where the controller gives up and won't transmit at all for a short time.

Supervision for everything:

This problem can be magnified by nodes which report everything using Supervision. Imagine a situation where a user turns on 5-10 power metering relays, and all of them immediately send supervised Meter Reports (V, A, W). We're looking at 15-30 incoming commands which all expect a response within a second or so, all while the controller is potentially still busy controlling devices. If all of them re-transmit eagerly, this is a flood of messages that quickly brings the entire network to its knees.

sequenceDiagram
    participant Z as Z-Wave JS
    participant N2 as Node 2
    participant N3 as Node 3
    participant N4 as Node 4
    participant N5 as Node 5

    Z->>N2: Turn ON
    activate Z
    N2->>Z: ACK
    deactivate Z
    note left of Z: 1st device ON

    Z->>N3: Turn ON
    activate Z
    N2-->>Z: supervised (A) report
    activate N2
    N3->>Z: ACK
    deactivate Z
    note left of Z: 2nd device ON

    Z-->>N2: Supervision REPORT (A)
    deactivate N2

    N2-->>Z: supervised (W) report
    activate N2
    Z-->>N2: Supervision REPORT (W)
    deactivate N2

    Z->>N4: Turn ON
    activate Z
    N3-->>Z: supervised (A) report
    activate N3
    N3-->>Z: supervised (A) report (re-transmit)
    N4->>Z: ACK
    deactivate Z
    note left of Z: 3rd device ON

    N4-->>Z: supervised (A) report
    activate N4

    N4-->>Z: supervised (A) report (re-transmit)

    Z-->>N3: Supervision REPORT (A)
    deactivate N3

    N3-->>Z: supervised (W) report
    activate N3

    Z-->>N4: Supervision REPORT (A)
    deactivate N4

    N4-->>Z: supervised (W) report
    activate N4

    Z-->>N3: Supervision REPORT (W)
    deactivate N3

    N4-->>Z: supervised (W) report (re-transmit)
    Z-->>N4: Supervision REPORT (W)
    deactivate N4

    note left of Z: Longer and longer delays<br>between commands

    Z->>N5: Turn ON
    activate Z
    N5->>Z: ACK
    deactivate Z
    note left of Z: 4th device ON

... imagine this for 10 or more devices. Suddenly the end nodes control the communication in the network, and not the controller.
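
The size of such a burst is easy to estimate. The sketch below assumes that every supervised report is answered with exactly one Supervision Report and that each report is re-transmitted the same number of times. Real devices will differ, and the function name is made up.

```python
def burst_frames(relays, reports_per_relay, retransmits_per_report=0):
    """Return (incoming supervised frames, Supervision Reports the controller must send)."""
    supervised = relays * reports_per_relay
    incoming = supervised * (1 + retransmits_per_report)
    responses = supervised
    return incoming, responses


print(burst_frames(5, 3))      # (15, 15)
print(burst_frames(10, 3))     # (30, 30)
print(burst_frames(10, 3, 2))  # (90, 30)
```

All of these frames compete with the commands the controller is still trying to send.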

AlCalzone commented 1 year ago

The Problem, Part 2

Remember that I wrote how Supervision is a way to avoid unnecessary status queries because the Supervision Report tells the controlling node that the controlled node has executed the command and is now in the desired state? It turns out that many devices still send an unsolicited update with their new status, even if controlled using Supervision. That unsolicited update uses Supervision of course, so each command now needs at least 4 instead of 2 messages:

sequenceDiagram
    participant Z as Z-Wave JS
    participant N2 as Node 2

    Z->>N2: Dim to 50% (Supervised)
    activate Z
    N2-->>Z: ACK
    deactivate Z

    N2->>Z: Supervision Report: SUCCESS
    note left of Z: Knows that Node 2<br> is at 50% brightness

    note over Z,N2: ↓ This is completely unnecessary ↓
    N2->>Z: reports brightness (supervised)
    Z-->>N2: ACK
    Z->>N2: Supervision REPORT: SUCCESS

This quickly gets ugly when multiple nodes are involved and the ones sending unnecessary supervised reports get impatient:

sequenceDiagram
    participant Z as Z-Wave JS
    participant N2 as Node 2
    participant N3 as Node 3

    Z->>N2: Dim to 50% (supervised)
    activate Z
    N2-->>Z: ACK
    deactivate Z

    Z->>N3: Dim to 50% (supervised)
    activate Z
    note left of Z: busy / waiting for ACK

    N2->>Z: Supervision Report: SUCCESS
    note left of Z: knows Node 2 is at 50%

    N2->>Z: reports 50% brightness (supervised)
    activate N2
    note over N2: 500ms later
    N2->>Z: reports 50% brightness (supervised, re-transmit)

    N3-->>Z: ACK
    deactivate Z

    N2->>Z: reports 50% brightness (supervised, re-transmit)
    Z->>N2: Supervision REPORT: SUCCESS
    deactivate N2

    N3->>Z: Supervision Report: SUCCESS
    note left of Z: knows Node 3 is at 50%

    N3->>Z: reports 50% brightness (supervised)
    activate N3
    Z->>N3: Supervision Report: SUCCESS
    deactivate N3

Again, imagine this for 10+ nodes. This is what it should look like, even if some nodes are slower to respond:

sequenceDiagram
    participant Z as Z-Wave JS
    participant N2 as Node 2
    participant N3 as Node 3
    participant N4 as Node 4
    participant N5 as Node 5
    participant N6 as Node 6
    participant N7 as Node 7
    participant N8 as Node 8
    participant N9 as Node 9
    participant N10 as Node 10

    Z->>+N2: Dim to 50% (supervised)
    activate Z
    N2-->>Z: ACK
    deactivate Z

    Z->>+N3: Dim to 50% (supervised)
    activate Z
    N2->>-Z: Supervision Report: SUCCESS
    note left of Z: knows Node 2 is at 50%
    N3-->>Z: ACK
    deactivate Z

    N3->>-Z: Supervision Report: SUCCESS
    note left of Z: knows Node 3 is at 50%

    Z->>+N4: Dim to 50% (supervised)
    activate Z
    N4-->>Z: ACK
    deactivate Z

    Z->>+N5: Dim to 50% (supervised)
    activate Z
    N5-->>Z: ACK
    deactivate Z

    N5->>-Z: Supervision Report: SUCCESS
    note left of Z: knows Node 5 is at 50%
    N4->>-Z: Supervision Report: SUCCESS
    note left of Z: knows Node 4 is at 50%

    Z->>+N6: Dim to 50% (supervised)
    activate Z
    N6-->>Z: ACK
    deactivate Z

    Z->>+N7: Dim to 50% (supervised)
    activate Z
    N6->>-Z: Supervision Report: SUCCESS
    note left of Z: knows Node 6 is at 50%
    N7-->>Z: ACK
    deactivate Z

    Z->>+N8: Dim to 50% (supervised)
    activate Z
    N8-->>Z: ACK
    deactivate Z

    Z->>+N9: Dim to 50% (supervised)
    activate Z
    N9-->>Z: ACK
    deactivate Z

    N7->>-Z: Supervision Report: SUCCESS
    note left of Z: knows Node 7 is at 50%
    N9->>-Z: Supervision Report: SUCCESS
    note left of Z: knows Node 9 is at 50%

    Z->>+N10: Dim to 50% (supervised)
    activate Z
    N8->>-Z: Supervision Report: SUCCESS
    note left of Z: knows Node 8 is at 50%
    N10-->>Z: ACK
    deactivate Z
    N10->>-Z: Supervision Report: SUCCESS
    note left of Z: knows Node 10 is at 50%

AlCalzone commented 1 year ago

The Problem, Part 3

Just combine parts 1 and 2: Nodes which...

unnecessarily use Supervision for unsolicited reports
get impatient and re-transmit when the response does not come within 500ms or so
unnecessarily report their status after being controlled using Supervision
and do so with Supervision.

That's the sad reality today.

AlCalzone commented 1 year ago

The Solution

As I wrote earlier, Z-Wave bandwidth is a scarce resource and needs to be used accordingly. So the primary goal is to avoid unnecessary traffic altogether.

Parts of this may not be applicable to all reports a device sends - after all you may need to make sure that some critical reports are received and understood - like an empty battery for a smoke sensor. But I think the above issues can be resolved with a few changes to communication strategy:

Do not send unsolicited updates for states controlled using Supervision: The controller using Supervision receives a Supervision Report and knows the state was updated. Avoiding the unnecessary report reduces the traffic by at least 50%, even more when re-transmissions are involved. Leave it to the controller to query for state updates if something is still uncertain.

Do not use Supervision for everything: As I outlined previously, with S2 it is possible to know with high certainty whether the target understood a message simply by waiting a short time for a Nonce Report. Whenever possible, end nodes should use this strategy for all non-critical reports. A positive side effect is that the communication gets spaced out a bit, reducing bursts of traffic/noise on the network. There is a chance that a report gets dropped when the ACK was received but the Nonce Report was lost. However, I think communication will be far more reliable when there's much less traffic, so the chances of this are slim.

Use a configurable backoff strategy for re-transmits: Do not simply re-transmit supervised commands every 500ms (or less). Instead, increase the delay on each attempt, e.g. 500 -> 1000 -> 1500 -> ... up to a maximum. Users should be able to configure this in order to fine-tune the behavior for their network. For example, Ring devices let users configure the following parameters:

Supervision Report Timeout (0.5 - 30s)
Number of Re-Transmit attempts (0-5)
Delay between Re-Transmit attempts (1-60s) or
Backoff Multiplier -> Delay = (Attempt No.) * (this multiplier) + random delay
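
A minimal sketch of such a backoff, following the formula above with an added upper limit. The parameter names and default values are assumptions for illustration, not taken from any actual device firmware.

```python
import random


def retransmit_delay_ms(attempt, multiplier_ms=500, max_delay_ms=5000,
                        max_jitter_ms=0, rng=random):
    """Delay before re-transmit number `attempt` (1-based).

    Delay = (Attempt No.) * multiplier, capped at max_delay_ms, plus a random delay.
    """
    base = min(attempt * multiplier_ms, max_delay_ms)
    # The random part keeps several nodes from re-transmitting in lockstep
    jitter = rng.uniform(0, max_jitter_ms) if max_jitter_ms > 0 else 0
    return base + jitter


print([retransmit_delay_ms(n) for n in range(1, 6)])
# [500, 1000, 1500, 2000, 2500]
print(retransmit_delay_ms(20))  # 5000
```

Exposing `multiplier_ms`, `max_delay_ms` and the number of attempts as configuration parameters would let users tune the behavior to the size of their network.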

Maybe: Make Supervision usage configurable: Let users decide which reports should be sent using Supervision (and which should simply wait for Nonce Reports).

Choose sane defaults for reporting configuration: Many device defaults (e.g. for power meters) are tuned for "realtime" responses. This may be fine in small networks, but becomes a huge problem as the network grows. I've compiled my recommendations on this here in more detail.
