zigpy / zha

Zigbee Home Automation
Apache License 2.0

Misleading "Zigbee channel 25 utilization is 91.06%!" message should be removed from logs #51

Open alexruffell opened 3 weeks ago

alexruffell commented 3 weeks ago

I keep seeing the warning "Zigbee channel X utilization is YY.YY%!" in my logs, which I find misleading and of no real use as currently implemented. I advocate for the removal of this warning, as it likely causes confusion and unnecessary concern for users trying to improve their systems.

Like many others, while investigating Zigbee-related issues, I noticed this warning and attempted to move to another Zigbee channel. Changing the channel did not improve the mesh's reliability, and I continued to receive the same warning for the new channel. I suspect that the utilization percentage includes my mesh's Zigbee traffic, which is not an issue that needs addressing. I would prefer to be alerted to excessive traffic (messages on the mesh), which could lead to high channel utilization, as it would lead me to investigate the mesh for chatty and/or unruly devices (giving me a list of chatty devices over time would be awesome too!).

The warning also states "If you are having problems joining new devices, are missing sensor updates, or have issues keeping devices joined, ensure your coordinator is away from interference sources such as USB 3.0 devices, SSDs, WiFi routers, etc." This is correct, but it implies that high utilization is due to interference from such sources. While this may sometimes be the case, it is not always true, and this assumption can be misleading. To investigate further, I purchased a low-cost spectrum analyzer called TinySA Ultra to survey my spectrum.

In the screen capture below, you can see that channel 25 (graph centered on ch25, 2.475 GHz) has a very low noise floor with no activity. The yellow trace shows instantaneous activity and the red trace shows the maximum signal hold (past and present activity). I ran the TinySA in this mode for several minutes and confirmed that channel 25 was clear for my mesh network.
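For anyone reproducing this survey, the center frequency to dial into the analyzer follows from the IEEE 802.15.4 channel plan for the 2.4 GHz band (channels 11 to 26, spaced 5 MHz apart starting at 2405 MHz). A minimal sketch; the function name is made up for illustration:

```python
def zigbee_channel_center_mhz(channel: int) -> int:
    """Return the center frequency in MHz of a 2.4 GHz 802.15.4 channel."""
    if not 11 <= channel <= 26:
        raise ValueError(f"2.4 GHz Zigbee channels are 11-26, got {channel}")
    # Channel 11 sits at 2405 MHz; each later channel is 5 MHz higher.
    return 2405 + 5 * (channel - 11)


for ch in (20, 25):
    print(f"channel {ch}: {zigbee_channel_center_mhz(ch)} MHz")
```

Channel 25 comes out at 2475 MHz, which matches the 2.475 GHz center used for the capture.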

image

I moved my mesh network from channel 20 to channel 25 and monitored it with the TinySA for 20 minutes while waiting for the mesh network to settle on the new channel. Some sleepy devices might not have joined yet, but the bulk of the traffic clearly moved, as seen in the following screen capture:

image

Despite ch20 being free and clear, the equally high utilization warnings now point to channel 25, which is misleading and not useful for troubleshooting the mesh network.

As far as I know, Zigbee channel utilization typically includes all radio frequency activity on the channel: the noise floor, interference from other devices, and the actual Zigbee traffic. The utilization percentage therefore reflects the total level of activity on the channel, not just the Zigbee traffic. Unless the coordinator can remove the mesh network's traffic, leaving only undesired activity on the channel, the metric is of little to no use and more often simply misleading.

The channel utilization metric would make sense for every Zigbee channel except the one in use by ZHA for its mesh. For example, when a user is picking a channel to form a new Zigbee network or to relocate an existing one, it may be useful to alert the user to high utilization on the chosen channel before allowing its use. This is the only scenario where this metric appears to make sense.
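To make the proposal concrete, here is a rough sketch of a check that ignores the channel the mesh already occupies. The function name, the dictionary input, and the 75% threshold are all hypothetical and chosen only for illustration; this is not how ZHA or zigpy implements the warning.

```python
def channels_worth_warning_about(utilization_by_channel, current_channel, threshold=0.75):
    """Return busy channels, skipping the one the mesh itself is using.

    utilization_by_channel maps channel number -> utilization in [0, 1].
    The threshold is an assumed value, not one taken from ZHA.
    """
    return sorted(
        channel
        for channel, utilization in utilization_by_channel.items()
        if channel != current_channel and utilization >= threshold
    )


# Made-up scan results; channel 25 carries the mesh's own traffic.
scan = {11: 0.82, 15: 0.10, 20: 0.05, 25: 0.9106}
print(channels_worth_warning_about(scan, current_channel=25))
```

With these made-up numbers only channel 11 is flagged; channel 25 is skipped even though it is the busiest, because that activity is presumed to be the mesh itself.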

puddly commented 3 weeks ago

> Unless the coordinator can remove the mesh network's traffic leaving only undesired activity on the channel, the metric is of little to no usefulness, and more often simply misleading.

It'll be removed in the future when we implement something more useful, like CCA transmit failure counting. Unfortunately, the current implementation is not ideal, but the annoyance of false positives is heavily outweighed by the usefulness of knowing when a small network's problems are external.

Some alternatives to explore in the future:

  1. Reducing the scan interval to the absolute minimum and performing 10-20 of them in a row, to build a distribution (instead of just a firmware average). It may be possible to extract a noise floor from there.

  2. Checking to see if the signal strength of the closest router (i.e. one with the highest signal strength) is close to the energy scan result and silence the warning in that case.

  3. See if it's possible to get the firmware to report the noise floor that sits outside of 802.15.4 traffic.
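The first two alternatives could look roughly like the sketch below. It assumes the repeated scan results and the router signal strength are both available as dBm values, which may not match what the firmware actually reports; the function names, the 10% cutoff, and the 6 dB margin are all hypothetical.

```python
def estimate_noise_floor(readings, fraction=0.1):
    """Estimate the noise floor as a low percentile of repeated readings."""
    if not readings:
        raise ValueError("need at least one reading")
    ordered = sorted(readings)
    # Pick a value near the bottom of the distribution, not the single minimum.
    index = min(len(ordered) - 1, int(len(ordered) * fraction))
    return ordered[index]


def warning_is_self_inflicted(scan_energy_dbm, strongest_router_rssi_dbm, margin_db=6):
    """Guess whether the scan mostly measured the mesh's own nearest router."""
    return abs(scan_energy_dbm - strongest_router_rssi_dbm) <= margin_db


# Made-up readings: mostly quiet, with a few bursts of traffic around -60 dBm.
readings = [-95, -94, -60, -96, -58, -93, -95, -61, -94, -97, -59, -95]
print(estimate_noise_floor(readings))
print(warning_is_self_inflicted(-60, -58))
```

Using a low percentile instead of the minimum keeps a single spurious reading from defining the floor, at the cost of needing enough scans for the quiet samples to dominate.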

puddly commented 3 weeks ago

For fun, here are the results of varying the scan duration exponent while adjusting the number of scans to compensate. This should keep the scan time roughly similar while collecting more data. This is done on my home Zigbee network. It looks like the radio reports the maximum for each scan interval, which effectively removes the noise floor when higher exponents are used on a network with significant traffic:

output
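The effect can be illustrated with a toy model. This is a simulation on a synthetic trace, not the measured data: it assumes each scan reports the maximum sample in its window, models the window as 2**exponent samples (a simplification of the real scan duration), and uses made-up levels of -95 dBm for the noise floor and -60 dBm for traffic bursts.

```python
NOISE_FLOOR_DBM = -95  # assumed quiet level
BURST_DBM = -60        # assumed level while the mesh is transmitting


def synthetic_trace(length=64, burst_every=8):
    """Quiet channel with one traffic burst every `burst_every` samples."""
    return [
        BURST_DBM if i % burst_every == 3 else NOISE_FLOOR_DBM
        for i in range(length)
    ]


def scan_maxima(trace, exponent):
    """Split the trace into windows of 2**exponent samples and keep each max."""
    window = 2 ** exponent
    return [
        max(trace[start:start + window])
        for start in range(0, len(trace) - window + 1, window)
    ]


trace = synthetic_trace()
for exponent in range(4):
    maxima = scan_maxima(trace, exponent)
    # Fewer, longer scans cover the same total time as many short ones.
    print(f"exponent={exponent} scans={len(maxima)} lowest={min(maxima)} dBm")
```

In this toy trace the lowest reported value stays at -95 dBm for exponents 0 to 2, but at exponent 3 every window contains a burst, so all eight scans report -60 dBm and the noise floor is no longer visible.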