zigbee2mqtt / hassio-zigbee2mqtt

Official Zigbee2MQTT Home Assistant add-on
https://www.zigbee2mqtt.io
Apache License 2.0

Zigbee2MQTT Proxy periodically stops logging and begins to gradually increase RAM usage. #580

Closed VladimirTuzovGitHub closed 7 months ago

VladimirTuzovGitHub commented 8 months ago

Description of the issue

I am using SLZB-06P and SLZB-06P7 coordinators, with Zigbee2MQTT installed in a Proxmox LXC container using the tteck installation script (Proxmox VE Helper-Scripts). Initially I used the SLZB-06P coordinator and the Zigbee2MQTT Proxy add-on in Home Assistant.

The Zigbee network works stably with 65 devices and the info log runs constantly. However, after about 12 to 16 hours the log stops with the message “No information to display”, and Zigbee2MQTT on the Proxmox server starts using more and more RAM and CPU. RAM usage grows from 200 MB to more than 1 GB within 10 hours, and Zigbee2MQTT becomes slower. Restarting Zigbee2MQTT helps, but the problem recurs every day.

I thought the problem was related to the SLZB-06P coordinator, but changing its firmware did not change the situation. When the new SLZB-06P7 coordinator was released, I bought it, hoping to get rid of the problem. I moved the Zigbee network to the SLZB-06P7, but the problem repeats in exactly the same way every day and I still have to restart Zigbee2MQTT.

I cannot find the cause. There are no errors in the logs at the time the log stops or before it. I have also gone through all the available Proxmox server logs, and there is no hint of errors there either. I hope someone can help me understand what the problem might be so that I can solve it. I can provide any additional information that is needed.
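To put numbers on the memory growth described above, the resident memory of the Zigbee2MQTT process could be sampled periodically. The following is only a minimal Python sketch, assuming a Linux container where `/proc/<pid>/status` is readable; the function names are hypothetical and are not part of Zigbee2MQTT.

```python
import re
from typing import List, Optional, Tuple


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract VmRSS (resident set size, in KiB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current RSS of a process on Linux; the pid is supplied by the caller."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_rss_kib(handle.read())


def growth_mib_per_hour(samples: List[Tuple[float, int]]) -> float:
    """Average growth between the first and last (unix_seconds, rss_kib) sample."""
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (rss1 - rss0) / 1024 / ((t1 - t0) / 3600)
```

Calling `read_rss_kib` once a minute and storing the result with a timestamp would show when the growth starts relative to the moment the log stops. The reported growth from 200 MB to more than 1 GB within 10 hours corresponds to roughly 80 MiB per hour.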

Addon version

0.2.0

Platform

Proxmox 8.1.4, Zigbee2MQTT in a Proxmox LXC container, Home Assistant OS 2024.2.5

configuration.yaml

frontend:
  port: 8080
mqtt:
  base_topic: zigbee2mqtt_slzb06
  server: mqtt://192.168.13.84:1883
  user: *****
  password: ***********
  keepalive: 60
  reject_unauthorized: true
  version: 5
serial:
  port: tcp://192.168.13.81:6638
  baudrate: 115200
  disable_led: false
  adapter: zstack
advanced:
  cache_state: true
  cache_state_persistent: true
  cache_state_send_on_startup: true
  adapter_delay: 0
  transmit_power: 20
  channel: 11
  pan_id: 17920
  network_key:
    - 10
    - 130
    - 24
    - 80
    - 133
    - 143
    - 189
    - 194
    - 26
    - 113
    - 60
    - 103
    - 58
    - 52
    - 224
    - 23
  availability_blocklist: []
  availability_passlist: []
  homeassistant_legacy_entity_attributes: false
  legacy_api: true
  legacy_availability_payload: false
  last_seen: ISO_8601_local
  elapsed: false
  output: json
  log_level: debug
  timestamp_format: YYYY-MM-DD HH:mm:ss
  log_directory: /opt/zigbee2mqtt/data/log/%TIMESTAMP%
  log_file: log.txt
  log_rotation: true
  log_output:
    - file
  log_symlink_current: false
  log_syslog:
    host: localhost
    port: 514
    protocol: udp4
    path: /dev/log
    pid: process.pid
    facility: local0
    localhost: localhost
    type: '5424'
    app_name: Zigbee2MQTT
    eol: /n
homeassistant:
  discovery_topic: homeassistant
  status_topic: hass/status
  legacy_entity_attributes: false
  legacy_triggers: true
permit_join: false
device_options:
  legacy: false
external_converters:
  - msh.pzem.3f.js
  - ts0601.js
availability:
  active:
    timeout: 10
  passive:
    timeout: 360

Logs of the issue (if applicable)

No response

VladimirTuzovGitHub commented 8 months ago

image

v1k70rk4 commented 8 months ago

Since 1.36, I've also experienced that I need to restart it once a day, but today the program stopped 3 times. After a simple shutdown and restart, it works properly.

jamiellie commented 8 months ago

I have the same issue after updating. RAM usage increases until it eventually crashes. (Screenshot attached.)

v1k70rk4 commented 8 months ago

Yesterday I really lost it because no matter what I did, the addon kept crashing. I have two almost identical apartments, each with its own HomeAssistant, and the other one had no issues at all.

House1: 35 devices
House2: 33 devices
Out of these, 32 devices are the same, and there's no device on one side that isn't on the other.

I spent 2-3 hours ruling things out to find the actual difference.

And this is what I saw:
House1: Coordinator: zStack3x0 rev: 20220219
House2: Coordinator: zStack3x0 rev: 20230507

After seeing this, I also checked at my dad's place. The apartment is a bit different, but the basics are similar: there are no issues there, everything is up to date, and the coordinator also has 20220219 installed.

The Coordinator in all three cases is Sonoff_Zigbee_3.0_USB_Dongle_Plus.

Now I have also flashed 20220219 at home, and everything has been working properly with 1.36.0 for 12 hours straight without any errors in the log (there were always a few minor issues before).

Maybe it will help you too, it's worth a try. What's strange is that I've been using the 20230507 version without any issues for almost 7 months.

VladimirTuzovGitHub commented 8 months ago

Found some errors:

Mar 11 17:55:55 zigbee2mqtt systemd-networkd-wait-online[1150]: Timeout occurred while waiting for network connectivity.
Mar 11 17:55:55 zigbee2mqtt apt-helper[1148]: E: Sub-process /lib/systemd/systemd-networkd-wait-online returned an error code (1)

Then, after a while, the devices begin to turn off with a message in the log:

Warning 2024-03-11 19:51:05 Failed to ping '0x54ef4410009275cf' (attempt 1/2, Read 0x54ef4410009275cf/1 genBasic(["zclVersion"], {"timeout":10000,"disableResponse":false,"disableRecovery":true,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Data request failed with error: 'Timeout' (9999)))

And gradually the same for other devices.

When I try to pair, I receive the following response in the log:

Info 2024-03-11 19:53:59 Zigbee: allowing new devices to join.
Error 2024-03-11 19:53:59 Request 'zigbee2mqtt_slzb06/bridge/request/permit_join' failed with error: 'SREQ '--> ZDO - mgmtPermitJoinReq - {"addrmode":15,"dstaddr":65532,"duration":254,"tcsignificance":0}' failed with status '(0x11: BUFFER_FULL)' (expected '(0x00: SUCCESS)')'
Info 2024-03-11 19:53:59 MQTT publish: topic 'zigbee2mqtt_slzb06/bridge/response/permit_join', payload '{"data":{},"error":"SREQ '--> ZDO - mgmtPermitJoinReq - {\"addrmode\":15,\"dstaddr\":65532,\"duration\":254,\"tcsignificance\":0}' failed with status '(0x11: BUFFER_FULL)' (expected '(0x00: SUCCESS)')","status":"error","transaction":"blfgh-1"}'

If I restart Z2M, everything immediately returns to normal, as if nothing had happened.
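To see how widespread these failures are before a restart, a saved log could be scanned for the two symptoms quoted above. This is a rough Python sketch, assuming the log lines contain the same `Failed to ping '<address>'` and `BUFFER_FULL` text as in the excerpts; `summarize_failures` is a hypothetical helper, not a Zigbee2MQTT tool.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches the device address in lines such as: Failed to ping '0x54ef4410009275cf' (...)
PING_FAILURE = re.compile(r"Failed to ping '(0x[0-9a-fA-F]+)'")
BUFFER_FULL = "BUFFER_FULL"


def summarize_failures(lines: Iterable[str]) -> Tuple[Counter, int]:
    """Count failed pings per device address and lines mentioning BUFFER_FULL."""
    pings: Counter = Counter()
    buffer_full = 0
    for line in lines:
        match = PING_FAILURE.search(line)
        if match:
            pings[match.group(1)] += 1
        if BUFFER_FULL in line:
            buffer_full += 1
    return pings, buffer_full
```

Passing an open log file to `summarize_failures` gives a per-device count of failed pings, which may help tell a single unreachable device apart from a coordinator that has stopped responding to everything.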

Koenkk commented 8 months ago

@VladimirTuzovGitHub it looks like the FW crashes. We are currently working on a more stable FW; you can try this one: znp_LP_CC1352P7_4_tirtos_ticlang_20240308_10tx.hex.zip, it's for the SLZB-06P7

VladimirTuzovGitHub commented 8 months ago

@Koenkk Hello Koenkk. Thank you for responding to my problem! I just flashed my SLZB-06P7 with the firmware you suggested. I flashed it with Flash Programmer 2 and everything went without errors, then I launched Z2M and immediately set the log level to debug. I will now monitor the module for several days and collect logs. I will report here on how it works and post the logs if the problem occurs again. I am attaching the Flash Programmer 2 log from the time of flashing:

>Initiate access to target: COM4 using 2-pin cJTAG.
>Reading file: D:/Vladimir/Documents/Home Assistant/Smlight SLZB-06P7/Firmware/znp_LP_CC1352P7_4_tirtos_ticlang_20240308_10tx.hex/znp_LP_CC1352P7_4_tirtos_ticlang_20240308_10tx.hex.
>Start flash erase ...
>Erase finished successfully.
>Start flash programming ...
>Programming finished successfully.
>Start flash verify ...
>Skip verification of unassigned page: 22.
>Skip verification of unassigned page: 23.
>Skip verification of unassigned page: 24.
>Skip verification of unassigned page: 25.
>Skip verification of unassigned page: 26.
>Skip verification of unassigned page: 27.
>Skip verification of unassigned page: 28.
>Skip verification of unassigned page: 29.
>Skip verification of unassigned page: 30.
>Skip verification of unassigned page: 31.
>Skip verification of unassigned page: 32.
>Skip verification of unassigned page: 33.
>Skip verification of unassigned page: 34.
>Skip verification of unassigned page: 35.
>Skip verification of unassigned page: 36.
>Skip verification of unassigned page: 37.
>Skip verification of unassigned page: 38.
>Skip verification of unassigned page: 39.
>Skip verification of unassigned page: 40.
>Skip verification of unassigned page: 41.
>Skip verification of unassigned page: 42.
>Skip verification of unassigned page: 43.
>Skip verification of unassigned page: 44.
>Skip verification of unassigned page: 45.
>Skip verification of unassigned page: 46.
>Skip verification of unassigned page: 47.
>Skip verification of unassigned page: 48.
>Skip verification of unassigned page: 49.
>Skip verification of unassigned page: 50.
>Skip verification of unassigned page: 51.
>Skip verification of unassigned page: 52.
>Skip verification of unassigned page: 53.
>Skip verification of unassigned page: 54.
>Skip verification of unassigned page: 55.
>Skip verification of unassigned page: 56.
>Skip verification of unassigned page: 57.
>Skip verification of unassigned page: 58.
>Skip verification of unassigned page: 59.
>Skip verification of unassigned page: 60.
>Skip verification of unassigned page: 61.
>Skip verification of unassigned page: 62.
>Skip verification of unassigned page: 63.
>Skip verification of unassigned page: 64.
>Skip verification of unassigned page: 65.
>Skip verification of unassigned page: 66.
>Skip verification of unassigned page: 67.
>Skip verification of unassigned page: 68.
>Skip verification of unassigned page: 69.
>Skip verification of unassigned page: 70.
>Skip verification of unassigned page: 71.
>Skip verification of unassigned page: 72.
>Skip verification of unassigned page: 73.
>Skip verification of unassigned page: 74.
>Skip verification of unassigned page: 75.
>Skip verification of unassigned page: 76.
>Skip verification of unassigned page: 77.
>Skip verification of unassigned page: 78.
>Skip verification of unassigned page: 79.
>Skip verification of unassigned page: 80.
>Skip verification of unassigned page: 81.
>Skip verification of unassigned page: 82.
>Skip verification of unassigned page: 83.
>Skip verification of unassigned page: 84.
>Skip verification of unassigned page: 85.
>Skip verification of unassigned page: 86.
>Page: 0 verified OK.
>Page: 1 verified OK.
>Page: 2 verified OK.
>Page: 3 verified OK.
>Page: 4 verified OK.
>Page: 5 verified OK.
>Page: 6 verified OK.
>Page: 7 verified OK.
>Page: 8 verified OK.
>Page: 9 verified OK.
>Page: 10 verified OK.
>Page: 11 verified OK.
>Page: 12 verified OK.
>Page: 13 verified OK.
>Page: 14 verified OK.
>Page: 15 verified OK.
>Page: 16 verified OK.
>Page: 17 verified OK.
>Page: 18 verified OK.
>Page: 19 verified OK.
>Page: 20 verified OK.
>Page: 21 verified OK.
>Page: 87 verified OK.
>Verification finished successfully.
>Reset target ...
>Reset of target successful.

image

VladimirTuzovGitHub commented 8 months ago

@Koenkk It stopped a couple of times during the day. Yesterday I rebooted everything again and decided to keep observing; now it has stopped again, and there seems to be nothing in the logs indicating errors. I am sending you these logs for analysis. Apparently I still need to enable zigbee-herdsman debug logging. Can you help me understand how to enable it in a Proxmox LXC container, so that I can give you those logs as well? log1.txt log2.txt

Koenkk commented 8 months ago

Where does the TEST1 / TEST2 logging come from? It's not standard z2m logging

VladimirTuzovGitHub commented 8 months ago

@Koenkk I have an EARU DIN smart relay (TS0601, TZE204_davzgqq0), and for it I use the converter you suggested in an issue I opened earlier: https://gist.github.com/Koenkk/4477f1d8bce028ec5654833168607e1d. So the test lines come from this device: it goes through its threshold tests line by line. I noticed that this device sends a lot of messages every second, and I would like to get rid of all the messages related to the tests and thresholds. I don't need the thresholds at all; I only use the device as a relay with an energy monitor and need nothing else from it. I just don't know how to limit it.

Yesterday I also thought that this device might be the cause, so today I split the network across two Z2M instances and two coordinators. The devices that send many messages frequently (including the EARU DIN smart relay TS0601 TZE204_davzgqq0; they are all the same type of DIN rail module with an energy monitor) went to the SONOFF ZB Dongle-P, since they previously worked stably on it. The rest, mostly quiet devices, stayed on the SLZB-06P7. More than 6 hours have passed and the network is still working stably. I will continue to monitor it.
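One way to check which devices are the chatty ones is to count publish lines per topic in a saved log. The sketch below is only an illustration in Python, assuming the log contains lines with `MQTT publish: topic '<topic>'` as in the excerpts earlier in this thread; `busiest_topics` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Captures the topic from lines such as: MQTT publish: topic 'base/device', payload '...'
PUBLISH = re.compile(r"MQTT publish: topic '([^']+)'")


def busiest_topics(lines: Iterable[str], top: int = 5) -> List[Tuple[str, int]]:
    """Return the topics with the most publish lines, most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        match = PUBLISH.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(top)
```

Running this over a few hours of log from each coordinator would show whether the DIN rail modules really dominate the traffic.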

VladimirTuzovGitHub commented 8 months ago

@Koenkk It worked without failure for 4 days, but an hour ago today the log stopped. The network seems to be working, but nothing is being logged, and it is not clear what is frozen. There are no errors in the logs, and it is not clear what could have caused the failure. I am attaching the latest logs; maybe they will help you understand something. log1.txt log2.txt
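Since no error appears before the log goes quiet, it may help to locate the silent periods by their timestamps. The following Python sketch assumes each log line carries a `YYYY-MM-DD HH:mm:ss` timestamp, as set by `timestamp_format` in the configuration above; `find_gaps` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Matches timestamps such as 2024-03-11 19:53:59 anywhere in a line.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def find_gaps(lines: Iterable[str], min_gap: timedelta) -> List[Tuple[datetime, datetime]]:
    """Return (last_seen, resumed) pairs where consecutive timestamps are at least min_gap apart."""
    gaps: List[Tuple[datetime, datetime]] = []
    previous = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue  # lines without a timestamp are ignored
        current = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
        if previous is not None and current - previous >= min_gap:
            gaps.append((previous, current))
        previous = current
    return gaps
```

This only finds gaps that are followed by another log line. If the log stopped and never resumed, the last timestamp in the file has to be compared with the current time instead.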

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days