xyzroe / XZG

Unifies the best innovations from previous Zigbee gateway projects into a single, comprehensive solution
https://xzg.xyzroe.cc/
GNU General Public License v3.0
124 stars 27 forks source link

Frequent device reboots. HamGeek HMG-01 Plus POE #108

Open codinv opened 2 months ago

codinv commented 2 months ago

Describe the bug HamGeek HMG-01 Plus POE reboots every day

Device information image

{ "network-config": { "wifiEnbl": false, "wifiSsid": "", "wifiPass": "", "wifiDhcp": true, "wifiIp": "::", "wifiMask": "255.255.255.0", "wifiGate": "::", "wifiDns1": "1.1.1.1", "wifiDns2": "8.8.8.8", "wifiPwr": 78, "wifiMode": 1, "ethEnbl": true, "ethDhcp": true, "ethIp": "::", "ethMask": "255.255.255.0", "ethGate": "::", "ethDns1": "1.1.1.1", "ethDns2": "8.8.8.8" }, "vpn-config": { "wgEnable": false, "wgLocalIP": "::", "wgLocalSubnet": "255.255.255.0", "wgLocalPort": 33333, "wgLocalGateway": "0.0.0.0", "wgLocalPrivKey": "", "wgEndAddr": "", "wgEndPubKey": "", "wgEndPort": 0, "wgAllowedIP": "0.0.0.0", "wgAllowedMask": "0.0.0.0", "wgMakeDefault": true, "wgPreSharedKey": "", "hnEnable": false, "hnJoinCode": "", "hnHostName": "XZG-15B3", "hnDashUrl": "default" }, "mqtt-config": { "enable": true, "server": "192.168.xx.xx", "port": 1883, "user": "xxxxxx", "pass": "xxxxxx", "topic": "HMG-01Plus", "updateInt": 60, "discovery": true, "reconnectInt": 15 }, "system-config": { "disableWeb": false, "webAuth": false, "webUser": "", "webPass": "", "fwEnabled": false, "fwIp": "0.0.0.0", "serialSpeed": 115200, "socketPort": 6638, "tempOffset": 13, "disableLedUSB": false, "disableLedPwr": false, "refreshLogs": 1, "hostname": "HMG-01Plus", "timeZone": "Europe/Moscow", "ntpServ1": "pool.ntp.org", "ntpServ2": "time.google.com", "nightMode": false, "startHour": "23:00", "endHour": "07:00", "workMode": 0, "zbRole": 1, "zbFw": "20240710", "updHour": "02:00", "updDays": "7", "autoIns": false } }

To Reproduce turn on and wait, the device will reboot itself within 24 hours

Additional context ==== and much earlier ===

[2024-09-08 20:14:30] info: zh:zstack:znp: Socket error [2024-09-08 20:14:30] info: zh:zstack:znp: Port closed [2024-09-08 20:14:30] info: zh:zstack:znp: closing [2024-09-08 20:14:30] error: z2m: Adapter disconnected, stopping [2024-09-08 20:14:30] info: z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"offline"}'

[2024-09-09 20:29:45] info: zh:zstack:znp: Socket error [2024-09-09 20:29:45] info: zh:zstack:znp: Port closed [2024-09-09 20:29:45] info: zh:zstack:znp: closing [2024-09-09 20:29:45] error: z2m: Adapter disconnected, stopping [2024-09-09 20:29:45] info: z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"offline"}'

[2024-09-10 20:29:41] info: zh:zstack:znp: Socket error [2024-09-10 20:29:41] info: zh:zstack:znp: Port closed [2024-09-10 20:29:41] info: zh:zstack:znp: closing [2024-09-10 20:29:41] error: z2m: Adapter disconnected, stopping [2024-09-10 20:29:41] info: z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"offline"}'

[2024-09-11 21:04:08] info: zh:zstack:znp: Socket error [2024-09-11 21:04:08] info: zh:zstack:znp: Port closed [2024-09-11 21:04:08] info: zh:zstack:znp: closing [2024-09-11 21:04:08] error: z2m: Adapter disconnected, stopping [2024-09-11 21:04:08] info: z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"offline"}'

[2024-09-12 21:02:22] info: zh:zstack:znp: Socket error [2024-09-12 21:02:22] info: zh:zstack:znp: Port closed [2024-09-12 21:02:22] info: zh:zstack:znp: closing [2024-09-12 21:02:22] error: z2m: Adapter disconnected, stopping [2024-09-12 21:02:22] info: z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"offline"}'

xyzroe commented 1 month ago

Try to disable cron based update availability check

codinv commented 1 month ago

Okay, I just updated the firmware and set "Disabled" in the update check interval But, I would like to note that earlier I had already configured an update only on Sunday, and at two o'clock in the morning, but the device reboots daily and a few hours earlier. Anyway, thank you for the update, I'll let you know if it will help with my problem

codinv commented 1 month ago

@xyzroe does not help :( 2024-09-14 18:28:10 - update firmware to 20240914 and reboot device 2024-09-14 18:29:49 - set disable check update and reboot device 2024-09-14 21:01:28 - independent reboot of the device

image image

codinv commented 1 month ago

and again

[2024-09-15 14:39:07] info: zh:zstack:znp: Socket error [2024-09-15 14:39:07] info: zh:zstack:znp: Port closed [2024-09-15 14:39:07] info: zh:zstack:znp: closing [2024-09-15 14:39:07] error: z2m: Adapter disconnected, stopping [2024-09-15 14:39:07] info: z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"offline"}'

[2024-09-15 17:36:56] info: zh:zstack:znp: Socket error [2024-09-15 17:36:56] info: zh:zstack:znp: Port closed [2024-09-15 17:36:56] info: zh:zstack:znp: closing [2024-09-15 17:36:56] error: z2m: Adapter disconnected, stopping [2024-09-15 17:36:56] info: z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"offline"}'

untergeek commented 1 month ago

I have the exact same problem on my UZG-01, spontaneously rebooting with no real explanation why. I see that the uptime matches roughly from when the zigbee2mqtt logs report the Socket error.

UPDATE (2024-09-19): I made a few edits to the configuration on each of my devices (I have 2) and hit "Save" and "Reboot Now" less than 30 seconds apart on each of them. Something is DEFINITELY weird here because they both rebooted at effectively the same amount of elapsed time:

These devices have totally different devices connected to them. So far in my migration, one only has 2 devices, the other 11. I ran these equivalent commands at the same time in two terminal windows, side by side (up arrow, press enter).

kubectl -n z2m-black get all
NAME                READY   STATUS    RESTARTS         AGE
pod/zigbee2mqtt-0   1/1     Running   15 (3m20s ago)   20h

and

kubectl -n z2m-white get all
NAME                READY   STATUS    RESTARTS        AGE
pod/zigbee2mqtt-0   1/1     Running   8 (3m45s ago)   20h

Literally, 25 seconds apart, almost exactly the time between my manual restarts (the upper one had several manual restarts, hence the 15 vs. 8).

So, what gives? What waits somewhere between 6 and 8 hours and then just insta-kills the TCP connections to the device, causing zigbee2mqtt to display, Socket error and restart?

untergeek commented 1 month ago

Oh, and cron checks are disabled.

xyzroe commented 1 month ago

Something is DEFINITELY weird here because they both rebooted at effectively the same amount of elapsed time:

What are you trying to say? What is wrong? You click reboot - you have reboot.

So, what gives? What waits somewhere between 6 and 8 hours and then just insta-kills the TCP connections to the device, causing zigbee2mqtt to display, Socket error and restart?

When the device is rebooting the TCP server on XZG kills itself. It's ok.

But we need to figure out why your device is rebooting. Are you using DCHP on your device? Could you try to switch to static IP?

untergeek commented 1 month ago

What I was saying is that the non-forced reboots happened at the same time after each box was manually rebooted. I thought that to be rather odd.

I will try to set the IPs manually. And I will disable WiFi and go wired-only, just to isolate.

codinv commented 1 month ago

But we need to figure out why your device is rebooting. Are you using DCHP on your device? Could you try to switch to static IP?

I have a static IP by MAC address for HMG-01 on my router, but the device still reboots, sometimes even more than once a day. A few days ago I moved all zigbee devices to another router, but that didn't help either, there were several reboots again. It seems to me that reboots occur when the device tries to access some resource on the external network, but cannot do so. If you completely block the device from accessing the Internet, reboots still occur. Is it possible to somehow organize logging on the device to find out the exact reason for the reboot?

xyzroe commented 1 month ago

I have a static IP by MAC address for HMG-01 on my router, but the device still reboots, sometimes even more than once a day.

No, I'm talking about real static IP on device. Not on DHCP server

xyzroe commented 1 month ago

It seems to me that reboots occur when the device tries to access some resource on the external network, but cannot do so.

it is possible, but if this is the case, then when you turn off full access to the Internet, the device will not be able to turn on, because during startup, the presence of updates is checked.

xyzroe commented 1 month ago

Is it possible to somehow organize logging on the device to find out the exact reason for the reboot?

Of course, project is open source, anyone can make a such option.

xyzroe commented 1 month ago

disable WiFi and go wired-only

It's the first thing that you must to do in case of any problems.

xyzroe commented 1 month ago

I know of a case where a special DHCP server configuration causes the device to reboot every time a DHCP lease is updated. Compare reboot frequency and lifetime of DHCP entry in your router

untergeek commented 1 month ago

disable WiFi and go wired-only

It's the first thing that you must to do in case of any problems.

I was using both wired and wireless. Each was on a different subnet. Now wired has a statically assigned IP address (not via DHCP) and WiFi is off. Will report back findings when I see the next reboot (or if I don't see one).

untergeek commented 1 month ago

I know of a case where a special DHCP server configuration causes the device to reboot every time a DHCP lease is updated. Compare reboot frequency and lifetime of DHCP entry in your router

Interesting. My DHCP server's default-lease-time is 7200 seconds, or 2 hours, with the max-lease-time set to 86400 seconds (1 day). The reboots were not happening every 2 hours, but more frequently than once per day. Is that special DHCP server configuration 100% guaranteed to trip each time. 🤷‍♂️

I guess we'll wait and see in my case.

codinv commented 1 month ago

then when you turn off full access to the Internet, the device will not be able to turn on, because during startup, the presence of updates is checked.

when the internet is full turned off for only this device, it boots normally. the only thing I changed was to assign my own NTP server so that the time would be set correctly on the device

Compare reboot frequency and lifetime of DHCP entry in your router

lease 25200 seconds (every 7 hours) and since reboots are mostly once every 24 hours, it turns out that this is not the reason

Of course, project is open source, anyone can make a such option.

hmm.. this looks very difficult =)

untergeek commented 1 month ago

So, status report. Both of my zigbee2mqtt instances spontaneously rebooted at least one time. However, this is what I see today:

NAME                READY   STATUS    RESTARTS      AGE
pod/zigbee2mqtt-0   1/1     Running   20 (4d ago)   5d19h

and

NAME                READY   STATUS    RESTARTS      AGE
pod/zigbee2mqtt-0   1/1     Running   10 (4d ago)   5d19h

Static wired IP and no WiFi seem to be more stable. I did not add or remove any devices in that timeframe, or any other activity besides just "running." I will be adding another round of devices soon, however, so we will see if that activity changes anything.

I will continue to monitor.

rsdes commented 1 month ago

I experience the same issue, Reboots after 4h till 24h uptime.

Gerätemodell: HG POE Plus FW-Revision: 20241001 HW-Version: ESP32-D0WD-V3

Static IP, no WiFi, no Update Check, local NTP Server, local DNS Server, not one Zigbee device connected just idling.

The Console shows the following output, maybe someone can read something from this output.

` E (43887268) wifi_init_default: netstack cb reg failed with 12308 E (43891258) wifi_init_default: netstack cb reg failed with 12308 Guru Meditation Error: Core 0 panic'ed (Unhandled debug exception). Debug exception reason: Stack canary watchpoint triggered (Tmr Svc) Core 0 register dump: PC : 0x40091fa3 PS : 0x00060b36 A0 : 0x8008f72d A1 : 0x3ffcf570
A2 : 0x3ffd580c A3 : 0xb33fffff A4 : 0x0000abab A5 : 0x00060b23
A6 : 0x00060b20 A7 : 0x0000cdcd A8 : 0xb33fffff A9 : 0xffffffff
A10 : 0x00000003 A11 : 0x00060b23 A12 : 0x00060b20 A13 : 0x0000003f
A14 : 0x007d580c A15 : 0x003fffff SAR : 0x00000004 EXCCAUSE: 0x00000001
EXCVADDR: 0x00000000 LBEG : 0x4008b4a5 LEND : 0x4008b4b5 LCOUNT : 0xfffffffe

Backtrace: 0x40091fa0:0x3ffcf570 0x4008f72a:0x3ffcf5b0 0x4008f7ec:0x3ffcf5f0 0x4008552d:0x3ffcf610 0x40085639:0x3ffcf640 0x4010d0da:0x3ffcf660 0x4010c357:0x3ffcf680 0x4010b905:0x3ffcf6a0 0x4008b085:0x3ffcf6c0 0x40089926:0x3ffcf6e0 0x400899b0:0x3ffcf700 0x40089e9d:0x3ffcf730 0x401a7697:0x3ffcf760 0x401a2b73:0x3ffcf780 0x401a2cf1:0x3ffcfa90 0x401b4fc5:0x3ffcfac0 0x4013a6da:0x3ffcfaf0 0x4013a777:0x3ffcfb20 0x4011ad1d:0x3ffcfba0 0x4011ad7e:0x3ffcfbc0 0x400f0504:0x3ffcfbe0 0x400d9f3a:0x3ffcfc00 0x400dda9e:0x3ffcfc30 0x400e0f0b:0x3ffcfc50 0x40091b43:0x3ffcfce0

ELF file SHA256: 2e8b942b1a999a93

Rebooting... ets Jul 29 2019 12:21:46`

Can I increase the Debug Level to have more output?

Thanks for your efforts.

untergeek commented 1 month ago

wifi_init_default: netstack cb reg failed with 12308

That seems odd, if WiFi is disabled. What's being initialized?

rsdes commented 1 month ago

wifi_init_default: netstack cb reg failed with 12308

That seems odd, if WiFi is disabled. What's being initialized?

I agree, but thats what I see in the logs.

Never touched the WIFi settings. image

xyzroe commented 1 month ago

Thanks for feedback. Already found this bug, trying to fix.

turbolooser commented 1 month ago

Would this bug also affect my zigbee-gw-lan HW? I do see super frequent restarts also... :-( Any hint other than statis ip , disabling update check etc?