xZetsubou / hass-localtuya

🔹 A Home Assistant integration to handle Tuya devices locally (a fork of localtuya)
https://xzetsubou.github.io/hass-localtuya/
GNU General Public License v3.0

Endless "initialization loop" after restart #227

Closed: Lurker00 closed this issue 5 months ago

Lurker00 commented 5 months ago

LocalTuya Version

3.2.5.2b7

Home Assistant Version

2024.5.3

Environment

What happened?

After some recent changes (moved HA from a VM to a dedicated armbian box, added some devices, HA updates to 2024.5.*), a weird problem arose: after an HA restart, LocalTuya falls into an endless loop of the initial sequence, trying to connect to all my devices to obtain initial entity values. During this time, HA becomes useless, because all the devices become unavailable every few seconds.

If I don't touch the system, it may continue like this for a long time. It seems that adding or reconfiguring a device (even one not related to LocalTuya, e.g. via the direct BLE integration) stops the loop.

I applied your #226 fix before the last restart (the one with the attached log), just for information.

I'm not marking this issue as a bug because I have no idea what caused the problem to start: the HA update, the newly added devices, or a combination of events...

I'm attaching the current diagnostics of all my devices, and a log which ends at the moment the loop stopped. Note that the device "[bf5...wrf]" is disabled (a gateway with 2 DPs). Yet the log contains plenty of records like

2024-05-11 09:24:30.688 DEBUG (MainThread) [custom_components.localtuya.number] [bf5...wrf] Entity up_channel (DP 1) - Not restoring as restore on reconnect is disabled for this entity and the entity has an initial status or it is not a passive entity
2024-05-11 09:24:30.688 DEBUG (MainThread) [custom_components.localtuya.number] [bf5...wrf] Entity down_channel (DP 2) - Not restoring as restore on reconnect is disabled for this entity and the entity has an initial status or it is not a passive entity

which should only appear once, at the very beginning. That's why I call it an "endless initialization loop".

Steps to reproduce.

Don't know

Relevant log output

No response

Diagnostics information.

home-assistant.log config_entry-localtuya-944c1616039a1d1ec6b9745179bff74e (5).json

xZetsubou commented 5 months ago

That should happen for every added entity, though I'm still not sure whether that is the reason.

I don't understand how "LocalTuya" could cause this; there is no waiting on initialization, it only waits for one device to connect or fail, then continues. Is there a notification message at the bottom left, "waiting for LocalTuya to start up"? Maybe the full HA logs would help.

Lurker00 commented 5 months ago

I don't understand how "LocalTuya" could cause this

Neither do I. That's why I didn't use the "Bug" prefix in the title. But you can see in the log that LocalTuya connects to all the devices over and over again, all with successful results! Something must be triggering this process.

Is there a notification message at the bottom left, "waiting for LocalTuya to start up"?

No. It showed the message that HA had started. Moreover, I have an automation that sends a Telegram message on the HA start event, and I received it.

Maybe the full HA logs would help.

Should I restart HA with "default: debug" for the logger, or something else? That log was recorded with the default warning level. Writing huge full logs may affect the system by itself, so if you can suggest the minimum set of "logger" options you need, you're welcome!
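
For example, would a minimal logger section in configuration.yaml like this be enough? (Targeting only the localtuya loggers is my guess at the minimum:)

    logger:
      default: warning
      logs:
        custom_components.localtuya: debug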

xZetsubou commented 5 months ago

Yeah, enable the debug level for logs and then post the full HA logs; I find it difficult to figure out what might cause the infinite loop in localtuya.

Lurker00 commented 5 months ago

Here it is: home-assistant.log.zip

First I noticed Bluetooth-related messages right before each LocalTuya "restart" and disabled the Bluetooth integration, to no avail. Then I tried reconfiguring a device (which had stopped the loop before), but, alas, it continued.

I have now restarted HA without debug logging and can't stop this process: LocalTuya restarts connecting to the devices every 30 seconds, making them unavailable during that time.

It probably wouldn't be a big problem if the devices kept their statuses instead of becoming unavailable. But as it is, no automation works.

Lurker00 commented 5 months ago

I've managed to get HA working after a restart. Beforehand, I uninstalled Bluetooth support in armbian and deleted all Bluetooth-related integrations and devices. But I can't be sure what caused this misbehavior the first time: I also removed some other stuff, which made no difference at the time. For now I'm afraid to continue the investigation. It has already been running for 10 minutes without a glitch, and I'd prefer to keep it running for a while.

Lurker00 commented 5 months ago

Please note that this sequence

2024-05-12 11:15:57.088 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Trying to connect to КЛ: Zigbee Bulb...
2024-05-12 11:15:57.101 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf2...ath] Trying to connect to ГСЛ: Zigbee Bulb...
2024-05-12 11:15:57.104 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfe...7yz] Trying to connect to ГС: Motion Sensor...
2024-05-12 11:15:57.109 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfa...hoi] Trying to connect to К: 3x Переключатель...
2024-05-12 11:15:57.113 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf0...tdq] Trying to connect to ГС: Выключатель...
2024-05-12 11:15:57.118 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf1...kn4] Trying to connect to ГС: 3x Переключатель...
2024-05-12 11:15:57.121 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf1...xw5] Trying to connect to К: Выключатель...
2024-05-12 11:15:57.127 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf1...z1q] Trying to connect to ГС: HP&L Сенсор...
2024-05-12 11:15:57.132 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf0...hwo] Trying to connect to К: M&L Сенсор...
2024-05-12 11:15:57.137 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf8...mlc] Trying to connect to К: Роутер...
2024-05-12 11:15:57.141 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfe...tf0] Trying to connect to ГС: Роутер...
2024-05-12 11:15:57.146 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf6...stj] Trying to connect to Тест: Smart Plug+...
2024-05-12 11:15:57.151 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfb...cbv] Trying to connect to К: Реле 4х...
2024-05-12 11:15:57.155 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfa...dbm] Trying to connect to ГСЛ: Реле 4х...
2024-05-12 11:15:57.161 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf2...aky] Trying to connect to ГС: 2х Переключатель...
2024-05-12 11:15:57.165 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bff...ns9] Trying to connect to ПЛ: Диммер...
2024-05-12 11:15:57.170 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf0...epk] Trying to connect to П: Выключатель...
2024-05-12 11:15:57.174 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfc...b03] Trying to connect to П: Motion Sensor...

which repeats over and over, consists only of devices behind one gateway: Zigbee only, wired (Ethernet). I also have one BLE-only wireless gateway and one Zigbee+BLE wireless gateway. For the latter, I sometimes see such sequences as well, but only 2-3 loops, never more. I also have a number of WiFi devices, and none of them shows this misbehaviour.

So LocalTuya re-initializes not all the devices, but only the devices connected to this one particular gateway (bf504372fcc2cc8b10fwrf, 192.168.0.31). Based on this, I doubt it is an HA-related or BLE-related problem...

The problem started within a short period (some hours) after I added 3 new devices to this gateway and created automations for them. Everything worked for a few hours before that.

The new devices are: a dimmer (bffb4b344babe630c4lns9), a scene switch (bf0c1d4f7f66ddd4b2lepk) and a PIR motion sensor (bfcaf22d28414bacb8cb03). For the dimmer ("tdq"), LocalTuya has no built-in support for its DPs (switch and brightness control), but I added them manually.

fabien4444 commented 5 months ago

Hello, I should add that only my sub-device is unavailable! I have 3 other Tuya main devices which are working well.

Lurker00 commented 5 months ago

I restored core 2024.5.1 from a backup, added my new devices back, and the problem started again.

Lurker00 commented 5 months ago

In my experience, I can stop the loop by disabling or enabling a LocalTuya device. It does not seem to matter which device, e.g. it can be the BLE gateway.

xZetsubou commented 5 months ago

Can you try disabling sub-devices only, to see whether that changes anything?

edit: I'll also refactor the setup entry function later, since HA made changes to it in 2024.5

Lurker00 commented 5 months ago

Can you try disabling sub-devices only,

Yes, I did. Moreover, once I disabled (by mistake) and then enabled just one entity (the dimmer's switch), and that stopped the loop.

But it is a random process. I just woke up and see in the log that some hours ago the loop started again, running every second, and stopped by itself after 9 iterations. Then, around 30 minutes later, it started again every 30 seconds and stopped after 14 iterations. Since then it hasn't happened.

~I had a thought that it might be caused by "too many" Zigbee devices bound to one gateway. The aforementioned gateway now has 17 active devices, 3 of which were added recently. When I reverted back to core 2024.5.1, I lost them, and the problem did not occur after that HA restart. But it returned after I added them back, along with the manual steps to configure the DPs of the dimmer and PIR sensor.~ - I disabled more than 3 devices, to no avail.

Lurker00 commented 5 months ago

I changed some messages from debug to info, and added "Sub-device name disconnected" to def disconnected(self) here: https://github.com/xZetsubou/hass-localtuya/blob/298d561d72844822f5c24b10574bd4c061588041/custom_components/localtuya/coordinator.py#L455-L459

I also changed HEARTBEAT_SUB_DEVICES_INTERVAL to 60 and HEARTBEAT_INTERVAL to 20. After a restart, the problem disappeared within a few minutes. Here is the log: home-assistant_2024-05-14T19-35-07.947Z.log
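
For reference, the overrides themselves are trivial (assuming both constants live at module level in pytuya; values in seconds):

    HEARTBEAT_INTERVAL = 20
    HEARTBEAT_SUB_DEVICES_INTERVAL = 60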

Please note:

  1. 192.168.0.37 is a low-power WiFi T&H sensor. Ignore it.
  2. The period has changed to 60 seconds.
  3. For each device, from the very beginning, there are two "Trying to connect" records one second apart. Is that correct?
  4. Only for the Zigbee devices connected to that gateway, there is a disconnect callback almost immediately after a successful connect (on subsequent attempts they disconnect after HEARTBEAT_SUB_DEVICES_INTERVAL). Unfortunately, you don't log disconnect events, so they are missing from the full logs I submitted before.

Here is an extract for one device:

2024-05-14 22:08:30.667 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Trying to connect to КЛ: Zigbee Bulb...
2024-05-14 22:08:31.127 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Trying to connect to КЛ: Zigbee Bulb...
2024-05-14 22:08:31.225 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Started heartbeat for КЛ: Zigbee Bulb
2024-05-14 22:08:31.226 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Success: connected to КЛ: Zigbee Bulb
2024-05-14 22:08:31.909 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Sub-device КЛ: Zigbee Bulb disconnected.
2024-05-14 22:08:35.796 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Trying to connect to КЛ: Zigbee Bulb...

I'm quite happy that it finally "calmed down" (I had to add one more device this evening and then couldn't stop the loop) and I can go to sleep.

By the way, I've updated to 2024.5.0b1.

xZetsubou commented 5 months ago

Actually, I'm adjusting the sub-device connection process.

  1. Sub-devices don't need to be waited on during the connection process, unlike parent devices.
  2. The heartbeat for sub-devices should start only after the sub-devices have connected, not while they are still in the connecting process. ^ This can be fixed by moving the asyncio sleep above the subdevices_query call in start_sub_devices_heartbeat in the pytuya module (see the sketch below).
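
A minimal sketch of that reordering, assuming the loop body looks roughly like this (the signature is simplified; the real method lives in pytuya):

    import asyncio

    HEARTBEAT_SUB_DEVICES_INTERVAL = 30  # illustrative value, seconds

    async def start_sub_devices_heartbeat(subdevices_query):
        # Sleep first, then query: the first subdev_online_stat_report goes
        # out only after sub-devices have had time to finish connecting.
        while True:
            await asyncio.sleep(HEARTBEAT_SUB_DEVICES_INTERVAL)
            await subdevices_query()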

I think this is what's causing the loop:

  1. The sub-devices connect "successfully" even though they aren't really connected, because the gateway may still have their statuses cached.
  2. The heartbeat for sub-devices starts in the meantime, while the sub-devices are still in the connecting process.
  3. The sub-devices HB may report that a device is actually not connected, and disconnected() will be called. This is still during the first connecting process.
  4. Now that the disconnected() function has been called, localtuya immediately retries the connection, and that may cause the loop? This is probably why there are 2 connection attempts at the start.

Lurker00 commented 5 months ago

I've added logging to def _setup_dispatcher(self) -> MessageDispatcher here: https://github.com/xZetsubou/hass-localtuya/blob/d64fc6820f0939daae4605d5776f80d7ff21834f/custom_components/localtuya/core/pytuya/__init__.py#L832-L836

and yes: LocalTuya disconnects here! home-assistant_2024-05-15T07-32-30.556Z.log

2. This can be fixed by moving the asyncio sleep above the subdevices_query call in start_sub_devices_heartbeat in the pytuya module.

It does not solve this particular problem: LocalTuya continues disconnecting devices. Nor does it prevent the double connections at the start. Here is the log: home-assistant_2024-05-15T07-40-22.231Z.log

I've changed the condition at line 835 there to

                        if cid in off_devices:

and finally my system is up and running 😄 home-assistant_2024-05-15T07-47-01.946Z.log

I believe the logic there was simply wrong.
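
In other words (a simplified sketch; the surrounding dispatcher code here is my assumption, and only the off_devices condition is the actual change):

    def handle_subdev_report(report: dict, sub_devices: dict) -> None:
        data = report.get("data", {})
        off_devices = data.get("offline", [])
        for cid, device in sub_devices.items():
            # Disconnect only devices explicitly reported offline; a cid
            # missing from a (possibly partial) "online" list may simply
            # show up in a later message.
            if cid in off_devices:
                device.disconnected()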

But even with both fixes, it still connects twice to each device within 1 second.

My opinion is that LocalTuya needs a deep architectural change here: create one, and only one, connection to each gateway, and use it for all the sub-devices attached to that gateway. A gateway, as I understand it, has only a limited ability to serve many connections.
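
Something like this toy sketch (GatewayConnection and all the names here are hypothetical, just to illustrate the idea):

    import asyncio

    class GatewayConnection:
        # Hypothetical shared connection to one gateway.
        def __init__(self, host: str):
            self.host = host
            self._open = True

        @classmethod
        async def connect(cls, host: str) -> "GatewayConnection":
            return cls(host)  # real code would open the TCP session here

        def is_open(self) -> bool:
            return self._open

    _connections: dict[str, GatewayConnection] = {}
    _lock = asyncio.Lock()

    async def get_gateway_connection(host: str) -> GatewayConnection:
        # All sub-devices behind `host` share one connection instead of
        # each opening its own, which a gateway may not be able to serve.
        async with _lock:
            conn = _connections.get(host)
            if conn is None or not conn.is_open():
                conn = await GatewayConnection.connect(host)
                _connections[host] = conn
            return conn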

Lurker00 commented 5 months ago

Look at the following two answers from my gateway:

2024-05-15 10:34:41.728 INFO (MainThread) [custom_components.localtuya.core.pytuya] [bf5...whd] Sub-Devices States Update: {'reqType': 'subdev_online_stat_report', 'data': {'online': ['a4c138e51d9c71a2', 'a4c13825abcf5819', 'a4c138ba5b2576e5', 'a4c13825798525df', 'a4c138de3eb1f3f3', 'a4c1380ad40d076d', 'a4c1380e8e46d3c4', '00124b002d8e6a68', 'a4c138717b6a9928', '00124b00272127b7', 'a4c138f46f838ee0', 'a4c1389aeca68fbf', 'a4c138a112e22ebe', 'a4c1384bb9cd4e1d', 'a4c1380f8554dc9d'], 'offline': ['a4c1382e58833a9c', 'a4c138c67c8e556e']}}. a4c138b97aac2ae0 disconnected
2024-05-15 10:34:41.983 INFO (MainThread) [custom_components.localtuya.core.pytuya] [bf5...whd] Sub-Devices States Update: {'reqType': 'subdev_online_stat_report', 'data': {'online': ['a4c138c4b14360be', 'a4c138d43c2db604', 'a4c138b97aac2ae0'], 'offline': []}}. a4c1389aeca68fbf disconnected

It seems that if a gateway has many connected devices, it reports the online/offline statuses of its sub-devices in a series of messages. I hit the problem exactly when I added the 3 aforementioned new devices (note the second message: exactly 3 cids!). I thought disabling them in LocalTuya would help, but the cause was on the gateway side.

So, checking against the 'offline' list is always safe.

To check against 'online', several answers would have to be combined into one. But I doubt LocalTuya needs that at all, because if a sub-device were removed from the gateway (the only reason for it to be missing from both lists), LocalTuya would find out by other means anyway, e.g. by checking against the device list from the cloud.
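
If combining were ever needed, it is a simple union (a sketch; the dict shapes are taken from the log records above):

    def merge_reports(reports: list[dict]) -> tuple[set[str], set[str]]:
        online: set[str] = set()
        offline: set[str] = set()
        for report in reports:
            data = report.get("data", {})
            online.update(data.get("online", []))
            offline.update(data.get("offline", []))
        # A cid seen online in any message overrides a stale offline entry.
        return online, offline - online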

My suggestion is to remove the check against on_devices as the fix for this problem.

But that double connection attempt at the start had better be fixed as well!

Lurker00 commented 5 months ago

One more thought: when you send the subdev_online_stat_report request to the gateway, it may answer with more than one message! Currently LocalTuya expects only one reply. The second reply is read later and can be interpreted the wrong way.
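
One way to handle it would be to keep reading until the series ends, e.g. (a hypothetical sketch; the queue-based interface is my assumption, not the actual pytuya API):

    import asyncio

    async def collect_subdev_reports(queue: asyncio.Queue, settle: float = 1.0) -> list[dict]:
        reports: list[dict] = []
        while True:
            try:
                # Treat a quiet period as the end of the reply series.
                reports.append(await asyncio.wait_for(queue.get(), timeout=settle))
            except asyncio.TimeoutError:
                return reports

The collected replies could then be merged as in the sketch above.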

Lurker00 commented 5 months ago

I'd suggest changing the log messages in _make_connection to include both the name and the host, like this:

self.debug(f"Trying to connect to {name} ({self._device_config.host})...", force=True)
self.debug(f"Started heartbeat for {name} ({self._device_config.host})", force=True)
self.debug(f"Success: connected to {name} ({self._device_config.host})", force=True)

so that it is easier to see which WiFi device produced a message, and which gateway is used for a sub-device. The same should apply to warnings, errors and exceptions.

Lurker00 commented 5 months ago

FYI, with the fix suggested in https://github.com/xZetsubou/hass-localtuya/issues/227#issuecomment-2111825467, it runs very well. I have a device with a weak signal, and its offline state was handled correctly by the sub-devices heartbeat several times. So I believe it has been tested enough.

xZetsubou commented 5 months ago

The issue with this approach is that if a device has been removed from the gateway, it won't show as "offline"; rather, it will disappear from both the on and off device lists.

Lurker00 commented 5 months ago

That is not a big issue! If required, it can be detected from the cloud information. The user can delete the device if they are concerned.

Moreover, I would not disconnect a sub-device that appears in the offline list, especially if it is used as a fake gateway. The reason is that it does no harm to use an offline device as a fake gateway, but disconnecting it disconnects all the other sub-devices, and then it falls into a connect-disconnect loop (I saw it in the logs). In brief, the current implementation destroys functionality for all sub-devices if the first one in the list goes offline.

My opinion is that the online and offline lists could be used to avoid connecting to offline devices until they appear online, but that is not implemented.

The current handling of subdev_online_stat_report results looks more harmful than useful to me.

xZetsubou commented 5 months ago

I know, and I'm still looking for a better solution if possible. I don't have a lot of sub-devices to test the query command on. Does the command always report all the sub-devices you have, or does it sometimes miss some of them?

Lurker00 commented 5 months ago

Yes, currently the gateway always replies with the two messages shown in https://github.com/xZetsubou/hass-localtuya/issues/227#issuecomment-2112020068, and the log records list all the devices attached to the gateway. I have two powered-off devices (disabled in LocalTuya) which are listed in the first message. Together, the two online lists contain all my online devices. Of course, when I temporarily turn some of them off, or in case of a weak signal, both messages still show all the devices, with the correct division into online and offline.

I presume that with even more devices, there would be more messages.

xZetsubou commented 5 months ago

Can you give 2024.5.0b2 a try?

Lurker00 commented 5 months ago

Can you give 2024.5.0b2 a try?

I updated to it, but changed some log messages from debug to info for better visibility. It started, and twice restarted, with no problems. Let's see how it does in the long run.

I see you didn't implement TimeoutError handling in start_sub_devices_heartbeat here 😞 I re-implemented it, because, from the logs, I found it useful.

xZetsubou commented 5 months ago

There will be no more false timeouts for the sub-devices HB like before, unless the device actually doesn't respond at all, so you can try it without adding timeout retries. In fact, your tests give me more information; I don't have as many sub-devices as you 😺, so I'd be glad if you tested with the normal release.

Lurker00 commented 5 months ago

I turned off a dimmer, and within a few seconds received the corresponding message in the log. So the new design detects offline devices and so far works with no false disconnect events.

Currently, LocalTuya logs are either debug or nothing. I found it useful to have more information at the info level: https://github.com/xZetsubou/hass-localtuya/blob/e1852fdafc06e6df49b12e2d983a5ee9fed458e0/custom_components/localtuya/coordinator.py#L258 https://github.com/xZetsubou/hass-localtuya/blob/298d561d72844822f5c24b10574bd4c061588041/custom_components/localtuya/core/pytuya/__init__.py#L925

I inserted the following after this line: https://github.com/xZetsubou/hass-localtuya/blob/e1852fdafc06e6df49b12e2d983a5ee9fed458e0/custom_components/localtuya/coordinator.py#L463

        if self._node_id:
            self.info(f"Disconnected: {self._device_config.name} {self._device_config.host} {self._node_id}")
        else:
            self.info(f"Disconnected: {self._device_config.name} {self._device_config.host}")

(currently the first of these messages is logged as a warning, which is why the if is needed)

I also changed one more debug->warning; I wrote a comment on the commit.

Lurker00 commented 5 months ago

There will be no more false timeouts

Yesterday I had a big problem with my internet provider for a whole day. All my automations are local, with no direct cloud involvement. But every device with a direct connection to the cloud (Wi-Fi devices and gateways) reports its state, or state changes for its sub-devices. When they can't connect or send data to the cloud, they become very unresponsive! Lots of timeouts, lots of Zigbee sub-devices going offline because they failed to reach their gateway (which was simply busy), and very slow reaction (if any) to events.

So don't presume that a device either must answer within 5 seconds or can't answer at all. It may simply be busy with other tasks.

But I see you've rewritten it even further. Let's see later. For now - sorry! - I prefer to have TimeoutError handled in this particular place for the sake of robustness, based on my experience. But I'll check the logs to see whether timeouts ever happen.
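
What I have in mind is roughly this (a sketch under my assumptions; the retry count and timeout are illustrative):

    import asyncio

    async def query_with_retries(query, attempts: int = 3, timeout: float = 5.0) -> bool:
        # Returns True if the sub-device query succeeded within the retry budget.
        for _ in range(attempts):
            try:
                await asyncio.wait_for(query(), timeout=timeout)
                return True
            except asyncio.TimeoutError:
                # A busy gateway (e.g. one cut off from the cloud) may simply
                # be slow, so a single timeout should not mean "disconnected".
                continue
        return False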

xZetsubou commented 5 months ago

Technically, if there's a timeout error, the subdevices_query loop will break and not raise anything! What probably caused the mass timeout errors is the status command being called during the connect process.

Lurker00 commented 5 months ago

In addition to https://github.com/xZetsubou/hass-localtuya/issues/227#issuecomment-2118742776, I uncommented https://github.com/xZetsubou/hass-localtuya/blob/298d561d72844822f5c24b10574bd4c061588041/custom_components/localtuya/core/pytuya/__init__.py#L292-L293

because otherwise I could not understand from the log what was happening.

Lurker00 commented 5 months ago

Can you give 2024.5.0b2 a try

In general, it runs very well.

But I happened to restart HA many times, and I noticed that sometimes, right after a restart, some entities are in an unavailable state. The corresponding devices were online and there were connections to them, but their entities were unavailable and didn't come alive even after several minutes. Another restart was always free of this problem. I also saw this problem once with 3.2.5.2b7. The LocalTuya log level was info (with my corresponding changes to the source code), so I have nothing to support this report.

Lurker00 commented 5 months ago

FYI, I installed 2024.5.0b4 an hour ago, patched for more info in the logs, and it started and has been running smoothly so far. I tested a sub-device going offline and back online, and it worked as expected.

github-actions[bot] commented 5 months ago

This issue was closed because it was resolved in the release: []()