zigpy / bellows

A Python 3 project to implement EZSP for EmberZNet devices
GNU General Public License v3.0

zha lights work OK following homeassistant.restart, but stop responding within hours #124

Closed wixoff closed 5 years ago

wixoff commented 6 years ago

I posted something about this in the Community Forum, but I hope it will get better visibility here.

At the moment zha network has seven light bulbs, three door sensors, and one switched outlet. The sensors are reliable, but the lights and switch become unavailable (and unresponsive to commands) after a few hours.

SETUP

Currently the lights are all Osram/Sylvania Lightify (U.S. versions): five are RGBW A19 and two are Tunable White A19. When they work, they work great, with almost instant response. The switched outlet is an IRIS v2 plug (reads as CentraLite 3210-L, with the odd dual z-wave/zigbee radios).

I’m not sure what the issue is, because ZHA had been pretty darn reliable for a good number of months. (I also have a Tradfri 1000lm A19 bulb that I was able to include months ago, but I have since reset it, nuked my zigbee.db, and attempted to re-add it, and it will no longer show up.) Sometimes repeating the request via the UI several times in a row will cause the bulb to wake up and respond, and eventually even that stops producing any response whatsoever.

I also have three Visonic MCT-340E door/window sensors spread around a fairly large house. Even after the bulbs (and the IRIS plug) quit responding, these sensors still work and are very reliable. The built-in temperature sensor works too, on one of them; the others never change their temperature and one of those is stuck at 32F.

ERRORS

Here’s what an error in the log looks like after the lights stop responding - this represents an attempt to turn off a light via the hass UI:

Mon Jul 09 2018 00:10:40 GMT-0400 (EDT)
Error executing service <ServiceCall light.turn_off: entity_id=['light.osram_lightify_a19_rgbw_00a3342d_3']>
Traceback (most recent call last):
File "/usr/src/app/homeassistant/core.py", line 1021, in _event_to_service_call
await service_handler.func(service_call)
File "/usr/src/app/homeassistant/components/light/__init__.py", line 362, in async_handle_light_service
await light.async_turn_off(**params)
File "/usr/src/app/homeassistant/components/light/zha.py", line 127, in async_turn_off
await self._endpoint.on_off.off()
File "/usr/local/lib/python3.6/site-packages/zigpy/device.py", line 89, in request
expect_reply=expect_reply,
File "/usr/local/lib/python3.6/site-packages/bellows/zigbee/application.py", line 213, in request
assert sequence not in self._pending
AssertionError

And here is another error, trying to use a scene to turn off four lights (as noted above, the Tradfri light fails because it's currently disconnected):

2018-07-10 07:11:09 WARNING (MainThread) [homeassistant.helpers.state] reproduce_state: Unable to find entity switch.ikea_of_sweden_tradfri_bulb_e26_ws_opal_980lm_fe477406_1
2018-07-10 07:11:09 ERROR (MainThread) [homeassistant.core] Error executing service <ServiceCall light.turn_off: entity_id=['light.osram_lightify_a19_tunable_white_0001d4ac_3', 'light.osram_lightify_a19_tunable_white_0001e6c8_3', 'light.osram_lightify_a19_rgbw_000a381d_3']>
Traceback (most recent call last):
File "/usr/src/app/homeassistant/core.py", line 1021, in _event_to_service_call
await service_handler.func(service_call)
File "/usr/src/app/homeassistant/components/light/__init__.py", line 362, in async_handle_light_service
await light.async_turn_off(**params)
File "/usr/src/app/homeassistant/components/light/zha.py", line 127, in async_turn_off
await self._endpoint.on_off.off()
File "/usr/local/lib/python3.6/site-packages/zigpy/device.py", line 89, in request
expect_reply=expect_reply,
File "/usr/local/lib/python3.6/site-packages/bellows/zigbee/application.py", line 213, in request
assert sequence not in self._pending
AssertionError

There are no other zha-related errors in the log, other than during startup.

I can't imagine this is correct behavior. The five RGBW lights and the IRIS plug are all within 10 feet of the HUSBZB-1 stick, most of them line-of-sight, and yet they drop off just as quickly as the ones further away. The other two Tunable White bulbs are only another 5 feet past the zigbee outlet switch (which should be a router), but they are behind a modern-construction wall (wood studs, wallboard, paint).

As I mentioned in my Community post, I'll update the firmware on the bulbs, but I don't expect improvement because the lights were working pretty well a few hass releases ago.

walthowd commented 6 years ago

A long shot, but try modifying your bellows/zigbee/application.py and set CONFIG_PACKET_BUFFER_COUNT to 0x4b.

From what I have seen the main issue on the Si Labs EM zigbee chip is lack of memory and according to their documentation the PACKET BUFFER COUNT is the number one contributing factor to memory usage.

wixoff commented 6 years ago

Thanks @walthowd - I will give that a try. Are you the same walt that's in the Community thread?

EDIT: application.py edited and hass restarted at 1:45 pm EDT. Let's see what happens!

wixoff commented 6 years ago

Didn't work. Attempted to turn off one of the zigbee bulbs at 2:40 pm EDT and it failed. I don't think there was anything else queued - I just manually turned off one single bulb via the hass UI, waited 5-10 seconds, and it popped itself back on in the UI. Light remained on. Same type of error shows in the log.

I tried again about 15 seconds later, and it worked. Light off. This is typical -- bad reliability first, total loss of functionality later.

wixoff commented 6 years ago

Some more verbose logging, as suggested by user adminiuga in the Community thread:

https://paste.drhack.net/?65a1595f54bc5328#7mPFttNdjg7ha59Zic0cuBa37Vr6+ghmHKf98Ng9LWU=

(that's a pastebin-like link, for those who might be worried)

These are excerpts, the first one being an error during setup, and the other two being errors that appear when a remote button is pressed to turn a light on. So I have excerpted sections from the log entry that begin where the remote button press is recognized and continue all the way up through the error and traceback.

wixoff commented 6 years ago

Oops, inadvertently closed.

wixoff commented 6 years ago

Here's the same verbose log but in a non-expiring link:

https://paste.drhack.net/?9f4a4ed3411e7b05#mhgZ6H4ofzqLlxxDW2RJijQv/8+QYafpjlvGTyXF5fM=

When my zigbee devices aren't working, all I need to do is restart hass. Something has regressed, and per the Community thread, one user had to revert to hass 0.67.1 to get it working again. That's about the time I started having trouble too.

Uniqen commented 6 years ago

I've got the same issue as @wixoff. Using a bunch of tradfri and Telegesis ETRX357USB-LRS+8M (flashed with EZSP firmware).

2018-07-30 21:25:36 ERROR (MainThread) [homeassistant.core] Error executing service <ServiceCall light.turn_on: entity_id=['light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed472da_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed548f1_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed5a563_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed7f30a_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed978ee_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed98735_1']>
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/homeassistant/core.py", line 1021, in _event_to_service_call
    await service_handler.func(service_call)
  File "/usr/lib/python3.6/site-packages/homeassistant/components/light/__init__.py", line 360, in async_handle_light_service
    await light.async_turn_on(**params)
  File "/usr/lib/python3.6/site-packages/homeassistant/components/light/zha.py", line 108, in async_turn_on
    duration
  File "/usr/lib/python3.6/site-packages/zigpy/device.py", line 89, in request
    expect_reply=expect_reply,
  File "/usr/lib/python3.6/site-packages/bellows/zigbee/application.py", line 213, in request
    assert sequence not in self._pending
AssertionError
techdoutdev commented 6 years ago

I downgraded due to all the same issues mentioned here; I've been running 0.67.1 (as @wixoff mentioned above) with none of the issues seen in the later versions. I believe this is just before HA updated to bellows 0.5.2 in HA 0.68.

rcloran commented 5 years ago

Is this a dup of #78 ?

walthowd commented 5 years ago

@rcloran Yes, this is a dupe of #78 in zigpy/zigpy. I think we found the root issue there: asyncio throwing exceptions, which leads to abandoned sequence IDs that eventually fill self._pending. Over time every request will throw an assertion error.

https://github.com/zigpy/zigpy/issues/78

rcloran commented 5 years ago

OK, let’s keep the discussion in one place.

rcloran commented 5 years ago

Oh, I guess the issue is in bellows, not zigpy.

walthowd commented 5 years ago

Sorry, put this in the wrong place twice -- was typing with a toddler on my lap!

Tested out some exception catching this afternoon; still seeing a few abandoned/orphaned sequence IDs in pending, not sure why. Here is the modified request function of zigbee/application.py I'm running:

        aps_frame.groupId = t.uint16_t(0)
        aps_frame.sequence = t.uint8_t(sequence)

        try:
            v = await asyncio.wait_for(self._ezsp.sendUnicast(self.direct, nwk, aps_frame, sequence, data), timeout)
        except asyncio.TimeoutError:
            LOGGER.debug("sendunicast timeout %s:%s:%s", nwk, dst_ep, cluster)
            self._pending.pop(sequence)
            send_fut.cancel()
            if expect_reply:
                reply_fut.cancel()
            raise DeliveryError("Message send failure timeout")

        # Wait for messageSentHandler message
        v = await send_fut
        if expect_reply:
            # Wait for reply
            try:
                v = await asyncio.wait_for(reply_fut, timeout)
            except asyncio.TimeoutError:
                self._pending.pop(sequence)
                raise DeliveryError("Message reply timeout %s" % (sequence, ))
        return v
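
One way to make this kind of cleanup harder to miss is a try/finally around the whole wait, so the pending entry is released on success, timeout, and cancellation alike. A minimal, self-contained sketch of the pattern (hypothetical names; not bellows' actual code):

```python
import asyncio

class DeliveryError(Exception):
    """Stand-in for bellows' delivery failure exception."""

async def request(pending, sequence, reply_fut, timeout=0.05):
    """Wait for a reply, always releasing the sequence ID afterwards."""
    try:
        return await asyncio.wait_for(reply_fut, timeout)
    except asyncio.TimeoutError:
        raise DeliveryError("Message reply timeout %s" % sequence)
    finally:
        # Runs on success, timeout, and cancellation alike, so the
        # sequence ID can never be left behind in `pending`.
        pending.pop(sequence, None)

async def main():
    pending = {42: "in-flight"}
    reply_fut = asyncio.get_running_loop().create_future()  # never resolves
    try:
        await request(pending, 42, reply_fut)
    except DeliveryError:
        pass
    assert 42 not in pending  # the ID was released despite the timeout

asyncio.run(main())
```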

Did another torture test of toggling remote bulbs and many other bulbs, and had a few sequence IDs get stuck again. The first in the below log is sequence 152, which starts on line 56 of the home assistant log:

https://www.dropbox.com/s/fwjq9rafsu7wl4z/stuck.txt?dl=0

By the end of the log there are two other stuck sequences in self._pending (12 and 228) but those are after the sequence IDs have wrapped around at least once.

rcloran commented 5 years ago

132 should fix what looks like the common case of this symptom. I'm fairly unsatisfied that I can't figure out what made it regress.

@walthowd : I can't repro this other case. Let me send you a diff with what will probably be useful logging changes.

walthowd commented 5 years ago

@rcloran I can reproduce the issue fairly reliably with my setup. If you send me the diff I can spin up HA with a stripped down ZHA only config and post the log as well as start looking myself. Thanks.

rcloran commented 5 years ago

@walthowd : Can you repro with https://github.com/rcloran/bellows-rcloran/commit/11f9aec24b2ae1a150eb1f250a3f0b7933bee8da ? I'm mostly interested in where we lose the request. It's on top of 0.7.0, which I will push a release of shortly and update hass with. I haven't tested that logging code much, so there might be bugs just in the logging itself ... should be easy fixes if you run into any problems with it.

walthowd commented 5 years ago

@rcloran Here is the first run through the sequence IDs. Sequence ID 43 stuck. I'll leave it running a bit more to see if there is a pattern:

https://www.dropbox.com/s/8inubip7k624keq/forty-three.txt?dl=0

rcloran commented 5 years ago

@walthowd : OK, looks like we're getting an incomingMessageHandler and no messageSentHandler at all. Thanks, this should be enough to work on. I'll try to take a look at it next weekend. If anyone else wants to take a look, we should probably call _handle_frame_sent from within _handle_reply if send_fut is not done().
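
The suggestion above (calling _handle_frame_sent from _handle_reply when send_fut is not done()) amounts to resolving a future defensively, so a lost messageSentHandler can't strand the request forever. A toy illustration with hypothetical names, not the actual bellows code:

```python
import asyncio

def resolve_send(send_fut, result):
    """Complete the 'message sent' future only if nothing has done so yet.

    If a reply arrives without a preceding messageSentHandler, treating
    the send as implicitly successful keeps the request from hanging.
    """
    if not send_fut.done():
        send_fut.set_result(result)

async def main():
    send_fut = asyncio.get_running_loop().create_future()
    # Reply arrived first: resolve the send future on its behalf.
    resolve_send(send_fut, "sent")
    assert await send_fut == "sent"
    # A later real messageSentHandler is then a harmless no-op
    # instead of raising InvalidStateError.
    resolve_send(send_fut, "duplicate")
    assert send_fut.result() == "sent"

asyncio.run(main())
```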

StephenWetzel commented 5 years ago

EDIT: To anyone having the symptoms I describe here the issue is actually this one: https://github.com/zigpy/bellows/issues/37

Not sure if I'm having the same issue, but my symptoms seem very similar. Devices work fine for a few hours, then suddenly stop. When that happens my logs have an ERROR (MainThread) [homeassistant.core] Timer got out of sync. Resetting and then fill with WARNING (MainThread) [homeassistant.components.sensor] Updating zha sensor took longer than the scheduled update interval 0:00:30.

Turning on debugging shows this message when it happens: DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'. I'm just now noticing that the two recent times it happened, the frame bytes were exactly the same: b'c20251a8bd7e'.

Like I said, I'm not sure if this is the same issue or not, but it also started a few months ago, around the time power consumption monitoring was added. I also have CentraLite 3210-L switches, and my only other zigbee devices are Sengled bulbs. I can open another issue if you don't think it's related.

Two recent logs: https://pastebin.com/qkNc04ih

https://pastebin.com/zxzWXwRV

techdoutdev commented 5 years ago

@StephenWetzel yup - these are the exact same symptoms I experienced. I'm using the same bulbs and switches as well.

weevilkris commented 5 years ago

I'm getting a similar issue. I have a fairly naked config - the only thing in HA right now is this one zigbee light using the zha component and some auto-discovered Rokus. I'm running Python 3.7.0, bellows 0.7.0, zigpy 0.2.0, on a HUSBZB-1 on macOS High Sierra. Works AWESOME on initial startup, and perhaps for 20-30 minutes, but after some time it dies. If I try to turn the light on/off when it's "dead" I get this error:

2018-09-16 10:49:14 INFO (MainThread) [homeassistant.core] Bus:Handling <Event call_service[L]: domain=light, service=turn_on, service_data=entity_id=light.ge_appliances_zll_light_00031b13_1>
2018-09-16 10:49:14 ERROR (MainThread) [homeassistant.core] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/Users/xxx/.homeassistant/deps/lib/python/site-packages/serial_asyncio/init.py", line 106, in _read_ready
    self._protocol.data_received(data)
  File "/Users/xxx/.homeassistant/deps/lib/python/site-packages/bellows/uart.py", line 64, in data_received
    self.frame_received(frame)
  File "/Users/xx/.homeassistant/deps/lib/python/site-packages/bellows/uart.py", line 86, in frame_received
    self.data_frame_received(data)
  File "/Users/xxxx/.homeassistant/deps/lib/python/site-packages/bellows/uart.py", line 107, in data_frame_received
    self._application.frame_received(self._randomize(data[1:-3]))
  File "/Users/xx/.homeassistant/deps/lib/python/site-packages/bellows/ezsp.py", line 169, in frame_received
    assert expected_id == frame_id
AssertionError

walthowd commented 5 years ago

@weevilkris It's a lot better in the latest 0.7.0 release of bellows. You can upgrade and use --skip-pip when starting Home Assistant to use the latest version.

There are still a few places where sequence IDs are leaking, and @rcloran is still working on them. In the meantime you can use this ugly hack to keep things working. Add the following lines after line 218 in bellows/zigbee/application.py:

        LOGGER.debug("ASSERTDEBUG - sequence=%s", sequence)
        for pending in list(self._pending):
            LOGGER.debug("ASSERTDEBUG - pending=%s", pending)
            diff = abs(sequence - pending)
            if diff > 50 and diff < 100:
                self._pending.pop(pending)
                LOGGER.debug("ASSERTDEBUG-REMOVE - Removing hung pending=%s", pending)

EDIT: Added the list() call to make a copy of the dictionary keys and avoid race issues.
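
One caveat with the diff heuristic: the counter wraps at 256, so abs(sequence - pending) overstates the distance for pairs that straddle the wrap (255 and 1 are really only 2 apart). If the window ever needs to be wrap-aware, a modular distance would look like this (purely illustrative, not part of the hack above):

```python
def seq_distance(a, b, modulus=256):
    """Shortest distance between two wrap-around sequence IDs."""
    d = abs(a - b) % modulus
    return min(d, modulus - d)

# abs(255 - 1) reports 254, but with wrap-around they are only 2 apart.
assert seq_distance(255, 1) == 2
assert seq_distance(10, 200) == 66
```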

weevilkris commented 5 years ago

@walthowd I am in fact using bellows 0.7.0 (manually upgraded myself; I had to in order to get zha to work at all with Python 3.7 -- the wink module also crashes out on startup with a similar bug, but that's another story). Thanks for the quick feedback!

When there's a timeout (I suspect some are due to range issues, but I'm not sure if it's that or something else) I also get these errors that don't seem to be caught:

2018-09-18 09:16:18 ERROR (MainThread) [homeassistant.core] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/homeassistant/helpers/service.py", line 224, in _handle_service_platform_call
    await getattr(entity, func)(**data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/homeassistant/components/light/zha.py", line 127, in async_turn_off
    await self._endpoint.on_off.off()
  File "/Users/xxxx/.homeassistant/deps/lib/python/site-packages/zigpy/device.py", line 89, in request
    expect_reply=expect_reply,
  File "/Users/xxxx/.homeassistant/deps/lib/python/site-packages/bellows/zigbee/application.py", line 245, in request
    v = await asyncio.wait_for(reply_fut, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/tasks.py", line 419, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

Adminiuga commented 5 years ago

I just want to report that I haven't had any pending TSNs since Sep 02 with the new bellows. Prior to that I was seeing around 2 to 5 hanging TSNs.

holelattanuttin commented 5 years ago

@Adminiuga what devices do you have?

Adminiuga commented 5 years ago

@holelattanuttin way too many. Mostly 45857GE & 45856GE wall switches, bunch of Xiaomi aqara temp/pressure/humidity sensors, aqara motion sensor, aqara leak sensors, OSRAM & Ledvance color bulbs. Some color bulbs are regularly powered off from the switch, so there're plenty of DeliveryErrors.

Xiaomi are most problematic for me, as those are very picky to which parent they join, so xiaomi devices would come offline if joined to a wrong parent.

walthowd commented 5 years ago

@Adminiuga Off topic, but I'm still leaking some TSNs and also getting some of the new exception handling in bellows that throws an error from light/zha for turn_off and async_turn_off.

Since you contribute to HA and the ZHA code a bit, would it be too much trouble to extend the logging in light/zha.py to include the entity_id? Currently the error just spits out the bellows error, which makes it pretty hard to follow the logs and find the right entity_id. Changing the logging lines on lines 117 and 129 to something like the following would help quite a bit:

_LOGGER.error("Unable to turn the light (%s) off: %s", self.entity_id, ex)

I'd submit the pull request myself, but have never forked and submitted and haven't signed the CLA, etc.

Adminiuga commented 5 years ago

@walthowd I could author a PR, but no promises it would get accepted upstream. How are you running Hass? Is it hassio docker, hassbian, or anything else? I would recommend starting with git, but that requires some familiarity with the Linux console. Have you also tried yoda-x's zha_new? I think he handles uart resets in his version, which could be the culprit if you're running on a Raspberry Pi.

walthowd commented 5 years ago

@Adminiuga I'd appreciate it if you would have some time, and I think the community would benefit from the better logging. :)

I'm running straight Home Assistant on an iMac with Python 3.6, so the hardware is fairly powerful for what it's being asked to do. I've been testing yoda-x's zha_new off and on, and I'm looking forward to seeing some of its changes get merged into the official release.

Adminiuga commented 5 years ago

@walthowd a few questions:

  1. do you have git installed?
  2. are you running the master or dev branch of home assistant?

I have two branches for you to try: light debug logging against the master tree, and light debug logging against the dev tree. Can you give them a test drive?

walthowd commented 5 years ago

@Adminiuga I'll test those and leave feedback over in the pull request to HA you just submitted, thank you!

techdoutdev commented 5 years ago

FYI - I upgraded to hassio 78.3 and zha was working okay. I was still getting the occasional warnings/errors in the log (unexpected send, timer out of sync), but nothing that caused things to stop working. Updated to 79.1, which had a handful of new zha features, and all devices became unresponsive within hours - just like the issues that started this thread. I'm downgrading to 78.3 again as I need these lights to work, but happy to help provide any info that I can.

wixoff commented 5 years ago

FYI - I upgraded to hassio 78.3 and zha was working okay. I was still getting the occasional warnings/error in the log (unexpected send, timer out of sync) but nothing that caused things to stop working. Updated to 79.1 that had a handful of new zha features and all devices became unresponsive within hours - just like the issues that started this thread.

Same here. I opened this issue way back when, eventually tried the yoda-x zha_new component for a while (which was more robust, but still had bulbs disappear after a while), and now I'm back on a stock version of 0.79.1. I'm losing ALL of my bulbs after just a few hours. The log isn't especially revealing, but when the "Updating zha sensor took longer than the scheduled update interval 0:00:30" warning starts repeating in the log every 30 seconds is about when the bulbs can no longer be controlled.

Here's a log segment (with warnings from other components removed):

https://paste.drhack.net/?2b888c5c4e72b052#GdXNYNuihSukWIduZ4tAUiR2arhs64cKJHHR163+H/k=

It takes about three hours here from startup to zha failure. I don't think anybody operated any zha bulbs in that interval. They are mostly on timed automations.

As I noted in my first post, my zigbee door sensors still work even after all the bulbs quit.

I'll restart with additional logging enabled, but that's all I have for now.

tradiuz commented 5 years ago

To add another datapoint to this. I have 6 Sengled E11-G13 (Sengled Element Classic), and they'll work with HA following a reboot, but stop responding after less than 8 hours. None of the zha commands from HA work once it goes into the failed state.

It's been an issue in all the recent builds (0.70+ it feels like), and I'm currently running 0.79.3 in docker with bellows 0.7.0

wixoff commented 5 years ago

Sigh. Since my prior comment, I installed 0.79.3 and for several days had the exact same problem. Worked after restart, then all bulbs failed after some variable number of hours.

Then I went away for a week-long business trip, and it has been working flawlessly the whole time. I don’t get it. Maybe I emit 2.4 GHz interference; I guess I’ll find out in a few hours when I get back.

Upthread I noted that my door sensors continued to work even when the bulbs dropped off one at a time in older versions. Now, when all of the bulbs go unreachable at the same time, that’s still true. The door sensors are quite reliable, even though they use CR2032 coin cells and are much farther away from the USB stick, and with walls in the way.

I’ll continue to watch this, I suppose.

code-in-progress commented 5 years ago

Just to chime in, I've been having the same exact issue on 0.80.3 as well. I've had the same observation where after a few hours, almost everything on the zha side just dies without rhyme or reason. However, also, as @wixoff noted, most of my zigbee sensors continue to operate.

valexi7 commented 5 years ago

Just to chime in, I've been having the same exact issue on 0.80.3 as well. I've had the same observation where after a few hours, almost everything on the zha side just dies without rhyme or reason. However, also, as @wixoff noted, most of my zigbee sensors continue to operate.

I also have this same problem with 0.80.3. I think it started when I upgraded Hassbian from 0.80.0 to 0.80.3; I don't recall having this issue with the old version.

Symptoms: When I log in to HA I see the zha lights all offline. Logs say:

Log Details (ERROR)
Wed Oct 24 2018 17:20:30 GMT+0300 (Eastern European Summer Time)

Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "/srv/homeassistant/lib/python3.5/site-packages/homeassistant/helpers/service.py", line 224, in _handle_service_platform_call
    await getattr(entity, func)(**data)
  File "/srv/homeassistant/lib/python3.5/site-packages/homeassistant/components/light/zha.py", line 152, in async_turn_off
    res = await self._endpoint.on_off.off()
  File "/srv/homeassistant/lib/python3.5/site-packages/zigpy/device.py", line 89, in request
    expect_reply=expect_reply,
  File "/srv/homeassistant/lib/python3.5/site-packages/bellows/zigbee/application.py", line 213, in request
    assert sequence not in self._pending
AssertionError

Restarting the Home Assistant server fixes the issue until, after a couple of hours, this error occurs again.

code-in-progress commented 5 years ago

Just to chime in, I've been having the same exact issue on 0.80.3 as well. I've had the same observation where after a few hours, almost everything on the zha side just dies without rhyme or reason. However, also, as @wixoff noted, most of my zigbee sensors continue to operate.

I have also this same problem with 0.80.3. I think it started, when I upgraded Hassbian from 0.80.0 to 0.80.3. I don't recall having this issue with old version.


Restarting Home Assistant server fixes the issue until after couple of hours this error occurs.

I set up a cron to restart HA every 2 hours for now, to keep things running until a fix comes out.

Out of curiosity, do you have both lights and sensors on your zha instance or just lights? I had just ported over a bunch of Hue bulbs a few weeks ago (right around when 0.80.0 came out) and didn't have any issues. For now, I've also ported them back over to my Hue hub just to keep them running (wife acceptance factor was dropping quickly).

valexi7 commented 5 years ago

@code-in-progress I have only 3 Ikea lights in zha instance. Then I have one power switch, which is Z-Wave plus and it's working fine, while zha lights are "offline".

untergeek commented 5 years ago

Just to add an additional datapoint or two, here's my experience.

I use the HUSBZB-1 stick. Z-Wave works perfectly.

Terminology: I will use the term "Native" to mean HA running as a python process directly (venv or otherwise), and "HASS" or "Docker" to mean a Dockerized solution.

Datapoints

Zigbee devices in my home

Native (not Docker/HASS)

When HA is running in Ubuntu 16.04 or 18.04, either with the system python or a compiled 3.6 version, in or out of a venv, Zigbee support is mostly okay.

I have experienced intermittent issues in this use case, particularly with CentraLite plugs (SmartThings, Iris, CentraLite).

Removing all CentraLite plugs from HA reduced issues with my other Zigbee devices. I almost never have to restart HA to force redetection of Zigbee devices. When I did have to restart, the devices that gave me the most problems with redetection (when still included in HA) were the CentraLite plugs.

HASS on Le Potato ARM board (similar to Raspberry Pi)

HASS installation was a breeze. Everything was detected, but there the fun stopped.

The observed behavior described in this GitHub issue occurred within hours. Workarounds like restarts fixed it temporarily, just as others have observed.

Using HASS itself may or may not have been the problem. It may have just been the ARM hardware itself.

Removing the CentraLite plugs from HASS did not eliminate the problem as it did in the Native approach, but the period between needed restarts seemed to be longer without them.

Suppositions

I would love more data points to make better hypotheses, guesses, assumptions, and suppositions, but here goes. Hopefully others following this issue can pile on with added insight.

Hypothesis 1: HUSBZB-1 and Docker do not mix

I do not pretend to know all of the reasons this might be the case, but until I try to run HA as a Native process (not in Docker) on an ARM machine and determine whether Native vs. HASS behaves similarly—which I assume it will—the evidence points to Docker somehow playing a part in this.

Hypothesis 2: Tiny ARM boards are the issue

I don't know what hardware others are using, but it might be useful to add those data points, as it could help narrow things down, since my experience is that the HUSBZB-1 seems to work acceptably in Native mode on an older Intel laptop.

Admission: I have not yet tested whether the HUSBZB-1 works similarly in Native mode on an ARM machine. I would like to add this data point, though perhaps someone else has it already.

Hypothesis 3: Poison Devices

Are some devices more "toxic" to zigpy/bellows than others? This issue does seem to also be tied to the async implementation in python, but how? Why are (in my experience) the CentraLite plugs problematic?

Hypothesis 4: It's async's fault

Something about how Python's async—or zigpy/bellows' implementation of it—is working (or not) seems to play a part. How does Docker tie into this? It seems possible that something about timing and system clocks affects the combination of zigpy/bellows and Docker more than running Natively.

untergeek commented 5 years ago

It would be very interesting to see whether things are different with Python 3.7, which apparently updates how async behaves. Home Assistant 0.80 was supposed to add support for Python 3.7, though I do not know whether the dependent libraries (particularly zigpy and bellows) have been updated similarly.

walthowd commented 5 years ago

The root cause is uncaught exceptions which leak the sequence IDs. 256 sequence IDs are available and each zha operation receives the next ID. Once all 256 IDs have leaked and been allocated/orphaned, all subsequent operations will fail until ZHA or HA is restarted.

The behaviour will be exacerbated by a weak Zigbee mesh, a chatty network, or certain devices. @rcloran fixed some of the leaks in the bellows updates, and there are currently a few more places where exceptions need to be caught.
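
The mechanics are easy to reproduce in miniature. In this toy model (not bellows code), an 8-bit counter hands out sequence IDs, one request leaks its ID, and the assertion fires exactly one wrap of the counter later:

```python
pending = {}
seq = 0

def next_sequence():
    global seq
    s = seq
    seq = (seq + 1) % 256  # 8-bit counter wraps after 256 requests
    return s

def send(leak=False):
    s = next_sequence()
    # The same check bellows performs in application.py's request().
    assert s not in pending
    pending[s] = "in-flight"
    if not leak:
        del pending[s]  # normal completion releases the ID

send(leak=True)       # one uncaught exception leaves ID 0 in pending
for _ in range(255):
    send()            # the rest of the 8-bit space is used and released

try:
    send()            # counter wraps back to 0 -> AssertionError
    collided = False
except AssertionError:
    collided = True

assert collided       # a single leak breaks every request that lands on that ID
```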

code-in-progress commented 5 years ago

@walthowd Great explanation. That actually makes perfect sense.

untergeek commented 5 years ago

@walthowd That's great news. My question is kind of moot in the face of that, but I would still like to understand why Docker/HASS made that worse. Do you have anything on that subject?

ryanwinter commented 5 years ago

Are the additional places known, or is more info needed from us users?

walthowd commented 5 years ago

@untergeek I'm not really that up to speed on docker internals, but I don't believe it should exacerbate the problem unless it's introducing latency via some additional overhead. From my rough understanding, that should not be the case -- but I also don't know how the serial device is passed through to docker.

@ryanwinter rcloran talks about some of the work to be done in the above comment here: https://github.com/zigpy/bellows/issues/124#issuecomment-418249612

...and I have a comment above where you can add some debug lines to see how your sequence IDs are leaking.
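To make the leak visible in your own logs, instrumentation along the following lines could work. This is a sketch only, not the exact lines from the earlier comment: `_pending` is an assumed attribute name, not a documented bellows API, so point `pending_count` at whatever structure your bellows/zigpy version actually uses to track in-flight requests.

```python
# Hypothetical debug instrumentation for watching sequence-ID usage.
# `_pending` is an assumption; adapt it to your bellows/zigpy version.
import asyncio
import logging

LOGGER = logging.getLogger(__name__)


def pending_count(app):
    """Number of sequence IDs currently awaiting a response."""
    return len(getattr(app, "_pending", {}))


async def log_pending(app, interval=60):
    """Log the in-flight count periodically.

    A count that only ever climbs toward 256 and never falls back is
    the signature of leaked sequence IDs.
    """
    while True:
        LOGGER.debug("in-flight sequence IDs: %d / 256", pending_count(app))
        await asyncio.sleep(interval)
```

On a healthy network the logged count should hover near zero between bursts of traffic; a steady upward drift means IDs are being orphaned.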

Adminiuga commented 5 years ago

Hypothesis 2: Tiny ARM boards are the issue

I think this is the most common factor among the reports. I had similar issues when I first started with hass (back in Feb 2018), but haven't seen them since I switched to an Intel NUC. On the Raspberry Pi I was running hass in a Python venv, so I don't think Docker is the root cause, though it certainly isn't helping. I switched to the NUC because restarting hass was taking too long. Still planning to try it on an Odroid, as I want to see how much faster eMMC is going to be.

Hypothesis 3: Poison Devices

Again, I don't think this is the root cause, but it very likely exacerbates the problem. Not so much that the devices themselves are poisonous, but that they contribute to a weak network and additional timeouts.

Also, in the current implementation only the HUSBZB is opened for permit joining, which means that if a new device is not a Zigbee router (Sengled bulbs are not Zigbee routers, are they?), it can only join with the HUSBZB as its parent device; in other words, having extra Zigbee routers doesn't help. I need to find some time to finish the PR for broadcast ZDO support...

code-in-progress commented 5 years ago

@Adminiuga I'm not 100% certain that ARM is the issue, though. I'm currently running HASS via Docker on an Intel PC (see below) and still having issues on the zha side. I also tried HASS via venv and had the exact same issues on both rPis and my home server. The only constant between setups was the HUSBZB-1.

Referring to your statement on poison devices, I have a total of 9 zha devices right now. 3 of those should be repeaters (my plugs). In terms of range, it shouldn't be that big of a deal, as the farthest plug is about 30' from my stick. However, what you are suggesting about the HUSBZB being the only possible parent device makes a lot of sense. To test things out, I removed all my Hue bulbs, my Sengled bulbs, and my Osram bulbs and left just motion sensors, contact sensors, and plugs.

To be honest, I'm about ready to just ditch ZHA altogether and go with WiFi bulbs and z-wave+ for all my sensors. :-(

This is the CPU for the machine that I have HASS running on:

```
root@homeserver:/media/homeassistant# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Model name:            Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Stepping:              3
CPU MHz:               3604.781
CPU max MHz:           4000.0000
CPU min MHz:           800.0000
BogoMIPS:              7183.40
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
```

```
root@homeserver:/media/homeassistant# lshw -short
              system     OptiPlex 9020 (OptiPlex 9020)
/0            bus        00V62H
/0/0          memory     64KiB BIOS
/0/3a         processor  Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
/0/3a/3b      memory     256KiB L1 cache
/0/3a/3c      memory     1MiB L2 cache
/0/3a/3d      memory     8MiB L3 cache
/0/3e         memory     16GiB System Memory
```

untergeek commented 5 years ago

I think the ID leak is the real cause. I have great hopes the HUSBZB-1 will continue to work

code-in-progress commented 5 years ago

> I think the ID leak is the real cause. I have great hopes the HUSBZB-1 will continue to work

I don't think the HUSBZB-1 will stop working; it just needs some love from the bellows/zigpy side. As a z-wave+ controller, the stick is AWESOME.