zigpy / bellows

A Python 3 project to implement EZSP for EmberZNet devices
GNU General Public License v3.0

zha lights work OK following homeassistant.restart, but stop responding within hours #124

Closed wixoff closed 5 years ago

wixoff commented 6 years ago

I posted something about this in the Community Forum, but I hope it will get better visibility here.

At the moment my zha network has seven light bulbs, three door sensors, and one switched outlet. The sensors are reliable, but the lights and switch become unavailable (and unresponsive to commands) after a few hours.

SETUP

Currently the lights are all Osram/Sylvania Lightify (U.S. versions) - five are RGBW A19 and two are Tunable White A19. When they work, they work great - almost instant response. The switched outlet is an IRIS v2 plug (reads as CentraLite 3210-L, with the odd dual z-wave/zigbee radios).

I’m not sure what the issue is, because ZHA had been pretty darn reliable for a good number of months. (I also have a Tradfri 1000lm A19 bulb that I was able to include months ago, but I have since reset it, nuked my zigbee.db, and attempted to re-add, and it will no longer show up.) Sometimes repeating the request via the UI several times in a row will cause the bulb to wake up and respond, and eventually even that will change to no responses whatsoever.

I also have three Visonic MCT-340E door/window sensors spread around a fairly large house. Even after the bulbs (and the IRIS plug) quit responding, these sensors still work and are very reliable. The built-in temperature sensor works too, on one of them; the others never change their temperature and one of those is stuck at 32F.

ERRORS

Here’s what an error in the log looks like after the lights stop responding - this represents an attempt to turn off a light via the hass UI:

Mon Jul 09 2018 00:10:40 GMT-0400 (EDT)
Error executing service ServiceCall light.turn_off: entity_id=['light.osram_lightify_a19_rgbw_00a3342d_3']
Traceback (most recent call last):
File "/usr/src/app/homeassistant/core.py", line 1021, in _event_to_service_call
await service_handler.func(service_call)
File "/usr/src/app/homeassistant/components/light/__init__.py", line 362, in async_handle_light_service
await light.async_turn_off(**params)
File "/usr/src/app/homeassistant/components/light/zha.py", line 127, in async_turn_off
await self._endpoint.on_off.off()
File "/usr/local/lib/python3.6/site-packages/zigpy/device.py", line 89, in request
expect_reply=expect_reply,
File "/usr/local/lib/python3.6/site-packages/bellows/zigbee/application.py", line 213, in request
assert sequence not in self._pending
AssertionError

And here is another error, trying to use a scene to turn off four lights (as noted above, the Tradfri light fails because it's currently disconnected):

2018-07-10 07:11:09 WARNING (MainThread) [homeassistant.helpers.state] reproduce_state: Unable to find entity switch.ikea_of_sweden_tradfri_bulb_e26_ws_opal_980lm_fe477406_1
2018-07-10 07:11:09 ERROR (MainThread) [homeassistant.core] Error executing service ServiceCall <light.turn_off: entity_id=['light.osram_lightify_a19_tunable_white_0001d4ac_3', 'light.osram_lightify_a19_tunable_white_0001e6c8_3', 'light.osram_lightify_a19_rgbw_000a381d_3']>
Traceback (most recent call last):
File "/usr/src/app/homeassistant/core.py", line 1021, in _event_to_service_call
await service_handler.func(service_call)
File "/usr/src/app/homeassistant/components/light/__init__.py", line 362, in async_handle_light_service
await light.async_turn_off(**params)
File "/usr/src/app/homeassistant/components/light/zha.py", line 127, in async_turn_off
await self._endpoint.on_off.off()
File "/usr/local/lib/python3.6/site-packages/zigpy/device.py", line 89, in request
expect_reply=expect_reply,
File "/usr/local/lib/python3.6/site-packages/bellows/zigbee/application.py", line 213, in request
assert sequence not in self._pending
AssertionError

There are no other zha-related errors in the log, other than during startup.

I can't imagine this is correct behavior. The five RGBW lights and the IRIS plug are all within 10 feet of the HUSBZB-1 stick, most of them line-of-sight. And yet they drop off just as quickly as the ones further away. And the other two Tunable White bulbs are only another 5 feet past the zigbee outlet switch (which should be a router), but they are behind a modern-construction wall (wood studs, wallboard, paint).

As I mentioned in my Community post, I'll update the firmware on the bulbs, but I don't expect improvement because the lights were working pretty well a few hass releases ago.

tbrock47 commented 5 years ago

Jumping on board here. Recently started using HUSBZB-1 and ZHA and would love to see more stability.

Adminiuga commented 5 years ago

I'm not 100% certain about ARM being an issue though. I'm currently running HASS via Docker on an Intel PC (see below) and still having issues on the zha side. I also tried it using HASS via venv and had the exact same issues on both rPis and my home server. The only constant between setups was the HUSBZB-1.

I think there are a few issues at play here, and both manifest as ZHA no longer responding:

  1. Leaking TSN
  2. HUSBZB-1 getting out-of-sync

for issue #1 you should see in homeassistant.log

    assert sequence not in self._pending
    AssertionError

for issue #2 you should be getting error frames in bellows.ezsp.
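
To make the "leaking TSN" failure mode concrete, here is a toy model of it. It is only a sketch: `PendingTable` and `requests_until_collision` are hypothetical names, not bellows code, and the only things taken from the traceback are the 8-bit wrap and the `assert sequence not in self._pending` check. It assumes a leaked entry is one whose reply never arrives and is never cleaned up.

```python
class PendingTable:
    """Toy model of a pending-request table keyed by an 8-bit sequence number."""

    def __init__(self):
        self._send_sequence = 0
        self._pending = {}

    def get_sequence(self):
        # 8-bit counter: wraps back to 0 after 255.
        self._send_sequence = (self._send_sequence + 1) % 256
        return self._send_sequence

    def request(self):
        sequence = self.get_sequence()
        # Same check as in the traceback: a reused number must not still be pending.
        assert sequence not in self._pending
        self._pending[sequence] = "awaiting reply"
        return sequence

    def reply(self, sequence):
        # Normal completion removes the entry and frees the number.
        self._pending.pop(sequence, None)


def requests_until_collision(leak_at=5, limit=1000):
    """Send requests, losing the reply to request number `leak_at`.

    Returns how many requests were sent when the assertion first fires,
    or None if it never fires within `limit` requests.
    """
    table = PendingTable()
    for count in range(1, limit + 1):
        try:
            sequence = table.request()
        except AssertionError:
            return count
        if count != leak_at:
            table.reply(sequence)
    return None


print(requests_until_collision())  # prints 261
```

In this model a single leaked entry is enough: the assertion fires exactly 256 requests after the leak, once the counter wraps back to the stuck number. That would fit a failure that shows up hours after a restart rather than immediately, though the model says nothing about why the entry leaks in the first place.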

@code-in-progress It seems that you are still leaking TSN. Double check that you have bellows==0.7.0 installed, collect logs from restart till you get the assertion error and submit it. You could try a quick'n'dirty hack for def get_sequence() in bellows.zigbee.application, make it something like

    def get_sequence(self):
        get_sqn_tries_left = 255
        while get_sqn_tries_left >= 0:
            get_sqn_tries_left -= 1
            self._send_sequence = (self._send_sequence + 1) % 256
            if self._send_sequence not in self._pending:
                break

        if get_sqn_tries_left < 254:
            LOGGER.debug(
                "Got sqn {} on {} try. current/max pending {}/{} trns".format(
                    self._send_sequence,
                    255 - get_sqn_tries_left,
                    len(self._pending),
                    self._max_pending,
                )
            )

        assert get_sqn_tries_left >= 0
        return self._send_sequence

or https://github.com/zigpy/bellows/issues/124#issuecomment-421787305
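
For anyone who wants to try the skip-over-pending idea outside of Home Assistant first, here is a standalone sketch of the same search. `next_free_sequence` is a hypothetical helper, not part of bellows; it takes the current counter and the pending table as plain arguments. Unlike the loop above, which still trips its final assert when the free number is only found on the 256th try, this version checks all 256 candidates before giving up. Either way it only works around leaked entries, it does not stop the leak.

```python
def next_free_sequence(current, pending, space=256):
    """Return the first sequence number after `current` that is not in `pending`.

    The search wraps around at `space`. Raises RuntimeError when every
    number in the space is still pending.
    """
    for step in range(1, space + 1):
        candidate = (current + step) % space
        if candidate not in pending:
            return candidate
    raise RuntimeError("all {} sequence numbers are pending".format(space))


# Numbers 5 and 6 are stuck, so the next free one after 4 is 7.
print(next_free_sequence(4, {5: "leaked", 6: "leaked"}))  # prints 7
```

A caller would store the returned value back into its own counter before registering the new request.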

@tbrock47 you would need to provide more details than that:

  1. what devices you have
  2. hardware, environment and setup
  3. logs

code-in-progress commented 5 years ago

@Adminiuga Correct. I am still leaking TSNs (already verified that). I'll implement the code fixes tomorrow morning (next chance that I'll have to get coding) and report back.

wixoff commented 5 years ago

I'm not 100% certain about ARM being an issue though. I'm currently running HASS via Docker on an Intel PC (see below) and still having issues on the zha side. I also tried it using HASS via venv and had the exact same issues on both rPis and my home server. The only constant between setups was the HUSBZB-1.

@code-in-progress Well, I'm going to contribute absolutely nothing helpful by observing that my problems are also occurring in Docker on the exact same Intel CPU with the exact same amount of memory and, of course, the same USB stick. (The host system runs Fedora 28 in my case.)

Adding to my comment from October 12, I've been back from my trip for two weeks now, all of the bulbs still work reliably (with one exception, described below), and I have restarted the Docker image once for an unrelated reason.

Some of my bulbs are very slow to respond, though, like 10-15 seconds. Eventually they do respond. What's odd is that turning them on or off is slow, but color/brightness changes are nearly instantaneous.

Now here's the one exception: I put my one IKEA Tradfri bulb back into service some time ago, and it had been working fine. At some point in the last couple of weeks, it disappeared and restarting hass didn't cause it to reappear (as bulbs generally do with this bug). I had to manually reset the bulb and re-join it via zha.permit. Now it's fine again, and it picked up the same entity_id it had before.

tbrock47 commented 5 years ago

@tbrock47 you would need to provide more details than that:

  1. what devices you have
  2. hardware, environment and setup
  3. logs

Sorry, I was more or less subscribing to the thread to watch developments. But since you asked and since I'm here...

Raspberry Pi 3 B+ w/ HUSBZB-1
Hass.io 0.80.3
Two GE Zwave switches (no issues)
5 Sengled LED bulbs using zha

Some of my most recent logs.

[bellows.zigbee.application] Unexpected response TSN=107 command=1 args=[[<ReadAttributeRecord attrid=0 status=0 value=Bool.true>]]
[homeassistant.core] Timer got out of sync. Resetting
[homeassistant.components.updater] Got unexpected response: None
[bellows.zigbee.application] Unexpected response TSN=112 command=1 args=[[<ReadAttributeRecord attrid=0 status=0 value=Bool.true>]]
[bellows.zigbee.application] Unexpected message send notification
[bellows.zigbee.application] Unexpected response TSN=113 command=1 args=[[<ReadAttributeRecord attrid=0 status=0 value=Bool.true>]]
[bellows.zigbee.application] Unexpected message send notification
[bellows.zigbee.application] Unexpected response TSN=112 command=1 args=[[<ReadAttributeRecord attrid=0 status=0 value=Bool.true>]]
[bellows.zigbee.application] Unexpected message send notification
[bellows.zigbee.application] Unexpected response TSN=113 command=1 args=[[<ReadAttributeRecord attrid=0 status=0 value=Bool.true>]]
[bellows.zigbee.application] Unexpected message send notification
[bellows.zigbee.application] Unexpected message send failure
[homeassistant.core] Timer got out of sync. Resetting
[homeassistant.components.light] Updating zha light took longer than the scheduled update interval 0:00:30
[homeassistant.components.light] Updating zha light took longer than the scheduled update interval 0:00:30
[homeassistant.components.light] Updating zha light took longer than the scheduled update interval 0:00:30
[homeassistant.components.light] Updating zha light took longer than the scheduled update interval 0:00:30
[homeassistant.components.light] Updating zha light took longer than the scheduled update interval 0:00:30
[homeassistant.components.light] Updating zha light took longer than the scheduled update interval 0:00:30
[homeassistant.components.light] Updating zha light took longer than the scheduled update interval 0:00:30
[homeassistant.components.light] Updating zha light took longer than the scheduled update interval 0:00:30
[homeassistant.components.light] Updating zha light took longer than the scheduled update interval 0:00:30

code-in-progress commented 5 years ago

@Adminiuga I am currently running bellows 0.7.0:

    root@homeserver:/usr/src/app# pip show bellows
    Name: bellows
    Version: 0.7.0
    Summary: Library implementing EZSP
    Home-page: http://github.com/zigpy/bellows
    Author: Russell Cloran
    Author-email: rcloran@gmail.com
    License: GPL-3.0
    Location: /usr/local/lib/python3.6/site-packages
    Requires: Click, click-log, pure-pcapy3, pyserial-asyncio, zigpy
    Required-by:

I added the get_sequence "fix" and I'll monitor during the day.

cmgreenman commented 5 years ago

To add another data point: I'm seeing the exact same symptoms. Everything works great for a few hours and then it quits. All but one of the zigbee devices are controlled only by automations. I rarely switch things on and off manually or through the UI.

My setup is as follows:
Odroid C2+ running Ubuntu 18.04
HUSBZB-1 stick
HASS 0.80.0 running in venv
Bellows 0.7.0
Python 3.6.6
Zigbee devices: 3x Sylvania Smart+ appliance switches, 4x Sylvania Smart+ RGBW BR30 Floods
Zwave devices are primarily GE wall switches/dimmers throughout the house with a few battery operated motion sensors, some Aeotec plug in dimmers, a couple cheap monoprice no-name switches, and a couple GoControl dimmable bulbs. Total Zwave network is about 35 devices.

I initially thought the issue was with the stick and was about to buy a Telegesis LRS stick and switch back to my old Z-stick for zwave. One thing I noticed that is new since my last upgrade is that on boot zha errors out, not able to open /dev/ttyUSB1. Once the system is up I can restart HASS and everything is fine until the zha devices stop responding. I've tried setting up the systemd service to start way late in the boot sequence but it doesn't seem to matter. I've also noticed that sometimes even the Zwave network starts acting up and scenes/automations quit working.

My logger is configured as such:

    logger:
      default: warning
      logs:
        homeassistant.components.zha: debug

I'm not seeing the AssertionError, but I am getting:

2018-10-26 22:52:17 ERROR (MainThread) [homeassistant.core] Timer got out of sync. Resetting

Hope this helps. Please let me know if I need to post additional info to help solve this.

cmgreenman commented 5 years ago

I added the get_sequence "fix" and I'll monitor during the day.

I'm curious, where do you add the fix? bellows/ezsp.py?

Adminiuga commented 5 years ago

@cmgreenman

My logger is configured as such:

    logger:
      default: warning
      logs:
        homeassistant.components.zha: debug

Add these two entries, indented under logs: bellows.zigbee.application: debug and bellows.ezsp: debug

valexi7 commented 5 years ago

I initially thought the issue was with the stick and was about to buy a Telegesis LRS stick and switch back to my old Z-stick for zwave.

@cmgreenman Don't worry. It's not your stick. I have a Telegesis ETRX357USB-LRS and I also get the AssertionError and the "Timer got out of sync" error.

Setup:
Hassbian 0.80.0
Rpi3 B+
Razberry2 Z-wave module
Telegesis ETRX357USB-LRS

Adminiuga commented 5 years ago

@cmgreenman

My setup is as follows: Odroid C2+ running Ubuntu 18.04 HUSBZB-1 stick HASS 0.80.0 running in venv Bellows 0.7.0 Python 3.6.6 Zigbee devices: 3x Sylvania Smart+ appliance switches 4x Sylvania Smart+ RGBW BR30 Floods

This is interesting. What else is running on the Odroid, and what other components do you have enabled in homeassistant? I wonder if anything in the homeassistant event loop blocks the loop for too long. Any chance you could run a HASS instance with a very minimal configuration file, like "frontend/config/http" and "zha" sections only in configuration.yaml?

cmgreenman commented 5 years ago

I'll try the minimal config. Other components I'm using are mqtt for some esp8266 temp/humidity sensors and some wifi floods. Also using nmap device tracker (added after issue started) and Logitech Harmony. My roku TVs show up with the Discovery component as well. The only other process I have running on the Odroid right now is Mosquitto. I did have minidlna but it's currently disabled.

tbrock47 commented 5 years ago

Knock on wood, but ever since my hassio upgrade to 0.81.0, I've had zero issues with zha. I didn't catch any zha changes in the change log however.

valexi7 commented 5 years ago

I have 0.81.0 and still have this issue. @code-in-progress Did the "fix" from @Adminiuga work for you?

code-in-progress commented 5 years ago

@valexi7 Nope. I upgraded to 0.81.0 with hope that it would fix itself and it hasn't. So, for now I've moved all my zha stuff back to my SmartThings hub and turned off zha in HA until there's a definitive fix. The wife acceptance factor was falling quickly, so I didn't have much choice. :(

Adminiuga commented 5 years ago

I've built a test system on a cubox-i i2-ex (2x core) which runs a minimal configuration (zigbee and zwave) and turns a light on/off every 10 min. I wasn't able to reproduce the problem, but the longest I was able to run it was about 12 hours, after which I had to restart it for other reasons. Ordered myself an elelabs zigbee shield so I could leave it running for longer. It is going to be a PITA to pinpoint the issue, but I think that whatever the underlying issue is, it is exacerbated by a weak network, because in my test setup all devices were in the immediate vicinity: 2x zigbee routers and a few "child" devices.

walthowd commented 5 years ago

@Adminiuga I can get you SSH access to my box running a test config to replicate the issue within a few minutes.

cmgreenman commented 5 years ago

Update. Been doing some testing with minimal configs. Everything seems to run great until I enable zwave, then it goes back to dying after a couple hours. Without zwave enabled it runs great for at least 48 hours or more. With zwave AND zha enabled it only goes a few hours before the zha and zwave stop responding. My mqtt lights all work fine.

Adminiuga commented 5 years ago

Update. Been doing some testing with minimal configs. Everything seems to run great until I enable zwave, then it goes back to dying after a couple hours

This is an interesting find. I don't have any zwave devices, but I could find one zwave device and add it to the mix. This would fall in line with the supposition that the raspberry devices might not have enough horsepower to run latency-sensitive EZSP communication alongside other components which run in the same loop/thread. I wonder if it would be possible to run a separate hass instance dedicated just to ZHA, so the two would run on different cores, and link them together.

Can everyone else with the problem report if they are running zwave and how many zwave devices are there?

@Adminiuga I can get you SSH access to my box running a test config to replicate the issue within a few minutes.

No pressure, right :D are you on hassio?

walthowd commented 5 years ago

@Adminiuga I'm actually running straight on Mac OS X with Python 3.6, no venv or any abstractions. Base system is a 2007 iMac with a 2.4 GHz Intel Core 2 Duo. I am running Z-Wave as well on the HUSBZB-1 with about 20 zwave devices.

I can run some tests without Z-wave enabled and report back.

I'm also serious on the SSH access, no pressure, but if you want some good quick logs of the issue happening it should be easy to see.

adrum commented 5 years ago

I also have 20+ Zwave devices, and about 6 Hue lights all connected to my AMD64 Ubuntu 16.04 box with a HUSBZB-1 stick. When I first migrated my devices over to HA about a month ago, my devices would stop responding after a few hours. After I applied this, it seems to be better, though not completely fixed. Running on HA 0.81.1 via Docker.

Adminiuga commented 5 years ago

@walthowd

I can run some tests without Z-wave enabled and report back.

This would be a good start. Run it without zwave. As the next step, run both zigbee and zwave at the same time but on different hass instances (different config directories for hass) with a bare minimum config for zigbee. I'm not sure about the internals of the HUSBZB-1, but I hope that internally the Zwave and Zigbee radios are not sharing any resources other than the USB line.

techdoutdev commented 5 years ago

FWIW - I have 10 zigbee devices and 6 z-wave devices using the HUSBZB-1 stick. Note that on v67 of HA, everything works well together (zwave and zigbee). It wasn't until v68 and forward the issues started popping up. This leads me to think that zwave isn't the culprit, though it could still be a factor. I've run on both hassbian and docker.

walthowd commented 5 years ago

@Adminiuga I can confirm the sequence IDs still leak with z-wave disabled. I ran a test with a minimal zha only config and leaked two IDs after 26 minutes. This was with toggling two lights on and off every five seconds. I can provide the log if needed.

This is probably about the rate that I typically leak sequence IDs with the exception catching in the latest bellows release -- I was running bellows-0.7.0 and Home Assistant 0.79.2
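
As a sketch of what "not leaking" a sequence ID means in code: the usual pattern is to remove the pending entry in a `finally` block, so it is freed whether the reply arrives, the wait times out, or the send itself raises. The function below is illustrative only. `request_with_cleanup`, its parameters, and the timeout value are assumptions made for this example and are not the bellows API.

```python
import asyncio


async def request_with_cleanup(pending, sequence, send, timeout=0.05):
    """Register a future under `sequence`, send, and wait for the reply.

    The entry is always removed afterwards, even on timeout or send failure.
    """
    loop = asyncio.get_running_loop()
    reply = loop.create_future()
    pending[sequence] = reply
    try:
        await send(sequence)
        return await asyncio.wait_for(reply, timeout)
    finally:
        # Runs on success, timeout, and any exception from send().
        pending.pop(sequence, None)


async def demo():
    pending = {}

    async def silent_send(sequence):
        # Simulates a device that never answers.
        return None

    try:
        await request_with_cleanup(pending, 7, silent_send)
    except asyncio.TimeoutError:
        pass
    return pending


print(asyncio.run(demo()))  # prints {}
```

After the timed-out request the table is empty again, so sequence 7 can be reused safely on the next wrap.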

cmgreenman commented 5 years ago

My installation was rock solid as well before upgrading to 0.7x.x. I'm running 7 zha devices and about 30 or so zwave devices.

holelattanuttin commented 5 years ago

I was running the HUSBZB-1 on a Raspberry Pi using Hassio and my Zwave was fine, but my Zigbee devices would stay up for 4 hours tops before not responding with errors, even when I upgraded to the latest HA (0.81.1). I disabled ZHA for a while after that. I installed HA on an Intel NUC using a python venv, migrated my whole configuration, and then enabled ZHA again. I have not had any issues for the past 24 hours. I only have two Iris Motion Detectors, a single Iris Power Outlet and a single Osram Lightify Light. That is just my anecdotal experience.

ryanwinter commented 5 years ago

I run HA using FreeBSD on a quad Xeon processor, so I have plenty of CPU. I typically have no problems running ZHA for long periods of time (weeks).

The only time I was getting the problems listed in this thread was during an upgrade when disk idling was turned on accidentally (I run a RAID of HDDs). When this was happening, I saw the exact problems described here, where everything would just slow down and stop responding after a short period of time. Turning off power saving on the HDDs fixed everything.

My view from this is that the current implementation, or the python platform itself, is very sensitive to any introduced time delays, whether from a HDD/CPU/cron job/etc. Finding what is causing those delays can be very tricky.
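
One way to look for such delays is to measure how late the event loop wakes up from a short sleep. The sketch below is a generic asyncio technique and not something Home Assistant or bellows ships; `measure_loop_lag` and `blocking_work` are made-up names, and the blocking call merely stands in for slow disk I/O or any other work done on the loop thread.

```python
import asyncio
import time


async def measure_loop_lag(interval=0.05, samples=5):
    """Sleep `interval` seconds repeatedly and record how late each wake-up is."""
    lags = []
    for _ in range(samples):
        start = time.monotonic()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.monotonic() - start - interval))
    return lags


async def blocking_work(duration):
    # Deliberately wrong: time.sleep() blocks the whole event loop.
    time.sleep(duration)


async def demo():
    monitor = asyncio.ensure_future(measure_loop_lag())
    await asyncio.sleep(0.01)   # let the monitor start its first sleep
    await blocking_work(0.3)    # stall the loop for 0.3 s
    return await monitor


lags = asyncio.run(demo())
print("worst lag: {:.3f}s".format(max(lags)))
```

On an otherwise idle machine the worst lag should come out at roughly a quarter of a second, because the first wake-up has to wait for the blocking call to finish. A monitor like this running next to a real workload would show whether the loop is being starved around the time devices stop responding.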

holelattanuttin commented 5 years ago

So at hour 48 my Zigbee devices all failed on the NUC with the 'delivery failed' message. So it happens on different platforms, but it is easier to replicate (more quickly) on the raspberry pi.

techdoutdev commented 5 years ago

I’ve noticed another odd trend that I can’t explain: I’ve been running HA 78.3 as it appears to work the best compared to later releases. My zha devices do eventually fail, but it takes maybe 4-5 days or so. However when I reboot, it only takes a handful of hours for the device to fail. It’s not until I wipe & restore a Hassio snapshot that I saved shortly after setting up 78.3 initially that I can get another 4-5 days without issue. I’ve replicated this for a couple months now. This leads me to think there is some sort of log being built up that is affecting this. A restart doesn’t clear the log but restoring a snapshot does. Any ideas?

Another trend I’ve noticed is that the iris plugs (Centralite 3210) are a common theme among folks with issues. Before I get the error of “device taking longer than 30 secs to respond”, which indicates my zha network has failed, I get a notification specifically that the centralite unit is taking longer to respond.

Adminiuga commented 5 years ago

How large is your recorder DB? I have a theory that it could be related to disk IO. I had my test environment running for days without any issue, but one time I decided to copy a larger file to the MMC and it locked up zha. Try to reconfigure the recorder component so it logs only the entities you really need, and make it hold only 3-4 days of data. I really don't see a point in recording sun position or ZhaDevice entities, and those tend to generate a state change event every time rssi/lqi changes.

techdoutdev commented 5 years ago

@Adminiuga Thanks! It wasn't huge but I'll give it a shot and tweak the recorder settings regardless. Most recently it was only 200MB or so. In older versions it's been greater than a GB without issue.

wixoff commented 5 years ago

I’ve noticed another odd trend that I can’t explain: I’ve been running HA 78.3 as it appears to work the best compared to later releases. My zha devices do eventually fail, but it takes maybe 4-5 days or so. However when I reboot, it only takes a handful of hours for the device to fail. It’s not until I wipe & restore a Hassio snapshot that I saved shortly after setting up 78.3 initially that I can get another 4-5 days without issue.

No ideas here, unfortunately. My experience seems much more random -- sometimes it takes just a few hours before bulbs start dropping out, and other times I'll restart HA (same version, no changes, nothing restored) and they will keep working for days or weeks.

cmgreenman commented 5 years ago

More data. I transitioned my zwave Network to my older Zstick gen5 on the Odroid C2+ and moved the HUSBZB1 to a Raspi 3b. I have both event streams and state streams configured on both instances. The Raspi only has zha enabled plus Mqtt for the statestream/eventstream.

It has been pretty solid since. The zha bulbs are a little sluggish but work using the UI on the other instance. The strange thing is my Schlage touchpanel zwave locks started working again. They haven't worked for a long time. I first got them working using the zstick about a year ago but after a couple weeks they quit responding. I had to exclude and re-include them into the zwave network. Later I bought the HUSBZB and got the same results. They would work for a couple weeks then stop. They were included using a network key and could be controlled using the UI or via automations up until they stopped responding. I even went as far as contacting Schlage but HA is not a supported platform. Suddenly, after moving the network back to the zstick, everything is working again.

We'll see how long they work.

techdoutdev commented 5 years ago

@Adminiuga - I set my DB to be purged daily and to keep only 3 days' worth of records. I'll report back with anything new. One thing definitely stood out: my centralite plugs were reporting wattage used constantly, even when there was no change. These dominated the db - 80% of it was these.
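
For anyone who wants to check which entities dominate their own recorder database, a query along these lines is one way to do it. This is a sketch under assumptions: it presumes a SQLite recorder database with a `states` table that has an `entity_id` column, which may not match every Home Assistant version, and the demo runs against an in-memory stand-in with made-up entity names rather than a real database file.

```python
import sqlite3


def rows_per_entity(conn, limit=10):
    """Return (entity_id, row_count) pairs, largest first."""
    query = (
        "SELECT entity_id, COUNT(*) AS n FROM states "
        "GROUP BY entity_id ORDER BY n DESC, entity_id LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


# In-memory stand-in for the recorder database (assumed schema, fake data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE states (entity_id TEXT, state TEXT)")
sample = [("sensor.plug_power", "0.0")] * 8 + [("light.kitchen", "on")] * 2
conn.executemany("INSERT INTO states VALUES (?, ?)", sample)

for entity_id, count in rows_per_entity(conn):
    print(entity_id, count)
```

Entities near the top of the list are the first candidates to exclude from the recorder.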

adamthole commented 5 years ago

Hi Everyone,

I think I am also experiencing this issue. I'm running Home Assistant 0.82.0 and using the HUSBZB1 to talk to 2 Sengled Zigbee light bulbs. I had an automation setup to cycle the 2 lights alternately between red/green every 5 seconds. This would work for a few hours, until it would eventually throw this error:

Error executing service <ServiceCall light.turn_on (c:1cd1e0267261431b8ae43c01dffe9218): color_name=green, entity_id=['light.sengled2']>
Traceback (most recent call last):
File "/srv/homeassistant/lib/python3.5/site-packages/homeassistant/core.py", line 1177, in _event_to_service_call
await service_handler.func(service_call)
File "/srv/homeassistant/lib/python3.5/site-packages/homeassistant/components/light/__init__.py", line 270, in async_handle_light_on_service
await light.async_turn_on(**pars)
File "/srv/homeassistant/lib/python3.5/site-packages/homeassistant/components/light/zha.py", line 108, in async_turn_on
duration,
File "/srv/homeassistant/lib/python3.5/site-packages/zigpy/device.py", line 89, in request
expect_reply=expect_reply,
File "/srv/homeassistant/lib/python3.5/site-packages/bellows/zigbee/application.py", line 213, in request
assert sequence not in self._pending
AssertionError

I changed my automations so the transition would occur every second, and now the failure happens within 10 minutes or so of restarting the Home Assistant service. Here are the automations:

- id: red-green-transition
  alias: Red Green Transition
  trigger:
  - platform: time
    seconds: '0'
  - platform: time
    seconds: '2'
  - platform: time
    seconds: '4'
  - platform: time
    seconds: '6'
  - platform: time
    seconds: '8'
  - platform: time
    seconds: '10'
  - platform: time
    seconds: '12'
  - platform: time
    seconds: '14'
  - platform: time
    seconds: '16'
  - platform: time
    seconds: '18'
  - platform: time
    seconds: '20'
  - platform: time
    seconds: '22'
  - platform: time
    seconds: '24'
  - platform: time
    seconds: '26'
  - platform: time
    seconds: '28'
  - platform: time
    seconds: '30'
  - platform: time
    seconds: '32'
  - platform: time
    seconds: '34'
  - platform: time
    seconds: '36'
  - platform: time
    seconds: '38'
  - platform: time
    seconds: '40'
  - platform: time
    seconds: '42'
  - platform: time
    seconds: '44'
  - platform: time
    seconds: '46'
  - platform: time
    seconds: '48'
  - platform: time
    seconds: '50'
  - platform: time
    seconds: '52'
  - platform: time
    seconds: '54'
  - platform: time
    seconds: '56'
  - platform: time
    seconds: '58'
  condition:
    condition: or  # 'when dark' condition: either after sunset or before sunrise
    conditions:
      - condition: sun
        after: sunset
      - condition: sun
        before: sunrise
  action:
  - service: light.turn_on
    data_template:
      entity_id: light.sengled1
      color_name: red
  - service: light.turn_on
    data_template:
      entity_id: light.sengled2
      color_name: green

- id: green-red-transition
  alias: Green Red Transition
  trigger:
  - platform: time
    seconds: '1'
  - platform: time
    seconds: '3'
  - platform: time
    seconds: '5'
  - platform: time
    seconds: '7'
  - platform: time
    seconds: '9'
  - platform: time
    seconds: '11'
  - platform: time
    seconds: '13'
  - platform: time
    seconds: '15'
  - platform: time
    seconds: '17'
  - platform: time
    seconds: '19'
  - platform: time
    seconds: '21'
  - platform: time
    seconds: '23'
  - platform: time
    seconds: '25'
  - platform: time
    seconds: '27'
  - platform: time
    seconds: '29'
  - platform: time
    seconds: '31'
  - platform: time
    seconds: '33'
  - platform: time
    seconds: '35'
  - platform: time
    seconds: '37'
  - platform: time
    seconds: '39'
  - platform: time
    seconds: '41'
  - platform: time
    seconds: '43'
  - platform: time
    seconds: '45'
  - platform: time
    seconds: '47'
  - platform: time
    seconds: '49'
  - platform: time
    seconds: '51'
  - platform: time
    seconds: '53'
  - platform: time
    seconds: '55'
  - platform: time
    seconds: '57'
  - platform: time
    seconds: '59'
  condition:
    condition: or  # 'when dark' condition: either after sunset or before sunrise
    conditions:
      - condition: sun
        after: sunset
      - condition: sun
        before: sunrise
  action:
  - service: light.turn_on
    data_template:
      entity_id: light.sengled2
      color_name: red
  - service: light.turn_on
    data_template:
      entity_id: light.sengled1
      color_name: green

Maybe that automation will help someone be able to debug this issue faster. If you have any suggestions for me to try on my end, please let me know.

My logbook excludes the 2 automations above, so as not to fill it with useless information.

logbook:
  exclude:
    entities:
      - automation.green_red_transition
      - automation.red_green_transition

tbrock47 commented 5 years ago

@adamthole I'm using HUSBZB1 on a Pi3B+ and some Sengled bulbs as well and have had a hell of a time with zha reliability. Typically in less than 24 hours, the entire zigbee network just halts, but zwave continues to work without a flaw. I ended up moving all my zigbee devices back to my VeraPlus because of the unreliability of zha, but I would love for zha to be as rock solid as zwave so I can finally remove Vera from the equation.

I find it hard to believe that the actual issue that we are all experiencing hasn't been identified and resolved. As knowledgeable as all the contributors to this project seem to be, a problem like this is huge when it comes to looking at HA as the center of your home automation network. I would have thought it would be looked at as a high priority issue, but I guess not.

weevilkris commented 5 years ago

Some troubleshooting here on various platforms is leading me towards an IOwait/WOL or sleep issue that I personally was facing. Specifically, if I have the HUSBZB-1 plugged into a machine that goes into any kind of power saving mode, the zha module dies off inside of an hour or two but zwave is ok. I get the async errors when it wakes back up, and the only way around it is to restart HA. This is the case whether or not the machine fully goes to sleep, or just the HD sleeps. It's also the case if anything touches the USB stick or it gets loose and needs to be re-initiated by the OS. zwave survives, zha goes byebye.

I had a Mac running macOS High Sierra with a manually patched 0.7.1 bellows, and when I disabled all possibility of anything ever sleeping it got better. Migrating my install to an old Core 2 Duo Linux box with all power saving totally disabled and the USB stick about 10-15 feet from its nearest zigbee neighbor was the "magic recipe" here. My guess? The zigbee radio on the stick is more sensitive than the zwave, so when you lose any power for any amount of time or have any IOwait, the radio can't enumerate the zigbee network and you are SOL. So, looking at everybody's messages:

@adamthole I'm using HUSBZB1 on a Pi3B+ and some Sengled bulbs as well and have had a hell of a time with zha reliability. Typically in less than 24 hours, the entire zigbee network just halts, but zwave continues to work without a flaw. I ended up moving all my zigbee devices back to my VeraPlus because of the unreliability of zha, but I would love for zha to be as rock solid as zwave so I can finally remove Vera from the equation.


techdoutdev commented 5 years ago

@Adminiuga after making the changes to the recorder to limit the days kept and also excluding the centralite entities, I have yet to encounter zha freezing up. I haven't gone this long since HA v67. I think you may have been on to something though I'm not making the connection in my head how the log history would affect if zha works or not...

walthowd commented 5 years ago

@Adminiuga I'm still running your test branches and I'm seeing a dramatic slow down in orphaned session IDs. I had one week of Home Assistant uptime with normal zigpy/bellows activity and only had 3 orphaned session IDs during that time.

Adminiuga commented 5 years ago

@Adminiuga I'm still running your test branches and I'm seeing a dramatic slow down in orphaned session IDs. I had one week of Home Assistant uptime with normal zigpy/bellows activity and only had 3 orphaned session IDs during that time.

@walthowd for those orphaned TSNs, can you let me know the status of:

the status of the futures might provide hints why those are leaking. I can only think that it misses "messageSent" notifications as the "sent" future is currently not wrapped in wait_for/timeout
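To illustrate the point about the "sent" future not being wrapped in wait_for/timeout, here is a minimal, self-contained sketch. It is not the bellows code: the `pending` table, the `request` coroutine and the timeout values are hypothetical stand-ins. It only shows that, with `asyncio.wait_for` plus a `finally` cleanup, a notification that never arrives ends in a timeout and the TSN is released instead of staying in the table forever.

```python
import asyncio

# Hypothetical stand-in for the table of in-flight requests, keyed by TSN.
pending = {}


async def request(sequence, timeout=0.05):
    """Register a future for `sequence` and wait for it, but not forever."""
    assert sequence not in pending
    send_fut = asyncio.get_running_loop().create_future()
    pending[sequence] = send_fut
    try:
        # Without wait_for, a lost notification would park this coroutine
        # indefinitely and the entry in `pending` would never be removed.
        return await asyncio.wait_for(send_fut, timeout)
    except asyncio.TimeoutError:
        return "timeout"
    finally:
        # Always release the TSN, whether we got a result or timed out.
        pending.pop(sequence, None)


async def main():
    # Case 1: the notification never arrives.
    lost = await request(1)

    # Case 2: the notification arrives while we are waiting.
    task = asyncio.ensure_future(request(2, timeout=1.0))
    await asyncio.sleep(0)  # let request() register its future
    pending[2].set_result("sent")
    delivered = await task

    return lost, delivered, dict(pending)


result = asyncio.run(main())
```

In both cases the `pending` table is empty afterwards, which is the property that matters for the leak discussed here.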

@Adminiuga after making the changes to the recorder to limit the days kept and also excluding the centralite entities, I have yet to encounter zha freezing up

I bet it froze right after you posted this message :) I think we're seeing two issues, which might have the same root cause or maybe not

  1. Leaking TSNs -- result in: __assert sequence not in self._pending AssertionError__
  2. ezsp error frames -- before entering into error mode, you can see some "unexpected message send" notification and unexpected incoming messages, almost like something got out of sync
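As a rough illustration of issue 1, the toy model below shows how a single leaked entry can eventually trip an `assert sequence not in self._pending` check. It assumes an 8-bit sequence counter that wraps at 256 (ZCL transaction sequence numbers are 8-bit); the `LeakyRequester` class and its methods are hypothetical and are not taken from zigpy or bellows.

```python
class LeakyRequester:
    """Toy model: an 8-bit TSN counter plus a table of in-flight requests."""

    def __init__(self):
        self._seq = 0
        self._pending = {}

    def request(self, reply_arrives=True):
        # Assumed behaviour: the counter wraps around after 255.
        self._seq = (self._seq + 1) % 256
        sequence = self._seq
        assert sequence not in self._pending  # same shape as the check in the traceback
        self._pending[sequence] = "in flight"
        if reply_arrives:
            del self._pending[sequence]
        # If no reply arrives and nothing cleans up, the entry is leaked.
        return sequence


def requests_until_collision(leak_at=5):
    """Leak exactly one TSN, then count requests until the assertion fires."""
    requester = LeakyRequester()
    count = 0
    try:
        while True:
            count += 1
            requester.request(reply_arrives=(count != leak_at))
    except AssertionError:
        return count
```

In this model the request that reuses the leaked number fails once the counter has wrapped, so the failure shows up sooner when commands are sent more often.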

You mention that the problem became worse after 0.67 and we were looking at zigpy/bellows, but now I'm thinking that some change to a hass component running in the event loop is exacerbating this issue, and Pi based boards could be more prone to it because of the slow MMC card I/O. I'm running a bare hass with minimum components and a single Sengled bulb changing color every second and I can't reproduce the issue. Maybe I'll enable recorder and see if it makes things worse.

techdoutdev commented 5 years ago

HA v68 is when the bellows version was upgraded - I can’t recall the specifics but it’s in the release notes. That’s when the problems started - I figured that had something to do with it.

I think you might be onto something though - for whatever reason HA v78.3 works better for me than later and earlier versions (v68 and later). Your point about a component and the I/O rate of Pis sounds reasonable.

I’d also note, the issues seem to be exacerbated on a weaker, more spread out network. It may be hard to replicate with a single bulb in good range.

valexi7 commented 5 years ago

You mention that the problem became worse after 0.67 and we were looking at zigpy/bellows, but now I'm thinking that some change to a hass component running in the event loop is exacerbating this issue, and Pi based boards could be more prone to it because of the slow MMC card I/O.

I don't fully buy the slow speed of the MMC card and the processor of the rpi3. I migrated my HA installation to 8core Odroid-XU3 with 32GB EMMC 5.0 memory module.

This should be plenty enough speed for the zigbee to work? [image: emmc]

But still my zha network fails after 2 hours. I have only 3 IKEA GU10 lights in 3 meters distance from HA server.

dmulcahey commented 5 years ago

@valexi7 Yeah that should def work just fine. Do u have trace backs in the logs? Are there any errors related to ZHA?

valexi7 commented 5 years ago

At first there is nothing. When zha goes offline, the logs are flooded with this:

Log Details (ERROR)
Sun Dec 16 2018 21:31:30 GMT+0200 (Eastern European Standard Time)

Error executing service <ServiceCall light.turn_on (c:c20442317f2c45659067c5b9e7d447ad): entity_id=['light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed8366c_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_feebd210_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fef2c3e6_1']>
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.6/site-packages/homeassistant/core.py", line 1177, in _event_to_service_call
    await service_handler.func(service_call)
  File "/srv/homeassistant/lib/python3.6/site-packages/homeassistant/components/light/__init__.py", line 270, in async_handle_light_on_service
    await light.async_turn_on(**pars)
  File "/srv/homeassistant/lib/python3.6/site-packages/homeassistant/components/light/zha.py", line 125, in async_turn_on
    duration
  File "/srv/homeassistant/lib/python3.6/site-packages/zigpy/device.py", line 89, in request
    expect_reply=expect_reply,
  File "/srv/homeassistant/lib/python3.6/site-packages/bellows/zigbee/application.py", line 213, in request
    assert sequence not in self._pending
AssertionError

Adminiuga commented 5 years ago

I don't fully buy the slow speed of the MMC card and the processor of the rpi3. I migrated my HA installation to 8core Odroid-XU3 with 32GB EMMC 5.0 memory module. This should be plenty enough speed for the zigbee to work?

Well, that was one of the variables needing elimination. I got an Odroid precisely because of eMMC, but still haven't had a chance to try it. And frankly I wasn't able to reproduce it even on a weaker 2 core cubox-i. Those MMC performance numbers, are those for random or sequential writes? Also keep in mind that the number of CPU cores doesn't really matter, as hass runs its asyncio loop in a single thread, so bellows/zigpy and all async components run in the same thread. So whether you are running it on a two core or 8 core CPU won't have much impact on hass. The CPU core clock speed is going to have much more impact.

The reason I'm suspecting mmc performance: I was running @adamthole's automation, which changes the light's color every second, and if I was writing a big 250-500MB file to the SD card I could clearly see it affecting the timing of color switching when the system was flushing the buffers, but even then I still wasn't able to reproduce the issue.
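The single-thread point can be sketched in a few lines. This is only an illustration, not hass code: `ticker` stands in for a periodic task such as the color-changing automation, and `blocking_component` is a hypothetical component that does synchronous work (simulated with `time.sleep`) inside the event loop. The names and timings are made up for the example.

```python
import asyncio
import time


async def ticker(period, ticks, stamps):
    """Record a timestamp every `period` seconds, like a periodic command."""
    for _ in range(ticks):
        stamps.append(time.monotonic())
        await asyncio.sleep(period)


async def blocking_component(delay):
    """Hypothetical component doing synchronous work inside the loop."""
    await asyncio.sleep(0.05)
    time.sleep(delay)  # blocks the whole loop; stand-in for slow disk I/O


async def main():
    stamps = []
    await asyncio.gather(ticker(0.02, 10, stamps), blocking_component(0.3))
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return max(gaps)


worst_gap = asyncio.run(main())
```

Although the ticker asks to run every 20 ms, its worst gap is at least as long as the blocking call, no matter how many CPU cores are available, because both coroutines share one thread.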

So in your case you are leaking TSNs and I would really love to know the status of the send_future, as I can only think of either it being stuck at await self._ezsp.sendUnicast() or at await send_fut which causes leaked TSNs.
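For anyone wanting to report the status of such futures, a small helper along these lines could be used for debug logging. It is a sketch only: `describe_future` and the `pending` dictionary in the demo are hypothetical names, and only the standard `asyncio.Future` methods (`cancelled`, `done`, `exception`, `result`) are relied on.

```python
import asyncio


def describe_future(fut):
    """Return a short, log-friendly description of a future's state."""
    if fut.cancelled():
        return "cancelled"
    if not fut.done():
        return "pending"
    exc = fut.exception()
    if exc is not None:
        return "exception: {!r}".format(exc)
    return "result: {!r}".format(fut.result())


async def demo():
    loop = asyncio.get_running_loop()
    # Hypothetical table of in-flight requests keyed by TSN.
    pending = {tsn: loop.create_future() for tsn in (1, 2, 3, 4)}
    pending[2].set_result("ok")
    pending[3].set_exception(RuntimeError("delivery failed"))
    pending[4].cancel()
    return {tsn: describe_future(fut) for tsn, fut in pending.items()}


states = asyncio.run(demo())
```

A future that stays "pending" long after the request was issued would point at a notification or reply that never arrived.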

@valexi7 would you be able to install custom zigpy/bellows which do additional debug logging?

Adminiuga commented 5 years ago

@valexi7 & @walthowd and anyone else who can consistently reproduce the issue and can install custom version of zigpy & bellows:

can you install these versions of bellows and zigpy (and any version of hass > 0.68):

and enable debug logging for the following:

logger:
  default: info
  logs:
    bellows.ezsp: debug
    bellows.zigbee.application: debug
    bellows.uart: debug
    homeassistant.components.zha: debug
    zigpy: debug
    zigpy.application: debug
    zigpy.zdo: debug

run it until it fails (or if it doesn't, send 12/24 hours of logs since hass restart) and send me the logs.

valexi7 commented 5 years ago

@Adminiuga I installed your test version this way:

pip3 install -e git://github.com/Adminiuga/bellows.git@leaking-tsn#egg=bellows
pip3 install -e git://github.com/Adminiuga/zigpy.git@dev#egg=zigpy

After restarting HA I get immediately error:

Log Details (ERROR)
Mon Dec 17 2018 12:09:12 GMT+0200 (Eastern European Standard Time)

Error setting up entry /dev/ttyUSB0 for zha
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.6/site-packages/homeassistant/config_entries.py", line 249, in async_setup
    result = await component.async_setup_entry(hass, self)
  File "/srv/homeassistant/lib/python3.6/site-packages/homeassistant/components/zha/__init__.py", line 136, in async_setup_entry
    await APPLICATION_CONTROLLER.startup(auto_form=True)
  File "/srv/homeassistant/lib/python3.6/site-packages/bellows/zigbee/application.py", line 78, in startup
    await self.initialize()
  File "/srv/homeassistant/lib/python3.6/site-packages/bellows/zigbee/application.py", line 53, in initialize
    await self._cfg(c.CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE, 2)
  File "/srv/homeassistant/lib/python3.6/site-packages/bellows/zigbee/application.py", line 136, in _cfg
    assert v[0] == t.EmberStatus.SUCCESS  # TODO: Better check
AssertionError

Then reverted by:

pip3 install -e git://github.com/Adminiuga/bellows.git#egg=bellows
pip3 install git+https://github.com/Adminiuga/zigpy/tree/master

And no errors.

I added your logger entry, in case I get any interesting errors with the default versions...

Adminiuga commented 5 years ago

What kind of Zigbee dongle are you using? I had those errors on Elelabs, which was running out of memory because I'd bumped up some config parameters.


valexi7 commented 5 years ago

@Adminiuga My dongle is Telegesis ETRX3USB-LRS https://www.semiconductorstore.com/cart/pc/viewPrd.asp?idproduct=50564

walthowd commented 5 years ago

@Adminiuga Your bellows/zigpy branch continues to run fine for me on a HUSBZB-1. I'll send you some more logs in 24 hours.