zigpy / zigpy-deconz

A library which communicates with Deconz radios for zigpy
GNU General Public License v3.0
87 stars 20 forks source link

ZHA Does not respond properly on "Restart" #37

Closed vinmar1 closed 5 years ago

vinmar1 commented 5 years ago

Last know working 0.91.4

Intel NUC / Ubuntu 18.04.2 / Docker / Hassio / HA 0.92.0

Conbee USB Native ZHA Integration current firmware 262f0500

arch x86_64
dev false
docker true
hassio true
os_name Linux
python_version 3.7.3
timezone America/New_York
version 0.92.0
virtualenv false

Zigbee issues from a HA Restart - None from a Power Up restart

Scenario 1:

After performing a restart from the Configuration Panel all looks good in the GUI, everything appears as online. 1: After waiting at least 15 minutes for all devices to reconnect the samjin motion sensors do not respond at all. 2: The Cree lights will respond to the toggles but immediately go out of sync with the GUI.

debug log of initial "Restart"

home-assistant restart.zip

These initial errors are seen on startup

2019-04-26 09:13:03 WARNING (MainThread) [homeassistant.loader] You are using a custom integration for lutron_caseta_pro which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you do experience issues with Home Assistant. 2019-04-26 09:13:13 WARNING (MainThread) [zigpy_deconz.zigbee.application] Error while sending frame: 0xe9 2019-04-26 09:13:13 WARNING (MainThread) [zigpy_deconz.zigbee.application] Unexpected transmit confirm for request id 8, Status: 0x00, 8 2019-04-26 09:13:13 WARNING (MainThread) [zigpy_deconz.zigbee.application] Error while sending frame: 0xe9 2019-04-26 09:13:23 WARNING (MainThread) [homeassistant.components.sensor] Setup of platform octoprint is taking over 10 seconds. 2019-04-26 09:13:43 WARNING (MainThread) [zigpy_deconz.zigbee.application] Failed to receive transmit confirm for request id: 8

Scenario 2:

From a Power Down and Up Restart system works well with no apparent issues.

1: Samjin motion sensors and Cree lights are immediately available and operate as expected.

debug log of Power Up Restart

home-assistant power up.zip

These initial errors are seen on startup - Note that the LUtron Custom Component Error is not there for whatever reason.

Setup of platform zha is taking over 10 seconds. 11:51 AM main.py (WARNING) Failed to receive transmit confirm for request id: 9 11:51 AM components/zha/core/helpers.py (WARNING) Failed to receive transmit confirm for request id: 8 11:51 AM components/zha/core/helpers.py (WARNING) Failed to receive transmit confirm for request id: 3 11:51 AM components/zha/core/helpers.py (WARNING) Setup of platform octoprint is taking over 10 seconds. 11:50 AM main.py (WARNING) Unexpected transmit confirm for request id 9, Status: 0x00, 9 11:50 AM main.py (WARNING) Unexpected transmit confirm for request id 8, Status: 0x00, 8 11:50 AM main.py (WARNING) Error while sending frame: 0xd0 11:50 AM components/zha/core/helpers.py (WARNING) - message first occured at 11:50 AM and shows up 3 times Error while sending frame: 0xe9 11:50 AM components/zha/core/helpers.py (WARNING) Unexpected transmit confirm for request id 3, Status: 0x00, 3 11:50 AM main.py (WARNING) Endpoint: job Failed to update OctoPrint status. Error: HTTPConnectionPool(host='192.168.1.103', port=80): Max retries exceeded with url: /api/job (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3457d745f8>: Failed to establish a new connection: [Errno 113] Host is unreachable')) 11:50 AM components/octoprint/init.py (ERROR) Endpoint: printer Failed to update OctoPrint status. Error: HTTPConnectionPool(host='192.168.1.103', port=80): Max retries exceeded with url: /api/printer (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f345b6d3198>: Failed to establish a new connection: [Errno 113] Host is unreachable')) 11:50 AM components/octoprint/init.py (ERROR)

vinmar1 commented 5 years ago

UPDATE:

Several hours after the last power down restart the Samjin motion sensors failed to respond, all 3 of them but they showed as online in the GUI. I could toggle the Cree lights on and off but the GUI toggle button would go back to the off position no matter what state the light was in.

There were no error or warning messages at all but here is the log.

home-assistant.zip

Adminiuga commented 5 years ago

I only looked at the tail of the log, last two request:

2019-04-26 18:18:22 DEBUG (MainThread) [homeassistant.components.websocket_api.http.connection.139862780433408] Received {'type': 'call_service', 'domain': 'light', 'service': 'turn_on', 'service_data': {'entity_id': 'light.bsmnt_util_w_light'}, 'id': 33}

Turn On Service Web request #33

2019-04-26 18:18:22 DEBUG (MainThread) [zigpy_deconz.api] Command aps_data_request (18, 214, 0, <DeconzAddress address_mode=2 address=57336>, 10, 260, 6, 1, b'\x01\xd6\x01', 0, 0)

Goes out as Zigbee request #214

[snip]
2019-04-26 18:18:22 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x04240013000c002ad602f8df0a010000000000
2019-04-26 18:18:22 DEBUG (MainThread) [zigpy_deconz.api] APS data confirm response for request with id 214: 00

Stick reports that request #214 was successfully sent

Web request to turn off the light id#34

2019-04-26 18:18:23 DEBUG (MainThread) [homeassistant.components.websocket_api.http.connection.139862780433408] Received {'type': 'call_service', 'domain': 'light', 'service': 'turn_off', 'service_data': {'entity_id': 'light.bsmnt_util_w_light'}, 'id': 34}

Goes out as Zigbee req #215

2019-04-26 18:18:23 DEBUG (MainThread) [zigpy_deconz.api] Command aps_data_request (18, 215, 0, <DeconzAddress address_mode=2 address=57336>, 10, 260, 6, 1, b'\x01\xd7\x00', 0, 0)
2019-04-26 18:18:23 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x04260013000c002ad702f8df0a010000000000
2019-04-26 18:18:23 DEBUG (MainThread) [zigpy_deconz.api] APS data confirm response for request with id 215: 00

Stick report that zigbee request was successfully sent again, this implies Zigbee stack received ACK from the far end device receiving our frame.

but 10s after we don't get a response, so it fails for both web requests #33 & #34

2019-04-26 18:18:32 DEBUG (MainThread) [homeassistant.components.websocket_api.http.connection.139862780433408] Sending {'id': 33, 'type': 'result', 'success': True, 'result': None}
2019-04-26 18:18:32 DEBUG (MainThread) [homeassistant.components.zha.core.channels] 0xdff8:10:0x0006: command failed: on exception: 
2019-04-26 18:18:32 DEBUG (MainThread) [homeassistant.components.zha.light] light.bsmnt_util_w_light: turned on: {'on_off': False}
2019-04-26 18:18:33 DEBUG (MainThread) [homeassistant.components.websocket_api.http.connection.139862780433408] Sending {'id': 34, 'type': 'result', 'success': True, 'result': None}
2019-04-26 18:18:33 DEBUG (MainThread) [homeassistant.components.zha.core.channels] 0xdff8:10:0x0006: command failed: off exception: 
2019-04-26 18:18:33 DEBUG (MainThread) [homeassistant.components.zha.light] light.bsmnt_util_w_light: turned off: False

Try the following:

  1. run Conbee of an USB extension cable
  2. start deconz and verify what channel it operates on. Not possible to change channels via zigbe, but should be able to change channel via deConz. Run it on channel 15 or 20
vinmar1 commented 5 years ago

I took steps to avoid WiFi interference 2 days ago. Prior to this I know there was interference as my WiFi logs had many disconnect and reconnect entries. They have all gone away since this fix. I also do not have any cordless phones in the house.

1: Conbee is on an extension cable about 1 meter long. 2: Put Conbee in my Windows PC and used Phoscon to move it to channel 25. 3: Verified Conbee stays on channel 25 after being removed from USB port and reinserted. 4: Put Conbee back into NUC and restarted. 5: Did a clean install of HA and added all devices back in. 6: Changed Router WiFi to channel 7 to avoid interference.

I used this chart to avoid interference. Also using a WiFi analyzer on my smartphone there is nothing using WiFi near me above channel 7 so there should be no interference as I have things now. I can try moving Conbee to channel 11 and Wifi to 11 but according to my WiFi analyzer there are routers on the lower bands at 100+ meters.

Also this happened again just a couple hours after the last restart again with no errors. This is the last 0.92.0 log, after the samjins froze I manipulated the lights again so the end of the log should show that.

home-assistant 2.zip

image

Adminiuga commented 5 years ago

Well, you did pretty much everything to eliminate the interference. At this point I'm out of ideas, other than use a Zigbee sniffer to confirm whether conbee gets a response or not. IMO this is a second report of the similar behavior.

Just as a shot in the dark, can you modify /config/deps/lib/python3.7/site-packages/zigpy_deconz/zigbee/application.py file line 84 and change timeout=10 to timeout=30 ?

vinmar1 commented 5 years ago

Sudden thought this morning while having my coffee and staring at my OSRAM Lightify hub that I have no control over. When it powers up it is supposed to auto find the channel with the least interference but since making all the changes to the Conbee and WiFi I have not reset it. I pulled it for 30 minutes and plugged it back in. I used another outlet and was able to get more distance from the Conbee. Before I make the suggested changes to the timeout I'll see if moving it helps. Eventually I want to get rid of the OSRAM but have been hesitant due to the issues. Why do I have a feeling I overlooked this and it is the problem.... uuggghhh! If I get get this to be stable the OSRAM hub will be history.

I also have a RPI 3B+ and a HUSBZB-1 doing nothing so I think I can get a Zigbee sniffer together probably with Wireshark.

Sorry to make you all jump through hoops on this but if you see nothing wrong in the code and if the initial startup warnings don't indicate anything then at least I know I have to chase after the environment.

vinmar1 commented 5 years ago

Had another instance of the same symptom... I just ran the upgrade to 0.92.1 anyway.

I looked for /config/deps/lib/python3.7/site-packages/zigpy_deconz/zigbee/application.py file line 84 and change timeout=10 to timeout=30 ?

Tried through SSH, Logged into ubuntu as root and used the file search, tried a term window as root in ubuntu even tried searching from visual code studio... I'm not even finding python3.7... how on earth do I get to that file?

Adminiuga commented 5 years ago

ah, /config/ folder is how it is seen in docker container, but not sure where it is when you are running hassio Do you have deps directory in your home assistant configuration directory on Ubuntu?

EDIT: Uh, I have a suspicion it shows up in the /config/deps folder only if you exec pip install zigpy-deconz on the container, cause I see it on my plain HA container, but not on HassOS

vinmar1 commented 5 years ago

I have /config/deps but it is empty

Adminiuga commented 5 years ago

Yep, it was in my older homeassistant Dockers, but not in the latest image I pulled. Just wanted to bump the reply timeout to make sure we're not missing any from the end devices.although there's little chance it being the root cause.

vinmar1 commented 5 years ago

I have a suspicion something is still causing interference... I'm working on a Zigbee sniffer using a Pi and Husbzb-1. Hopefully I'll get it running so I can see if there is something I am missing.

Adminiuga commented 5 years ago

Actually if you could get a Zigbee packet capture, that would be great in confirming whether we're getting the actual response or not. Just need to reproduce the scenario in https://github.com/zigpy/zigpy-deconz/issues/31#issuecomment-487241299 while capturing the zigbee traffic. If there's no reply packet, then move the HUSBZB-1 sniffer closer to the light to confirm if the light sends the packet.

vinmar1 commented 5 years ago

I'm working on the sniffer there is a bit of a learning curve getting it running on the Pi. Funny thing that I have noticed... all has been running just fine for 24 hours now. Every restart I see different errors and sometimes the system just seems to run fine. Could there be something that does or does not happen when the system first comes up that could be causing problems. This is the current logfile that has been running for well over 24 hours.

home-assistant.zip

UPDATE: Well it finally went belly up again. I'm down to the Conbee and Cree lights just do not like each other at all. I removed the Cree lights from the Conbee and added them to the Lightify hub and I get zero ZHA errors or warnings at startup or any other time so far. There has to be some sort of timing issue between them. Funny thing is the Cree's do work with the Conbee running deConz or at least appear to. My goal was to get rid of the Lightify hub completely but now it looks like the Conbee has to be replaced or I just keep both because the Samjin's will not work on the Lightify hub and probably never will and I already have 3 of them.

I'm not going to try and sort out the issues between Cree and Conbee... closing this ticket if there is more problems I'll start a new one but I bet there isn't.

vinmar1 commented 5 years ago

Just an FYI - Got the sniffer running wasn't too hard at all. Did a write up on it...

https://community.home-assistant.io/t/zigbee-sniffer-with-wireshark-on-that-spare-pi-how-to-build-one/114086

Adminiuga commented 5 years ago

BTW if the issue happens after a restart, then check https://github.com/home-assistant/home-assistant/issues/23719 which could be related

vinmar1 commented 5 years ago

Since I removed the Cree Connect lights from the Conbee and put them on my OSRAM hub this whole thing has been running flawlessly.