zwave-js / node-zwave-js

Z-Wave driver written entirely in JavaScript/TypeScript
https://zwave-js.github.io/node-zwave-js/
MIT License

All send attempts fail after aborting SendData with timed out callback #6260

Closed AlCalzone closed 1 year ago

AlCalzone commented 1 year ago

https://gist.github.com/robarnold/c94d61f91fa4c78002e87e4805c4ca95 has a log that shows this. We should consider soft-resetting the controller when this happens for a longer stretch of time.

Qwerty1979Swe commented 1 year ago

I have this exact problem. I have been struggling for days trying to solve this. My zwave-js crashes multiple times a day, so I really can't use it. I have tried changing from a PC to an RPi 4, a new Z-Stick, long and short USB cables, and a powered USB hub. My log files look exactly like this one. Do you know of a fix, or do I need to restart zwave-js multiple times a day?

Many thanks, Jonas

AlCalzone commented 1 year ago

@Qwerty1979Swe can you share your log? On second look the controller in the log above actually failed to send a callback before the infinite loop.

AlCalzone commented 1 year ago

Another log: https://github.com/home-assistant/core/files/12593063/zwavejs_current.log Again, the controller got stuck and failed to send a callback. After that, all send attempts fail immediately.

wljohnson05 commented 1 year ago

This could be the same issue I've had since the beginning of August as well.

wljohnson05 commented 1 year ago

I just wanted to update this thread with my specs since I'm still having the problem...

Home Assistant is running in a VM on UnRAID version 6.12.4
Home Assistant 2023.9.2
Supervisor 2023.09.2
Operating System 10.5
Frontend 20230911.0 - latest

zwave-js-ui: 8.25.1
zwave-js: 11.14.2

Aeotec Z-Stick 7 - firmware version 7.19.3

AlCalzone commented 1 year ago

I have a plan on how to work around this in node-zwave-js, but I'm not sure it will work first try. Also it's entirely possible that the 7.19.3 firmware is bugged.

wljohnson05 commented 1 year ago

I'm up for testing anything I need to at this point since mine has been broken since at least the beginning of August.

The same bug would also have to exist in older firmware. I had the issue on my original Z-Stick, which was running a different version (honestly can't remember the exact numbers at this point) when this started. I upgraded the firmware on it at that point during some of my troubleshooting. Then when I first got this stick to replace it, thinking maybe the stick itself was failing, it was running another older version as well. Then I upgraded it to the current version available and still have the issue. There's a possibility that there's a firmware bug, but I hadn't upgraded the old Z-Stick in probably years when this issue first popped up for me.

I do have one question. Is it possible to back up the z-stick and restore to a different brand stick? Like if I picked up another 700 series stick, could I restore to that? That might be a last ditch effort if we can't find a fix. I'd rather not buy yet another stick, but if there is no software fix, I'd probably try more physical troubleshooting.

AlCalzone commented 1 year ago

Yeah. With zwave-js you can even go back to 500 series, as long as the SDK version of the target firmware is 6.61 or higher. Use NVM backup and restore in zwave-js ui.
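
As a rough illustration of that version rule, here is a minimal sketch of a check you could run before attempting a restore. The helper names (`parseVersion`, `compareVersions`, `canRestoreNvmTo`) are made up for this example and are not part of zwave-js; only the 6.61 threshold comes from the comment above.

```typescript
// Hypothetical helpers for illustration only; none of these exist in zwave-js.

// Split a dotted version like "7.19.3" into numeric parts.
function parseVersion(version: string): number[] {
  return version.split(".").map((part) => {
    const n = Number.parseInt(part, 10);
    if (Number.isNaN(n)) throw new Error(`Invalid version: ${version}`);
    return n;
  });
}

// Returns -1, 0 or 1, comparing part by part (missing parts count as 0).
function compareVersions(a: string, b: string): number {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return Math.sign(diff);
  }
  return 0;
}

// Minimum SDK version of the target firmware, as stated in the comment above.
const MIN_RESTORE_SDK = "6.61";

function canRestoreNvmTo(targetSdkVersion: string): boolean {
  return compareVersions(targetSdkVersion, MIN_RESTORE_SDK) >= 0;
}
```

Note that the parts are compared as numbers, so a target on SDK 6.7 would be treated as older than 6.61.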

bubbzy commented 1 year ago

@wljohnson05 I have the exact same problem and the exact same setup and versions. Tried updating the stick to firmware 7.19.4 today but that didn't fix the problem...

Have you guys made any progress?

bubbzy commented 1 year ago

There seems to be a version 7.20.1.0 available, but according to the dates, that version was released before the one marked as "latest", which is 7.18.8.0.

I don't get it. Worth trying?

https://github.com/SiliconLabs/gecko_sdk/releases

wljohnson05 commented 1 year ago

I just stuck with the newest guide on the Aeotec site that I could find. That had 7.19.3 on it. https://aeotec.freshdesk.com/support/solutions/articles/6000263744-update-z-stick-7-with-z-wavejs-ui

I don't mind trying the newer version on there though, if something pops up with a higher version number. What's the worst that could happen... it already doesn't work.

bubbzy commented 1 year ago

True, but are we sure the problem is in the stick's firmware?

I would gladly buy another stick and back up and move if I knew it would work. Is there a better stick out there without this problem?

Or is it just a matter of waiting for Z-Wave JS UI to update? I've been having trouble for several weeks now and I'm really sick of it😃

wljohnson05 commented 1 year ago

I don't think anyone knows. That was just a potential issue. Mine has been acting up since before August. I have z-wave for every switch in my house with tons of automation built into double/triple taps that is basically unusable now after an hour or so. WAF has basically gone to zero on that. I definitely feel your pain on it not working. My z-stick is actually a new one that I bought while doing my initial troubleshooting (same as my original stick). Yesterday I went ahead and ordered the zooz stick that I'm going to try as well. At least if I'm still having the issues, we'll have more evidence towards it being a software issue with the plugin instead of the stick. Then I'll just have to figure out how to use the extra ones somewhere else.

bubbzy commented 1 year ago

I have had it about the same, but I didn't realize it because I rebuilt my Z-Wave network from the beginning with a new stick. I was having some problems with my old one, but apparently I walked right into bigger trouble...

Please tell me right away if the new stick works for you and if the backup/restore process from one stick to another works or gives you pain...

Now that the other issue similar to this one has been closed, is there no work being done to solve this?

wljohnson05 commented 1 year ago

Will do. The only one I've been in any correspondence with is @AlCalzone, but I don't know if he's on the zwave js UI side or the zwave js side of development. He said it should be fine to restore to a different 700 series stick.

EDIT: Looks like AlCalzone is a zwave js developer, so hopefully he can find something.

bubbzy commented 1 year ago

Ok thanks! To get closer quicker, do you want me to test with yet another stick?

You went for the Zooz. I don't really know which sticks are available except these two?

The only positive with this problem is that my network has never been faster (when it works). All the troubleshooting has led to perfecting everything... too bad it crashes all the time😂😂

wljohnson05 commented 1 year ago

I'd say probably just wait. If a different stick shows the same issue, I'd say that proves more that it's a plugin related issue and not the sticks. If the zooz stick works fine, then we'll know it's a firmware issue that aeotec will have to fix.

Looks like my zooz stick will be here Wednesday, so I'll be testing it then once it arrives.

wljohnson05 commented 1 year ago

The Zooz stick showed up today. I got my backup restored to it and everything has been up and running for about 80 minutes. We'll see how it is later tonight after it's been up and running for a few hours.

wljohnson05 commented 1 year ago

Well...300 minutes of home assistant being up, z-wave has stopped working again. I'm seeing the same jammed messages in the log, so I'd say we've proved that two different sticks are running into the same symptoms. @AlCalzone, did you find a fix in the code?

EDIT: Attached current log

zwave-js-ui-store.zip

wljohnson05 commented 1 year ago

As the last thing I can try, I've stood up a bare metal Home Assistant server and restored a backup to it. I'm going to run that for the day and see if I have the same issues.

AlCalzone commented 1 year ago

@wljohnson05 is this also on firmware 7.19.3? So far all reports of this that I've seen were on that version.

I was on vacation the last 7 days, will try to implement a workaround soon.

wljohnson05 commented 1 year ago

This stick is running 7.19.2. I looked, but I couldn't find a version of 7.19.3 for that stick. Closest thing I could see to the other is a file with "7.18.3" in the name. Not sure if that's a typo or what, so I didn't try to push that. https://www.support.getzooz.com/kb/article/931-how-to-perform-an-ota-firmware-update-on-your-zst10-700-z-wave-stick/

AlCalzone commented 1 year ago

> This stick is running 7.19.2

Oh man, then the entire 7.19 release line is just fucked.

wljohnson05 commented 1 year ago

Oh so you're thinking that was intentional, and that 7.19 has a major bug that's so bad that they are rolling back to 7.18?

AlCalzone commented 1 year ago

Not sure, but I don't think Zooz ever officially recommended 7.19

wljohnson05 commented 1 year ago

That's what they shipped it out with, so you wouldn't think it would be too bad, unless it's something they uncovered after putting out that supply of sticks.

wljohnson05 commented 1 year ago

@AlCalzone, the new physical HA instance I stood up is having the same issue now as well, so I'm almost positive it's a software issue. Whether that's the plugin or the firmware used on these sticks, I don't know. I've run several firmware versions over the course of this troubleshooting, so it could be that the entire 7.19 line is screwy for Z-Wave in general (I can't remember if I ran any 7.18 or not), or that a bad bug has popped up in the zwave-js codebase. Anything else I can do to test/help? I'm not much help with JavaScript; I work mainly in Perl and Python, so I can't help you with the zwave-js stuff.

AlCalzone commented 1 year ago

I have implemented a possible workaround. Unfortunately it depends on changes I made for the upcoming 12.x release, which are too big to backport, so it might be 1-2 weeks until it comes into effect.

wljohnson05 commented 1 year ago

2 weeks is better than never. Thanks for working on it.

So, does this pretty effectively fix the issue from what you can see?

AlCalzone commented 1 year ago

I tried to reproduce the behavior from the logs in a test case, but I don't know if the affected controllers actually behave the same (return to normal behavior after soft-reset), especially since the expected behavior after recovery isn't included in the logs. So there's a chance this doesn't work.

Hard to test without being able to reproduce reliably.

wljohnson05 commented 1 year ago

So it will essentially act the same way as if I go in and do a soft reset when mine stops working? Do you need logs of that whole process of me doing that?

I guess the only downside to a soft reset is that it will take the z-wave network down for a minute or two while it loads everything back up, but I guess if it's already unresponsive at that point, it will at least get it going again.

Any idea what is causing it to lock up in the first place all of a sudden recently? I mean my network worked fine before all this started. I hadn't updated my stick firmware for probably years. So I tend to think there was an update somewhere in either zwavejs or home assistant that caused this to start happening in the first place.

AlCalzone commented 1 year ago

> So it will essentially act the same way as if I go in and do a soft reset when mine stops working? Do you need logs of that whole process of me doing that?

Would be nice.

> I guess the only downside to a soft reset is that it will take the z-wave network down for a minute or two

Normally, this should take on the order of a few seconds. Soft-reset just restarts the stick, not the entire stack. There's a command for this in Z-Wave JS UI in the advanced panel.

> I hadn't updated my stick firmware for probably years

Unlikely, since 7.19 came out in December 2022: https://github.com/SiliconLabs/gecko_sdk/releases/tag/v4.2.0 and the release that likely first broke this came out in March 2023: https://github.com/SiliconLabs/gecko_sdk/releases/tag/v4.2.2

> there was an update somewhere in either zwavejs [...]

I recently added detection for a jammed stick with retries. Before that, the node would just be marked dead when in reality the controller was the issue. What you're seeing now is an unfortunate infinite loop because transmitting is retried when the controller is considered jammed. The assumption here is that it eventually stops being jammed / is able to transmit again, but in this case the controller is just in a state it never recovers from on its own.
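
To make the loop described above concrete, here is a minimal sketch of one way to bound the retries and fall back to a soft reset. This is not the actual zwave-js code: the `JamRecovery` class, the result and action strings, and the retry limit are all assumptions for illustration.

```typescript
// Hypothetical sketch; none of these names come from the zwave-js codebase.
type SendResult = "ok" | "jammed";
type Action = "none" | "retry" | "soft-reset";

class JamRecovery {
  private consecutiveJams = 0;
  private readonly maxJammedRetries: number;

  constructor(maxJammedRetries: number) {
    this.maxJammedRetries = maxJammedRetries;
  }

  // Feed in the outcome of each send attempt and get back what to do next.
  next(result: SendResult): Action {
    if (result === "ok") {
      // The controller transmitted again, so forget earlier jams.
      this.consecutiveJams = 0;
      return "none";
    }
    this.consecutiveJams++;
    if (this.consecutiveJams > this.maxJammedRetries) {
      // Retrying is not helping: ask for a soft reset and start counting afresh.
      this.consecutiveJams = 0;
      return "soft-reset";
    }
    return "retry";
  }
}
```

With a limit of 2, the third consecutive jammed result yields `"soft-reset"` instead of another retry, so a controller that never recovers on its own no longer causes endless retries.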

wljohnson05 commented 1 year ago

I just got debug logging to a file set back up on it, and next time it acts up, I'll try just doing a soft reset from the GUI and see how that reacts and logs.

I hadn't updated that original stick I had until after I ran into issues. I honestly don't remember what it was on before that though. I hadn't had any reason to update it probably since I bought it when I moved to home assistant from homeseer running a z-net device (probably the best move I decided to do in the home automation world). That hasn't been years, but probably at least early last year I believe.

That makes sense. I had actually been running an automation that I found that kept track of any dead nodes and pings them if that number of devices changes, so I guess that's why I never really had an issue before.
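
For reference, the idea behind that kind of automation can be sketched roughly like this. The `NodeInfo` shape and the `DeadNodeWatcher` class are hypothetical stand-ins, not the actual automation or any zwave-js API; a real setup would read node status from Home Assistant and issue the pings there.

```typescript
// Hypothetical shapes for illustration only.
interface NodeInfo {
  id: number;
  dead: boolean;
}

class DeadNodeWatcher {
  private lastDeadCount = 0;

  // Returns the IDs to ping: every dead node, but only when the number of
  // dead nodes differs from the previous check.
  check(nodes: NodeInfo[]): number[] {
    const deadIds = nodes.filter((n) => n.dead).map((n) => n.id);
    const changed = deadIds.length !== this.lastDeadCount;
    this.lastDeadCount = deadIds.length;
    return changed ? deadIds : [];
  }
}
```

Pinging only on a change in the count keeps the automation from hammering the network while the same nodes stay dead.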

robarnold commented 1 year ago

Should I be able to manually soft reset for now on 11.x when this occurs? Because that hasn't worked for me - I can leave the stick plugged in and restart ZUI to get it to start working again.

asayler commented 1 year ago

I recently updated my Zooz ZAC93 controller from FW 1.0 (SDK 7.18.?, I think) to 1.2 (SDK 7.19.3) and I believe I've begun to see this same issue. I opened #6300 before I saw this issue and have attached logs and other details there. Zooz recently released the 1.2 (7.19.3) firmware at https://www.support.getzooz.com/kb/article/1158-zooz-ota-firmware-files/, so I expect a number of folks will be upgrading and hitting this issue. I can try to roll back to 1.1 (7.18.3) to see if that fixes it.

AlCalzone commented 1 year ago

> I can try to roll back to 1.1

Sadly, you can't.

AlCalzone commented 1 year ago

> Should I be able to manually soft reset for now on 11.x when this occurs?

I'm not sure if Z-Wave JS actually performs the soft-reset in that case, since it is still busy trying to get the other command done.

wljohnson05 commented 1 year ago

@AlCalzone, I'm seeing it work with the soft reset using the UI. It takes a little bit of time for it to actually reset, but eventually it came back to working, and any commands I had put in caught up. It should be near the bottom of the log since I just did the fix this morning.

zwavejs_current.zip

I will say that it seems like the controller goes back to being unable to transmit pretty quickly though. When I restart the z-wave js UI plugin it usually works for several hours before seeing it have issues. The soft reset doesn't seem to clear that up as well (at least doing it manually).

AlCalzone commented 1 year ago

Thanks for the test. Looks like the soft-reset helped for about an hour, so the workaround will at least do something.

wljohnson05 commented 1 year ago

Yeah, and when the code is handling the resets, it might not even be noticeable that it's happening since that would catch it before a human would most likely.

txwindsurfer commented 1 year ago

Apologies if this isn't helpful but I am not having any of these issues on a 36-node network controlled by a Zooz ZST39 running firmware 1.20 SDK 7.19.3. HA 2023.9.2, Supervisor 10.5 on a Home Assistant Blue. Also, no problems on a 23-node network controlled by a Zooz ZAC93 (same firmware/SDK) on a Home Assistant Yellow. Both locations running Z-Wave-JS UI. Pretty much rock solid for over 2 months in both locations. My experience would indicate that this is not an issue with 7.19.3 on Zooz 800 sticks.

asayler commented 1 year ago

@txwindsurfer I'm using a Zooz ZAC93 (which is the serial version of your ZST39) with a 95-node network and the issue is definitely present. It started as soon as I upgraded to the 1.20 (7.19.3) firmware from the stock 1.0 firmware. So the issue does impact Zooz 800-series sticks. There's likely some other variable at play here (perhaps certain types of messages being sent across the network) that triggers the controller to enter this state, and that must not occur in your setup. But I don't think it's because the Zooz 800 gear is immune to the problem.

asayler commented 1 year ago

I don't suppose there's a way to trigger the soft reset from within HA that anyone knows of? I have a script that currently power cycles my HA Yellow when this starts to occur, but it would be faster to just soft reset the controller as a stopgap if there's a way to do that from within HA.

asayler commented 1 year ago

Also, has anyone reported the issue to Zooz?

millercentral commented 1 year ago

FWIW, I have this issue with an Aeotec Z-Stick 700 on 7.19.2, using Zwave JS addon in HA. The stick had been flashed to 7.19 last December and the network had been stable until early August when this issue was first noticed. I rolled back the HAOS Zwave JS addon to 0.186 (which I think uses ZWave JS 11.9.2) and haven't seemed to have the issue since.

AlCalzone commented 1 year ago

> Also, has anyone reported the issue to Zooz?

Yep

madbrain76 commented 1 year ago

I'm having this problem on a ZST39 800LR with FW: v1.20, SDK: v7.19.3. I can trigger the condition at will by trying to flash my most distant ZEN76 800LR switch. The flashing never completes, and the stick ends up in a bad state.

AlCalzone commented 1 year ago

See my comment here, Z-Wave JS v12 includes a workaround, HA and Z-Wave JS UI will pick this version up very soon.

The actual fix needs to happen in the controller firmware - I'm told Silabs are on it and have it as a top priority.

If this continues to be an issue after updating, please open a new issue.

jamiepenney commented 1 year ago

@AlCalzone not sure if this is the same issue I'm having? https://github.com/home-assistant/addons/issues/3230 I've got an Aeotec Stick Gen5 though. Kinda feels like a borked firmware update to me, but I'm running HA OS and ZWaveJS from the add-on store so I don't know how to get into the guts of it. Any idea how I check which version of the firmware I'm running?

wljohnson05 commented 1 year ago

@AlCalzone, sorry to report... This didn't fix the issue for me. It takes about the same amount of time before everything just stops getting commands. I've attached the log, which covers from when I upgraded zwave js ui earlier today up through a few minutes ago. Looking at the log, there are times where it said it was jammed, and then no longer jammed. I'm not sure if that is your workaround doing its job or what, but eventually every command becomes a fail. Even manually triggering a soft reset in the GUI doesn't fix it.

zwavejs_current.zip