xoseperez / espurna

Home automation firmware for ESP8266-based devices
http://tinkerman.cat
GNU General Public License v3.0

All espurna nodes restarting at the same time #1236

Closed djwmarcx closed 3 years ago

djwmarcx commented 5 years ago

I recently started to deploy some espurna nodes (some Sonoffs and some NodeMCUs) around my house. Two days ago I noticed that one of them was restarting 2 or 3 times per day, with no apparent pattern.

I updated the node to the latest version (1.13.3a) and saved the ELF file to parse stack traces returned by the crash command.

Just 2 hours ago, while checking all the nodes, I realized that all of them (on the same network) are crashing at the same time, with basically the same error.

Here is an example crash dump for an itead-sonoff-basic-ota (attached as dump.txt).

The dump resolves to the following (incomplete?) stack trace, which points to a function called ieee80211_parse_beacon as the root cause:

`Exception Cause: 3 [LoadStoreError: Processor internal physical address or data error during load or store]`

`0x4023e36f: ieee80211_parse_beacon at ??:?
0x40000000: ?? ??:0
0x40100eb2: pp_post at ??:?
0x402412b7: scan_profile_check at ??:?
0x402416c9: scan_parse_beacon at ??:?
0x40241715: scan_parse_beacon at ??:?
0x40102196: wDev_ProcessFiq at ??:?
0x4023de19: ieee80211_parse_beacon at ??:?
0x40104ac0: ets_timer_arm_new at ??:?
0x40242474: ieee80211_parse_wmeparams at ??:?
0x40000f58: ?? ??:0
0x40241f3f: sta_input at ??:?
0x4022819f: pp_tx_idle_timeout at ??:?
0x40227acb: ppPeocessRxPktHdr at ??:?
0x40224c4f: loop_task at /home/marcx/.platformio/packages/framework-arduinoespressif8266@1.20300.1/cores/esp8266/core_esp8266_main.cpp:130
0x40000f49: ?? ??:0
0x40000f49: ?? ??:0`

I guess that something weird is happening in my network and is causing all nodes to crash.

BTW, there are no problems with other network devices like computers or mobile phones.

mcspr commented 5 years ago

Hello.

I can't see any espurna-specific symbols in the log. Could it be purely SDK-related?

0x40224c4f: loop_task at /home/marcx/.platformio/packages/framework-arduinoespressif8266@1.20300.1/cores/esp8266/core_esp8266_main.cpp:130

The newest one is Core 2.4.2 / SDK 2.2.1, which is worth a try. The matching platform is espressif8266@1.8.0.
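The observation that only one frame resolves to a source file can be checked mechanically. The following is a minimal sketch, not part of espurna or its tooling: the regular expression and the function names `parse_frames` and `frames_with_source` are made up for illustration, and the sample input is three frames copied from the trace posted in this issue.

```python
import re

# Matches one resolved frame such as "0x40224c4f: loop_task at /path/file.cpp:130"
# or an unresolved one such as "0x40000000: ?? ??:0".
FRAME_RE = re.compile(r"(0x[0-9a-fA-F]{8}): (\S+) (?:at )?(\S+)")


def parse_frames(trace):
    """Return (address, symbol, location) tuples found in a resolved trace."""
    return FRAME_RE.findall(trace)


def frames_with_source(frames):
    """Keep only frames whose location is a real path, not '??:?' or '??:0'."""
    return [frame for frame in frames if not frame[2].startswith("??")]


# Three frames copied from the trace in this issue.
TRACE = (
    "0x4023e36f: ieee80211_parse_beacon at ??:? "
    "0x40000000: ?? ??:0 "
    "0x40224c4f: loop_task at /home/marcx/.platformio/packages/"
    "framework-arduinoespressif8266@1.20300.1/cores/esp8266/core_esp8266_main.cpp:130"
)

frames = parse_frames(TRACE)
known = frames_with_source(frames)
for address, symbol, location in known:
    print(address, symbol, location)
```

On this sample the only frame with a source location is `loop_task` in the Arduino Core, which matches the reading that nothing in the trace points at espurna code.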

djwmarcx commented 5 years ago

I've updated a single node to the latest platform.

Let's see...

djwmarcx commented 5 years ago

After switching two nodes to platform espressif8266@1.8.0, they are failing to get static IPs. In my router I can see that the nodes are actually connected to the Wi-Fi, but without an IP. I managed to assign IPs to the nodes using DHCP leases on the router and everything looks fine there, but static IP is definitely not working.

Since this error is outside the scope of the issue, I'll switch to platform_173 and so on until I get a result.

djwmarcx commented 5 years ago

After experiencing the same IP assignment problem with platform 1.7.3, I returned to 1.8.0, but this time erasing the config. It now seems to be working properly (at least the static IP problem is gone).

So now it's time to wait and see whether the updated nodes work properly, without unexpected resets.

I'll be back soon with the results.

djwmarcx commented 5 years ago

Update: all my nodes, including the updated ones, are still restarting en masse at the same time.

My router is definitely doing something weird. This Monday I'll replace it with a Mikrotik one. Let's see if the problem disappears.

In the meantime, any ideas?

BTW: the updated nodes are also losing their settings, which is worse :/

mcspr commented 5 years ago

I'd also try to record the crash via a serial connection, to bypass the 'crash' recorder. Maybe something is missing from the saved dump.

One recent change since the 1.13.2 release is the Fauxmo library, so I'd try setting the 'ALEXA_ENABLED' define / 'alexaEnabled' setting to 0 to rule it out.
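For the serial capture suggested above, a small logger that timestamps every line makes it easier to line up crashes across nodes. This is only a sketch under assumptions: `log_serial` is a hypothetical helper, the input is a stand-in byte stream with placeholder text so the code runs without hardware, and using pyserial with a particular port and baud rate is a guess rather than anything stated in the thread.

```python
import io
from datetime import datetime


def log_serial(stream, out, clock=datetime.now):
    """Copy lines from a binary stream to a text sink, prefixing a timestamp."""
    count = 0
    for raw in stream:
        # Boot and crash output may contain non-UTF-8 bytes; replace them.
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        out.write(f"{clock().isoformat(timespec='seconds')} {text}\n")
        count += 1
    return count


# Stand-in for a serial port, filled with placeholder text (not real output).
# With pyserial (an assumption), the stream could instead be something like
# serial.Serial("/dev/ttyUSB0", 115200); both values are guesses.
fake_port = io.BytesIO(b"line one\r\nline two\r\n")
sink = io.StringIO()
fixed_clock = lambda: datetime(2018, 10, 1, 12, 0, 0)  # fixed time for the demo
lines_logged = log_serial(fake_port, sink, clock=fixed_clock)
print(sink.getvalue(), end="")
```

Writing to a file opened in append mode instead of `sink` would keep the log across reconnects.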

djwmarcx commented 5 years ago

Some of my crashing nodes already have ALEXA disabled, so that is definitely not the problem.

I still need to record the serial output as you suggested.

I'll be back when done.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

dote78 commented 5 years ago

The same thing has been happening to me for a while now (at least since I noticed; it may have been happening before that). I have a Sonoff Touch and two Sonoff T1 EU 2-gang devices that reset simultaneously as well. They have been doing that for several firmware versions. Even now I have the Touch on 1.13.3 and the T1s on 1.13.0.

I also have 2 other Sonoffs, both TH models, that also reset simultaneously, but apparently on their own different schedule (one of them loses signal from time to time because it is too far away, so I can't check right now).

I can run some tests if anyone comes up with an idea...
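One way to confirm which devices really reset together, and which follow a separate schedule, is to group restart times by proximity. The sketch below assumes restart timestamps have already been collected somehow (the thread does not say how); `group_restarts`, the 60-second window, the node names, and the sample times are all hypothetical and are not measurements from this issue.

```python
def group_restarts(events, window=60):
    """Group (timestamp, node) events that fall within `window` seconds
    of the first event in the current group."""
    groups = []
    for ts, node in sorted(events):
        if groups and ts - groups[-1][0] <= window:
            groups[-1][1].add(node)
        else:
            groups.append((ts, {node}))
    return groups


# Hypothetical restart times in seconds; illustrative values only.
events = [
    (1000, "touch"), (1005, "t1-a"), (1010, "t1-b"),
    (5000, "th-1"), (5020, "th-2"),
]
groups = group_restarts(events)
for start, nodes in groups:
    print(start, sorted(nodes))
```

Devices that repeatedly land in the same group probably share a trigger, for example the same access point or the same received frame.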

mcspr commented 5 years ago

This looks like a similar problem to #1281, where the author pinpoints version 1.12.6 as the last one working. I'd try that first on at least one device, preferably running esptool.py erase_flash beforehand.
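The erase step mentioned above can be scripted so that it always runs before reflashing. This is a minimal sketch: `erase_flash_command` is a hypothetical helper, and the serial port is a placeholder to be replaced with the adapter's real device.

```python
import shlex


def erase_flash_command(port):
    """Build the argv for wiping the whole flash before reflashing."""
    return ["esptool.py", "--port", port, "erase_flash"]


# "/dev/ttyUSB0" is a placeholder port, not a value taken from the thread.
cmd = erase_flash_command("/dev/ttyUSB0")
printable = shlex.join(cmd)
print(printable)
# To actually run it (requires esptool and a connected device):
#   import subprocess; subprocess.run(cmd, check=True)
```

Erasing the flash also wipes stored settings, so the device has to be reconfigured afterwards.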

djwmarcx commented 5 years ago

In my case, the problem seems to have disappeared after moving from my ISP's router to a Mikrotik one. I guess my ISP's router was doing something weird with the Wi-Fi that espurna or the esp8266 Arduino stack doesn't like. No firmware updates were made around the time of the router change.

mcspr commented 4 years ago

Per https://github.com/esp8266/Arduino/releases/tag/2.6.0 changelog, I'd expect this to be fixed when using that Core version and up. There are some new options for WiFi SDK (NONOSDK22x_191024, NONOSDK22x_191105 etc., see tools/platformio-build.py script used by the Core) and there were a lot of reports that those fix a lot of similar issues. One of those or both should work.

This is still a bug for us though.

mcspr commented 3 years ago

Closing via #2333