xoseperez / espurna

Home automation firmware for ESP8266-based devices
http://tinkerman.cat
GNU General Public License v3.0
2.99k stars 636 forks source link

S31 1.13.5 update - config lost on Web OTA, unstable #1595

Open AlphaDE opened 5 years ago

AlphaDE commented 5 years ago

Bug description After updating my S31 wall plug from the working 1.13.3 to 1.13.5 via web update, I needed to reconfigure the plug and it was not stable. During configuration when setting up the intended wifi, the plug fall back to "no configuration". After enabling web api and accessing the plug with Android based automation manager via REST api, the "on/off" switch of the web interface did not work anymore. The access via REST failed, too.

Steps to reproduce With 1.13.3 running, update with precompiled 1.13.5 to 1.13.5.

Expected behavior Keep configuration and work as before.

Device information [718580] [MAIN] ESPURNA 1.13.3 (e2641b9) [718581] [MAIN] xose.perez@gmail.com [718582] [MAIN] http://tinkerman.cat

[718583] [MAIN] CPU chip ID: 0xEAB2BF [718584] [MAIN] CPU frequency: 80 MHz [718585] [MAIN] SDK version: 1.5.3(aec24ac9) [718586] [MAIN] Core version: 2.3.0 [718587] [MAIN] Core revision: 159542381 [718588] [718589] [MAIN] Flash chip ID: 0x1640EF [718592] [MAIN] Flash speed: 40000000 Hz [718593] [MAIN] Flash mode: DOUT [718593] [718595] [MAIN] Flash size (CHIP) : 4194304 bytes / 1024 sectors ( 0 to 1023) [718597] [MAIN] Flash size (SDK) : 1048576 bytes / 256 sectors ( 0 to 255) [718598] [MAIN] Reserved : 4096 bytes / 1 sectors ( 0 to 0) [718599] [MAIN] Firmware size : 480064 bytes / 118 sectors ( 1 to 118) [718601] [MAIN] Max OTA size : 540672 bytes / 132 sectors ( 119 to 250) [718602] [MAIN] EEPROM size : 4096 bytes / 1 sectors ( 251 to 251) [718603] [MAIN] Reserved : 16384 bytes / 4 sectors ( 252 to 255) [718604] [718605] [MAIN] EEPROM sectors: 251, 250 [718606] [MAIN] EEPROM current: 250 [718607] [718608] [MAIN] EEPROM: 4096 bytes initially | 1218 bytes used (29%) | 2878 bytes free (70%) [718610] [MAIN] Heap : 35760 bytes initially | 20104 bytes used (56%) | 15656 bytes free (43%) [718612] [MAIN] Stack : 4096 bytes initially | 1136 bytes used (27%) | 2960 bytes free (72%) [718613] [718614] [MAIN] Boot version: 31 [718615] [MAIN] Boot mode: 1 [718616] [MAIN] Last reset reason: Reboot from web interface [718617] [718619] [MAIN] Board: ITEAD_SONOFF_S31 [718620] [MAIN] Support: ALEXA API BROKER BUTTON DEBUG_TELNET DEBUG_WEB DOMOTICZ HOMEASSISTANT LED MDNS_SERVER MQTT NTP SCHEDULER SENSOR TELNET TERMINAL THINGSPEAK WEB [718624] [MAIN] Sensors: CSE7766 [718626] [MAIN] WebUI image: SENSOR [718627] [718628] [MAIN] Firmware MD5: 428b2ff1ab903548d7179893108fcc1f [718630] [MAIN] Power: 3182 mV [718631] [MAIN] Power saving delay value: 10 ms [718632]

Tools used Win 10 Pro. Firefox 65.0

mcspr commented 5 years ago

Just to recap: after update to the 1.13.5 it what way plug was unstable? did you have mqtt enabled? can you also try 1.13.5 binary again and set loopDelay 10 at the Debug panel terminal and then reboot the device.

AlphaDE commented 5 years ago

After web update from 1.13.3 to 1.13.5 with the binary from the website, the complete configuration was lost and I had to reconfigure the S31. During the reconfigure, while setting up the wifi, the complete configuration was again lost and I had to do that again. After finishing the configuration and enabling REST API access for Android based Automation Manager (which I use to control all my devices), after AM tried to access the S31, that access failed and the device did not react to the on/off switch in the device web interface anymore. That means, the switch on the web page changed the status, but the device remained unchanged.

After reverting to 1.13.3, that problem disappeared.

MQTT is not enabled.

I try tomorrow with my spare device.

AlphaDE commented 5 years ago

I just tried it again with my spare adapter... after uploading the 1.13.5 firmware, the device came up with default configuration and in AP mode with default password. After reconfiguring the S31 with a new admin password and the credentials for my IoT wifi, it connects. Trying to access it from the browser leads to Firefox reporting that a file is to be downloaded. After disconnecting the S31 from power and reconnecting it, the web interface was accessible.

The initial registration with Automation Manager did work (the spare device was not registered there), but the configuration loss should in my eyes not happen, as this causes a resetup of all my 10 S31 to get them working again.

A restore attempt of the saved setup did not restore the hostname.

---8<-------

[093863] [MAIN] ESPURNA 1.13.5 (83195fd) [093864] [MAIN] xose.perez@gmail.com [093867] [MAIN] http://tinkerman.cat

[093868] [MAIN] CPU chip ID: 0xCD74D7 [093869] [MAIN] CPU frequency: 80 MHz [093870] [MAIN] SDK version: 1.5.3(aec24ac9) [093871] [MAIN] Core version: 2.3.0 [093872] [MAIN] Core revision: 159542381 [093873] [093874] [MAIN] Flash chip ID: 0x1640EF [093879] [MAIN] Flash speed: 40000000 Hz [093881] [MAIN] Flash mode: DOUT [093884] [093885] [MAIN] Flash size (CHIP) : 4194304 bytes / 1024 sectors ( 0 to 1023) [093888] [MAIN] Flash size (SDK) : 1048576 bytes / 256 sectors ( 0 to 255) [093889] [MAIN] Reserved : 4096 bytes / 1 sectors ( 0 to 0) [093893] [MAIN] Firmware size : 490480 bytes / 120 sectors ( 1 to 120) [093895] [MAIN] Max OTA size : 532480 bytes / 130 sectors ( 121 to 250) [093898] [MAIN] EEPROM size : 4096 bytes / 1 sectors ( 251 to 251) [093900] [MAIN] Reserved : 16384 bytes / 4 sectors ( 252 to 255) [093903] [093905] [MAIN] EEPROM sectors: 251, 250 [093908] [MAIN] EEPROM current: 251 [093909] [093912] [MAIN] EEPROM: 4096 bytes initially | 1216 bytes used (29%) | 2880 bytes free (70%) [093917] [MAIN] Heap : 34912 bytes initially | 19344 bytes used (55%) | 15568 bytes free (44%) [093918] [MAIN] Stack : 4096 bytes initially | 1216 bytes used (29%) | 2880 bytes free (70%) [093922] [093924] [MAIN] Boot version: 31 [093925] [MAIN] Boot mode: 1 [093926] [MAIN] Last reset reason: Reboot from web interface [093927] [093930] [MAIN] Board: ITEAD_SONOFF_S31 [093932] [MAIN] Support: ALEXA API BROKER BUTTON DEBUG_TELNET DEBUG_WEB DOMOTICZ HOMEASSISTANT LED MDNS_SERVER MQTT NTP SCHEDULER SENSOR TELNET TERMINAL THINGSPEAK WEB [093935] [MAIN] Sensors: CSE7766 [093937] [MAIN] WebUI image: SENSOR [093940] [093941] [MAIN] Firmware MD5: db66b90b519bbea28195997c33fd3896 [093943] [MAIN] Power: 3162 mV [093944] [MAIN] Power saving delay value: 10 ms [093945]

---8<-------

AlphaDE commented 5 years ago

After

the web interface was blank and the device did not react anymore. A double power cycle brought the device back, with power back to J instead of kWh.

A second attempt on that (change to kWh, save, refresh status page) though worked.

lucaberta commented 5 years ago

I have been experiencing very similar problems in updating different products from 1.13.3 to 1.13.5.

In all cases, the SoftAP came up again after a minute or so, sometimes keeping the old device name and host password as SSID and wifi password, other times reverting completely to the default ESPURNA-MACADDR hostname and fibonacci password.

Moreover, devices which have been working flawlessly on a specific access point, now don't seem to be able to connect any longer, and by turning on a different access point the devices can regain network connectivity.

There are thus different issues which combine with one another, as SoftAP becomes active if no wifi connecting gets established, if I am not mistaken.

Still, why would the device lose completely the configuration sometimes, and other times not, that is not clear.

I am willing to provide further debug if needed.

mcspr commented 5 years ago

When running 1.13.5, does the device reboot when saving or webui simply disconnects so nothing can be saved? Notification led will briefly blink on reboot. Connectivity loss can explain some of the weird behaviour.

Config loss can be explained by bug in 1.13.3, when it reboots after writing new firmware. Sorry about that :/ Should not happen with 1.13.5 (https://github.com/xoseperez/espurna/compare/1.13.3...89ed0ca)

WiFi connectivity problems were discussed in #1574 . loopDelay value was reduced to 0 in 1.13.4 pre-built .bins and that caused biiig problems for certain devices, so now it is set to 1. Which is why I mentioned changing it back to 10 (old 1.13.3 value) by typing set loopDelay 10 and reset in the Debug panel, to see if restoring it helps.

edit: And I think Firefox may have some issues when changing networks, will look at that. edit2: Home page status updates are delayed, but that is not related.

lucaberta commented 5 years ago

@mcspr I skipped 1.13.4 altogether, and all my issues were on 1.13.3 to 1.13.5 upgrades, on multiple devices of different kinds.

Not sure then if the loopDelay issue is still there in the 1.13.5 builds or not, it doesn't appear so as @xoseperez seem to have pulled all the buggy 1.13.4 builds, but still I have been facing issues with the newer 1.13.5 and the wifi code, so I am not sure where the problem lies.

After recovering the device on 1.13.5, I was always able to have the web interface work fine, but it's just impossible to connect to the normal wifi access point I had been using with 1.13.3.

lucaberta commented 5 years ago

@mcspr FYI I am using Chrome on macOS to perform the upgrades, sometimes Chromium on Linux.

AlphaDE commented 5 years ago

@mcspr Any way to prevent that config loss? Reconfiguring a bunch of devices, where also the restore of the config backup fails and where the specific HTTP API Key is stored by Automation Manager to control the devices is not that funny...

mcspr commented 5 years ago

@lucaberta yes, but it is still a different from the old one, so may be worth checking. ofc maybe something else has changed the timing of things, but not much changed in the loop code.

@AlphaDE if possible, i'd try to powercycle device first (not sw reboot) and try using espota.py method instead of browser

AlphaDE commented 5 years ago

@mcspr Will try... why isn't it possible to restore a 1.13.3 configuration backup in 1.13.5?

lucaberta commented 5 years ago

@mcspr I am doing a few rollback tests to 1.13.3 and things are even more bizarre.

On a POW R2 I did an espota.py rollback to 1.13.3 and lost control of the device, which came up again shortly as AP with device name and admin password, so I could connect to it via the web interface again.

I then removed the wifi network and reconfigured them putting the first network on the AP which stopped working on 1.13.5 and the second one as backup, the one which always worked also with 1.13.5.

After reboot, I did not see the IP address up, so I connected again via the AP to 192.168.4.1 and lo and behold it looks like the wifi connection to the first wifi AP went fine, but the IP address now shows 0.0.0.0 and soft AP is clearly still working as that is how I connected to the configuration GUI.

This device had the same exact configuration on both releases, and I did NOT put the loopDelay value to 10 on it.

The plot thickens...

lucaberta commented 5 years ago

@mcspr after further testing with 1.13.3 rolled back from tests with 1.13.5, the units which previously could connect to the first wifi AP fine, and with 1.13.5 they stopped working on it, unfortunately STILL cannot connect to the original AP even with 1.13.3.

Not sure if anything has happened in any configuration parameter, but it looks like just a few devices now cannot successfully connect to the old and reliable wifi AP I have here.

Switching on and off the alternate wifi AP allowed me to stop traffic to the devices showing the falut, and to restore it back. In spite of the AP order on the wifi configuration, the first AP never worked, while the connections seem to work fine on the second one.

This, again, is verified also with 1.13.3 rolled back after tests with 1.13.5.

mcspr commented 5 years ago

@AlphaDE Maybe something is wrong with the backup contents? i.e. not closed brackets, last kv has comma, no commas between keys - config should be valid json. if it is, only requirement to restore is a must-have "app": "ESPURNA" key-value.

@lucaberta can you try running erase.config command? that is for SDK configuration, not espurna's. note: technically, it should be equivalent to the "erase wifi settings" in arduino ide. it is slightly incorrect about what to erase, but i think it still zeroes out the part that 1.13.3 crash might've corrupted

lucaberta commented 5 years ago

@mcspr the erase.config trick worked on two of the three devices where I have been experiencing the issues with returning to the standard wifi AP.

On the Sonoff TH, I was able to reconfigure the normal AP, while on the Sonoff POW R2 it did not work, and once again I went into soft AP, tried again with erase.config with no luck, and ended up using again the alternate AP I have.

All the tests were done using 1.13.3.

mcspr commented 5 years ago

@lucaberta I wonder if Arduino Core patch I linked can help, so it completely erases that part of the flash. You can try building with it applied via platformio (it will be in ~/.platformio/packages/framework... dir).

Or if you don't mind running random binaries from the internet, I can build it when I get back to the computer.

lucaberta commented 5 years ago

@mcspr I too will be away from my computers until tomorrow, so no worries, I can do further tests with any code you want me to test.

Thanks, Luca On 5 Mar 2019, 17:40 +0100, Max Prokhorov notifications@github.com, wrote:

@lucaberta I wonder if Arduino Core patch I linked can help, so it completely erases that part of the flash. You can try building with it applied via platformio (it will be in ~/.platformio/packages/framework... dir).

Or if you don't mind running random binaries from the internet, I can build it when I get back to the computer. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

AlphaDE commented 5 years ago

@mcspr

Here is that backup... looks ok... (just invalidated sensitive data). No clue why it does not restore.

{ "app": "ESPURNA", "version": "1.13.3", "backup": "1", "timestamp": "2019-03-03 13:26:25", "relayBoot0": "0", "relayPulse0": "0", "cfg": "3", "adminPass": "XXXXXXXXXXXXX", "btnDelay": "500", "ledMode0": "1", "alexaEnabled": "1", "relaySync": "0", "relayTime0": "0", "mqttGroupInv0": "0", "relayOnDisc0": "0", "wsAuth": "1", "apiEnabled": "1", "apiRestFul": "1", "apiRealTime": "0", "apiKey": "XXXXXXXXXXXXXXXX", "telnetSTA": "0", "wifiScan": "1", "ssid0": "XXXXXX", "pass0": "XXXXXXXXXXXX", "mask0": "255.255.255.0", "dns0": "8.8.8.8", "mqttEnabled": "0", "mqttPort": "1883", "mqttQoS": "0", "mqttRetain": "1", "mqttKeep": "300", "mqttUseSSL": "0", "mqttTopic": "{hostname}", "mqttUseJson": "0", "ntpServer": "pool.ntp.org", "ntpOffset": "60", "ntpDST": "1", "ntpRegion": "0", "dczEnabled": "0", "dczTopicIn": "domoticz/in", "dczTopicOut": "domoticz/out", "dczRelayIdx0": "0", "dczMagnitude0": "0", "dczMagnitude1": "0", "dczMagnitude2": "0", "dczMagnitude3": "0", "dczMagnitude4": "0", "dczMagnitude5": "0", "haEnabled": "0", "haPrefix": "homeassistant", "tspkEnabled": "0", "tspkRelay0": "0", "tspkMagnitude0": "0", "tspkMagnitude1": "0", "tspkMagnitude2": "0", "tspkMagnitude3": "0", "tspkMagnitude4": "0", "tspkMagnitude5": "0", "snsRead": "6", "snsReport": "10", "snsSave": "0", "pwrUnits": "0", "eneUnits": "0", "tmpUnits": "0", "tmpCorrection": "0", "humCorrection": "0", "snsResetTS": "2019-02-01 15:25:47", "hostname": "XXXXXXXX", "boardName": "ITEAD_SONOFF_S31" }

AlphaDE commented 5 years ago

@mcspr I tried it with a power cycle after OTA, via web and with espota.py, in both cases I ended up with a complete configuration loss and the device in AP mode with default password requiring complete reconfiguration.

So far, there seems to be NO WAY to upgrade from 1.13.3 to a newer firmware without completely reconfiguring the device. There is a lot seriously broken.

Is there any way to pass the configuration backup to the device together with the firmware update to avoid this?

AlphaDE commented 5 years ago

@mcspr And the config restore in 1.13.5 seems broken too, I managed to restore only 2 of 4 devices although the 1.13.3 created JSON backup is correct.

mcspr commented 5 years ago

@AlphaDE I hope that is a typo, because I meant power cycle before doing any OTA :)

Today's test is with newly flashed Basic + Firefox, so that may be bad comparison, - I can successfully upgrade from 1.13.3 to 1.13.5 and backup / restore config afterwards. Tried both of these when connected to the device directly and when it is connected to AP. Some small issues with direct connection, mostly related to response time, but everything else seemed fine

c/p json from https://github.com/xoseperez/espurna/issues/1595#issuecomment-469877058 works for me when i try to upload it Does it say why configuration upload is failing? What does the Debug log say? Can you try uploading it via separate HTTP endpoint using terminal app?

# cd into the directory with backup.json
$ curl -v -uadmin:<password> http://<device-address>/config -F "config=@backup.json"
AlphaDE commented 5 years ago

@mcspr

Oh, I misunderstood that... tried the power cycle after the OTA.

But anyway, after EVERY OTA, with .py or from web, the config is gone.

Restore of JSON config (backuped just before the OTA) works in most, but not all cases, no errors, it reports success but nothing is changed. Same for restoring attempts with curl on Powershell (being on Windows and not having a Linux here right now makes life somewhat difficult).

For debug, I look when I update the next, but after 6 devices and over an hour, I am little fed up.

lucaberta commented 5 years ago

@mcspr I have done a few tests with the POW-R2 special builds you created a day or two ago, and while finally the wifi access point issue has been fixed, there are still issues in upgrading from the two builds, exactly like @AlphaDE has been finding in this same thread.

While everything was fine the 1.13.3 special build, I tried upgrading to the 1.13.5 special build, and again the configuration was lost, and I was back to default settings. Moreover, somehow the admin password change window would not work properly, and I lost contact with the device, and had to flash it again using esptool.py.

The amount of time wasted on this botched release has been a nightmare...

mcspr commented 5 years ago

I share your frustration, but at least now we don't have 3 problems at once.

Does the configuration loss occur when upgrading or when already running 1.13.5? Another question is, how come upgrading to my binaries did not cause it? crash in the terminal shows the last reset reason and config reset will surely coincide with huuuge stack trace (which is overwriting configuration, sdk config etc.). Guard for this is already included in 1.13.5, so it should not overwrite configuration that way, but there still must be something missing if configuration loss happens anyways.

BTW, all these tests are done connected directly the device via it's softap? And does this softap come up on it's own or are you manually pressing the button twice?

AlphaDE commented 5 years ago

@mcspr At least for me, the configuration loss only occured on the upgrade, the upgraded devices (S31) run fine.

After upgrade, softap comes up, devices is reset to standard password.

Which binaries do you mean with "yours", is there a second set for download anywhere (for the S31), then I could try that.

As I have one S31 spare here, I can test back and forth, if you tell me what exactly to look for to help to determine the cause.

mcspr commented 5 years ago

@AlphaDE https://github.com/xoseperez/espurna/pull/1616#issuecomment-470608527, but that is only for erase.config command and only for pow-r2. I prepared a similar ones, based on https://github.com/xoseperez/espurna/compare/dev...mcspr:test/crash-dummy-3 (1.13.3) and https://github.com/xoseperez/espurna/compare/dev...mcspr:test/crash-dummy-5 (1.13.5)

espurna-dev.zip I also included the .elf files. They are only for debug and must not be flashed.

Summary of what's different from the base 1.13.3 / 1.13.5:

As I have one S31 spare here, I can test back and forth, if you tell me what exactly to look for to help to determine the cause.

Thank you! The goal here is to see where exactly is crash originates, by accessing http://<device>/crash or running crash command in the terminal after it happens.

One thing that I have not thought about earlier is Alexa support, so maybe disabling would help (terminal set alexaSupport 0 & reset)

AlphaDE commented 5 years ago

Alexa is disabled as Automation Manager is the central hub handling Alexa and more stable than letting that do the units themself (and shields them, too). Is that terminal command on Alexa the same as switching it off in the UI or is that on lower level?

So I should flash the 1.13.4, set it up and then try the upgrade to 1.13.6 (both dev) to see what happens?

mcspr commented 5 years ago

Terminal set is just a shortcut to apply a single key=value setting + reset the board. Web does the same thing (also note that it is the same key in the config file). I noticed that the config you posted earlier has Alexa support enabled.

So I should flash the 1.13.4, set it up and then try the upgrade to 1.13.6 (both dev) to see what happens?

Yes, checking both config safety and whether it crashes at all

AlphaDE commented 5 years ago

Ah, yes, one was left (I forgot to switch that off after reconfiguring after the failed upgrade to 1.13.5).

Downgrade from 1.13.5 to 1.13.4 dev worked fine, rather quick. Upgrade from there to 1.13.6 dev took the device longer to get back, I assume it rebooted twice. Both times no config loss, no crash info.

Direct upgrade from 1.13.3 to 1.13.6 dev full configuration loss.

Crossgrade 1.13.3 to 1.13.4 dev full configuration loss (and this due to re-setup no chance to get the crash info, /crash showed an empty screen)

mcspr commented 5 years ago

@AlphaDE bummer. i really hoped it would store something when upgrading from 1.13.3... thank you anyways!

btw have you ran crash command before checking /crash page contents? i noticed just now that i left out the code that marks it as read, so doing things in that order would render /crash empty, even if it is not really.

AlphaDE commented 5 years ago

@mcspr

After updating from 1.13.3 to 1.13.6 dev with full reset, I tried to access /crash (as trying to access the UI needs setting a new admin password with reconnect to AP and login). I could access /crash with the standard password, but got an empty page.

If there is anything else I can try, let me know...

mcspr commented 5 years ago

By default Core dumps crash log to serial, which can be accessed de-powering POW from the mains and attaching it to the 3v3 adapter and serial adapter. Unless it is a hardware watchdog reset, I would expect some crash text there.

Which got me thinking - @xoseperez comments mention that we should avoid triggering hw watchdog at custom_crash_callback, that is handling writing it to the flash. Both devices mentioned here, POW-R2 and S31, have a rather slow interface speed of 4800, because CSE7766 is using hw uart. Could it somehow trip there? At least that would explain why there is no crash log but the flash is erased anyways.

Because of the bug 1.13.3 reset can happen at any point. Best case is when system has finished loop(), worst case - when loop() is doing something important and we yield()ing into the system reboot timer.

lucaberta commented 5 years ago

@mcspr I had the biggest problems with a POW-R2 and given that the UART is taken by the connection with the current measuring chip, all these image changes have been a royal pain as I would normally do the basic configuration straight from the serial console giving commands to configure the adminPassword and wifi.

No can do on the POW-R2, and that meant an amazing waste of time.

Another device on which I had issues was a modified Sonoff RF Bridge, with which I also lost wifi connection, but using the trick of the different access point I was able to reconnect it to the local LAN.

Then, for some reason, it worked again on the normal wifi access point, without any apparent reason.

I did not want to touch it either, but it's running 1.13.5 without issues now.

Another device on which I had issues was a TH, and I have used the serial console a lot with it.

Happy to learn that the multiple issues have been fixed in the code, and I look forward to the new official build. But I will be very wary this time, and will only upgrade non-critical devices first!

AlphaDE commented 5 years ago

Some additional information... I see sudden reboots and device timeouts while running 1.13.5 and the 1.13.6-dev on my S31 devices.

Mainly I see that AutomationManager (Android app) reports timeouts and devices are not on the log, checking this a few minutes ago trying to access the device in question, firefox reported to get an octet stream and after I could finally access the device some minutes later, I saw an uptime of just 8 minutes. So the device rebooted (although I did not do that manually).

mcspr commented 5 years ago

Does the /crash thing show anything of value this time? On the 1.13.6-dev device

AlphaDE commented 5 years ago

[DEBUG] Latest crash was at 182957440 ms after boot [DEBUG] Reason of restart: 2 [DEBUG] Exception cause: 29 [DEBUG] epc1=0x40222ea5 epc2=0x00000000 epc3=0x00000000 [DEBUG] excvaddr=0x00000000 depc=0x00000000 [DEBUG] sp=0x3ffffe50 end=0x3fffffb0 [DEBUG] >>>stack>>> [DEBUG] 3ffffe50: 00000000 98f46194 40102196 00000100 [DEBUG] 3ffffe60: 7fffffff 3ffeb770 3ffeb770 3fff2840 [DEBUG] 3ffffe70: 3fff519c 3fff519c 3fffbef4 401004d8 [DEBUG] 3ffffe80: 00000000 98f46194 00002200 3fff2840 [DEBUG] 3ffffe90: 3fff519c 00000000 3fffbef4 40224aed [DEBUG] 3ffffea0: 3fffbef4 3fff519c 3fffbd64 4021a44e [DEBUG] 3ffffeb0: 3fff2a2c 3fff2a38 3fffbd64 4021a6c4 [DEBUG] 3ffffec0: 3fffbd64 3fff2848 3fff2844 4021a6f7 [DEBUG] 3ffffed0: 3fff2858 3fff6d2c 3fff2a38 4024323d [DEBUG] 3ffffee0: 00000000 9805a8c0 00000010 00000000 [DEBUG] 3ffffef0: 00000010 00000000 3fff1134 401075a0 [DEBUG] 3fffff00: 00000000 3ffed779 3ffed795 3fff2a30 [DEBUG] 3fffff10: 3ffeb7f6 3fff2a38 3fff6d2c 402419ed [DEBUG] 3fffff20: 3fff273c 3fff6a54 3fff6a54 3ffeff40 [DEBUG] 3fffff30: 00000000 3fff6d2c 0000001c 3fff6a54 [DEBUG] 3fffff40: 3ffeb7e8 00000000 3fff6d2c 40240de5 [DEBUG] 3fffff50: 6605a8c0 0000005e 00000000 00000033 [DEBUG] 3fffff60: 00000002 0000001a 402305d3 3ffee958 [DEBUG] 3fffff70: 3ffeb7c0 3fffdcc0 3ffeafb0 3ffeafb0 [DEBUG] 3fffff80: 40230546 3ffee958 00000000 3fff72ac [DEBUG] 3fffff90: 3fffdc80 00000000 3fff6d2c 40238a63 [DEBUG] 3fffffa0: 40000f49 3fffdab0 3fffdab0 40000f49 [DEBUG] <<<stack<<<

mcspr commented 5 years ago

using decoder.py:

$ ./decoder.py -p ESP8266 -e firmware.elf -s trace.txt
stack:
0x40102196: wDev_ProcessFiq at ??:?
0x401004d8: malloc at /home/worker/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.c:1664
0x40224aed: operator() at /builder/.piolibdeps/ESP Async WebServer/src/WebServer.cpp:79
  \-> inlined by: _M_invoke at /home/worker/.platformio/packages/toolchain-xtensa/xtensa-lx106-elf/include/c++/4.8.2/functional:2071
0x4021a44e: std::function<void (void*, AsyncClient*)>::operator()(void*, AsyncClient*) const at /builder/.piolibdeps/ESPAsyncTCP/src/ESPAsyncTCP.cpp:665
0x4021a6c4: AsyncServer::_accept(tcp_pcb*, signed char) at /builder/.piolibdeps/ESPAsyncTCP/src/ESPAsyncTCP.cpp:665
0x4021a6f7: AsyncServer::_s_accept(void*, tcp_pcb*, signed char) at /builder/.piolibdeps/ESPAsyncTCP/src/ESPAsyncTCP.cpp:665
0x4024323d: tcp_process at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/tcp_in.c:755
  \-> inlined by: tcp_input at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/tcp_in.c:340
0x401075a0: pvPortMalloc at /home/worker/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/heap.c:13
0x402419ed: ip_input at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/ipv4/ip.c:559
0x40240de5: ethernet_input at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/netif/etharp.c:1379
0x402305d3: pp_tx_idle_timeout at ??:?
0x40230546: pp_tx_idle_timeout at ??:?
0x40238a63: system_get_sdk_version at ??:?

Does not explain much though. Something in the HTTP server blows up the memory. (maybe related to the https://github.com/xoseperez/espurna/issues/1626, or the same cause as for the #1660)

AlphaDE commented 5 years ago

Further devices

[268566] [DEBUG] Latest crash was at 6187295 ms after boot [268567] [DEBUG] Reason of restart: 2 [268568] [DEBUG] Exception cause: 3 [268569] [DEBUG] epc1=0x4010011d epc2=0x00000000 epc3=0x00000000 [268570] [DEBUG] excvaddr=0x400361b8 depc=0x00000000 [268571] [DEBUG] >>>stack>>> [DEBUG] 3fff1ee0: 00000000 3fff1f70 402349bd 3ffef4a0 [DEBUG] 3fff1ef0: 3fff2134 00000703 00000703 4010020c [DEBUG] 3fff1f00: 3fff0c38 00000000 3fff65c4 4010068c [DEBUG] 3fff1f10: 3fff1f50 00000000 3fff0c38 401075b0 [DEBUG] 3fff1f20: 402340d2 3fff65c4 3fff65c4 40236643 [DEBUG] 3fff1f30: 40234f0a 3fff65c4 3fff65c4 40229f18 [DEBUG] 3fff1f40: 40216aa5 3fff0d58 00000000 40228e66 [DEBUG] 3fff1f50: 3fff0b00 3fff1fa0 00000000 40203efe [DEBUG] 3fff1f60: 3fff2134 000005ab 00000064 40224f61

[614032] [DEBUG] Latest crash was at 504055393 ms after boot [614033] [DEBUG] Reason of restart: 2 [614036] [DEBUG] Exception cause: 29 [614038] [DEBUG] epc1=0x4022c01f epc2=0x00000000 epc3=0x00000000 [614039] [DEBUG] excvaddr=0x00000000 depc=0x00000000 [614041] [DEBUG] sp=0x3ffffd80 end=0x3fffffb0 [614043] [DEBUG] >>>stack>>> [DEBUG] 3ffffd80: 3ffe9ac5 0000000f 3ffffdc4 4022ca1c [DEBUG] 3ffffd90: 3fffbcb8 00000000 3fffbc94 40221a6a [DEBUG] 3ffffda0: 00000000 00000000 00000000 00000000
AlphaDE commented 5 years ago

and while trying to read out the crash on the second one, another crash,

[070252] [WEBSOCKET] Requested action: dbgcmd [070255] [DEBUG] Latest crash was at 3617964 ms after boot [070257] [DEBUG] Reason of restart: 2 [070258] [DEBUG] Exception cause: 29 [070261] [DEBUG] epc1=0x40220a1e epc2=0x00000000 epc3=0x00000000 [070262] [DEBUG] excvaddr=0x00000000 depc=0x00000000 [070265] [DEBUG] sp=0x3fff1f60 end=0x3fff2160 [070266] [DEBUG] >>>stack>>> [DEBUG] 3fff1f60: 00000022 3fff7210 3fff1f90 40220a1c [DEBUG] 3fff1f70: 00000000 3fff746c 3fffbfcc 40220b32 [DEBUG] 3fff1f80: 3fffbf7c 3fff0794 3fffbf7c 4020dfc2 [DEBUG] 3fff1f90: 3fff1f98 3fffbf7c 3fff75bc 40220b6b [DEBUG] 3fff1fa0: 00000022 3fff0794 3fffbf7c 4020dffe [DEBUG] 3fff1fb0: 3fff71c4 3fff7218 3ffe8ea4 3fff71c4 [DEBUG] 3fff1fc0: 3fff7210 3fff2010 00000001 4020e360 [DEBUG] 3fff1fd0: 3ffe9450 3fff71c4

That device is on 1.13.5 and the crash while reading out happened again...

Is it possible that the webserver/REST api server has problems with many or parallel requests and crashes?

1.13.6-dev installed, this time the config survided, perhaps due to the reboot just before.

mcspr commented 5 years ago

Is it possible that the webserver/REST api server has problems with many or parallel requests and crashes?

I noticed that captive mode leaks memory (resulting in OOM error and crash) because of Not Found responses (#1660), but was not able to reproduce it when ESP is connected to an AP. Stack dumps are not specific enough, so that's my only guess for now that it may be connected to how HTTP server handles some of the connections and probably leaves connections open, so you crash immediately when accessing WebUI.

AlphaDE commented 5 years ago

As the controlling AM app generates quite a number of frequent requests, any way to help you to track this down?

mcspr commented 5 years ago

I'll try to debug with a live device again, maybe something will stand out (to avoid just guessing...)

And will also try to verify if it has something to do with network connectivity problems. I was testing a Basic some time ago for WiFi issues, and now I see free heap is rapidly decreasing at random. It returns after some time, jumping from 6k to 18k and back again. But I can't exactly understand the cause.

arihantdaga commented 5 years ago

I too faced similar issues and device lost configuration after update via Nofuss. I enabled debug in eeprom rotate and saw this -

Magic value for sector #1019 is 71
Calculated CRC: 0x8ADE
Stored CRC    : 0x8ADE
Magic value for sector #1018 is 12
Calculated CRC: 0x3A78
Stored CRC    : 0x40B9
Sector #1018 has not passed the CRC check
Magic value for sector #1017 is 0
Calculated CRC: 0x2DCD
Stored CRC    : 0x765F
Sector #1017 has not passed the CRC check
Magic value for sector #1016 is 161
Calculated CRC: 0xA924
Stored CRC    : 0xFFA0
Sector #1016 has not passed the CRC check
Current sector is #1019
Current magic value is #71

Clearly blocks 1016,1017 and 1018 are overwritten. While 1019 is intact so EEPROM Rotate picked up configuration from 1019, but since the latest data was not present in the sector 1019 . So the device lost some configuration keys.

But once again question is, why are these sectors are being overwritten during OTA. For information - I did a custom build with no SPIFSS enabled(In both previous and new binary).

I also disabled calling of custom_crash_callback as well, via an addition flag
#define SAVE_CRASH_INFO 0.

Is this the same problem or a different one ? Can someone help me figure out this one ?

mcspr commented 5 years ago

@arihantdaga Do you mean that you have if SAVE_CRASH_INFO around crash callback declaration in your code? https://github.com/xoseperez/espurna/commit/08a748e05d356209963aadb2768d19e0a40d9713#diff-28aabe0a6f392137d6168ba1d2d67f8aR68 I added a small patch checking this at runtime, since crash handler is disabled anyways with DEBUG_SUPPORT=0. You can also disable just the stack trace but keep the reset info

You sure this is related and not a NoFUSS problem (i.e. needs a separate issues)? Do you mind checking serial dumps if you don't use POW / S31? (note: they also don't produce them anyways when building with platformio, see DISABLE_POSTMORTEM_STACKDUMP. arduino ide is not affected)

arihantdaga commented 5 years ago

Thank you @mcspr Yeah, in my case _save_crash_enabled is disabled and this is not happening due to saving crash info. And yes i think it is not an issue exactly with Espurna, but something with nofuss or esp8266httpUpdater itself(Which is used by Nofuss). However its note worthy for all espura users as well, that if we build it with latest core, and deliver OTAs to devices, few sectors will be overwritten. Depending on the code size it may overflow all 4 sectors also. Anyway, i'll create another issue on arduino esp8266 core as well, but the same could also be true for ArduinoOta library which is used for web based OTA.

Sectors from 378000 to 3f9000,3f9000 to 3fa000 , 3fa000 to 3fb000, 3fb000 to 3fc000 were used by eeprom rotate. And were clean. In flash dump also as expected. Then i took the dump of these same sectors right after OTA and saw these sectors were filled with something.

Screen Shot 2019-07-04 at 5 49 00 PM
arihantdaga commented 5 years ago

Also i checked the code of esp8266httpUpdater. It is internally using the same Update Class which is used directly in espurna's ota.ino. https://github.com/esp8266/Arduino/blob/16312949c976629db948b1daa635948064cfd037/libraries/ESP8266httpUpdate/src/ESP8266httpUpdate.cpp#L458

So i think this is all related, and could be related to Update Class.. What do you think @mcspr ?