zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

LoRaWAN: FUOTA callback function mistakenly called after early exit failure from too many missing fragments #72764

Open knitHacker opened 2 months ago

knitHacker commented 2 months ago

Describe the bug The issue occurs when trying to complete a FUOTA session through the LoRaWAN stack. A log message indicates that the FUOTA session has completed and the user's callback function is called, even though the session failed due to missing fragments from the block not being received.

In a FUOTA session, if the number of missing fragments is larger than the maximum redundancy, the function FragDecoderProcess (in the loramac-node code base) returns FRAG_SESSION_FINISHED and marks the error by setting FragDecoder.Status.MatrixError to 1. When processing data fragments, the LoRaWAN frag_transport service assumes that a FRAG_SESSION_FINISHED return value means the session completed successfully and finishes up the session accordingly, without checking the MatrixError value (called memory_error in the Zephyr LoRaWAN service code). It also calls the user's callback function, which assumes the FUOTA session succeeded, and prints misleading log messages indicating success. The only place memory_error appears to be examined is in the frag status response message.
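To illustrate the missing check, here is a minimal sketch. FragDecoderProcess, FragDecoderGetStatus, FragDecoderStatus_t, and FRAG_SESSION_FINISHED are from the loramac-node FragDecoder API; the handler and callback names below are hypothetical stand-ins, not the actual frag_transport code:

```c
/* Sketch only: a "finished" return with MatrixError set should be
 * treated as a failed session, not a successful one. */
#include <zephyr/logging/log.h>
#include "FragDecoder.h" /* loramac-node fragmentation decoder API */

LOG_MODULE_REGISTER(frag_sketch, LOG_LEVEL_INF);

/* Hypothetical user callback registered for FUOTA completion. */
static void (*transport_finished_cb)(void);

static void handle_data_fragment(uint16_t frag_counter, uint8_t *frag_data)
{
	int32_t ret = FragDecoderProcess(frag_counter, frag_data);

	if (ret == FRAG_SESSION_FINISHED) {
		FragDecoderStatus_t status = FragDecoderGetStatus();

		if (status.MatrixError != 0U) {
			/* Too many fragments were lost for the redundancy
			 * to recover; do NOT report success or run the
			 * user callback. */
			LOG_ERR("FUOTA failed: %u fragments lost",
				status.FragNbLost);
			return;
		}

		LOG_INF("FUOTA transport finished successfully");
		if (transport_finished_cb != NULL) {
			transport_finished_cb();
		}
	}
}
```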

Please also mention any information which could help others to understand the problem you're facing: I am working on a custom board with an STM32WL and using AWS as the FUOTA server.

To Reproduce Steps to reproduce the behavior: Set CONFIG_LORAWAN_FRAG_TRANSPORT_MAX_REDUNDANCY very low and start a session with a large file, to force more missing fragments than the redundancy can account for when reassembling the file.
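For example, a prj.conf along these lines (the surrounding options are illustrative; the redundancy value just needs to be small relative to the expected fragment loss):

```
CONFIG_LORAWAN=y
CONFIG_LORAWAN_SERVICES=y
CONFIG_LORAWAN_FRAG_TRANSPORT_MAX_REDUNDANCY=5
```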

I believe I first saw it when trying to do multiple AWS FUOTA sessions in succession while debugging. When I started a new session before the previous one was complete, AWS would send the new session setup message and then resume sending from where it left off in the previous session. This meant the device didn't receive the first part of the block, or worse, received only the redundancy packets without any of the original fragments.

Expected behavior It should not call the user callback function that is meant to run after a successful FUOTA session, and it should not log that the FUOTA session finished successfully.

Impact The code will try to load an incomplete binary into memory to be booted on the next reboot, which may occur right after the user's callback is called, since triggering the reboot is the suggested usage for that callback.
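For context, a typical FUOTA-finished callback (modeled loosely on the Zephyr FUOTA sample; the exact contents of a user's callback will differ) marks the newly written image and reboots, which is exactly why invoking it after a failed session is harmful:

```c
#include <zephyr/dfu/mcuboot.h>
#include <zephyr/sys/reboot.h>

/* Typical FUOTA-finished callback: if the stack invokes this after a
 * failed session, the device reboots into an incomplete image. */
static void fuota_finished(void)
{
	/* Mark the newly received image so MCUboot test-boots it. */
	boot_request_upgrade(BOOT_UPGRADE_TEST);
	sys_reboot(SYS_REBOOT_COLD);
}
```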

Environment (please complete the following information):

github-actions[bot] commented 2 months ago

Hi @knitHacker! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

martinjaeger commented 2 months ago

Thanks @knitHacker for your report. I'm quite busy with other work at the moment, but will have a look asap, probably next week.

martinjaeger commented 1 month ago

@knitHacker I tried to reproduce your reported issue by adding another test case in tests/subsys/lorawan/frag_decoder, but I couldn't make it fail even with a very low redundancy setting. Could you have a look and give me some guidance (or even push a failing test case) to reproduce the issue you described?