zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.88k stars 6.63k forks source link

Temporary Modbus Client "Soft Lock" #79518

Open legoabram opened 1 month ago

legoabram commented 1 month ago

Describe the bug Using the modbus subsystem, a response returned after a request times out would increment the semaphore and stay until the next request is made which will immediately return when k_sem_take is called even before a response is returned. This will once again have the same problem when the actual response arrives. The result is that the modbus client will temporarily "soft lock".

To fix this, the wait semaphore just needs to be reset before transmitting.

To Reproduce I don't have an easy way to reproduce this yet. When I have time, I'll create a test case in the modbus twister test to reproduce this behavior.

Expected behavior Modbus client should recover for the next transmission, and ignore previously received information.

Impact In our case, this temporarily blocks any and all communication with the target device for several if not tens of seconds. Theoretically, this particular lockup could happen indefinitely.

Logs and console output This is an annotated log of what is happening, with a couple additional logs added to the modbus source code to make it clearer.

Start of Transmission:
[00:00:57.281,000] <dbg> modbus_serial: rtu_tx_adu: uart_buf
                                        01 03 84 f8 00 01 2d 0b                          |......-.
[00:00:57.281,000] <dbg> modbus_serial: rtu_tx_adu: Start frame transmission

Reception of Response:
[00:00:57.294,000] <dbg> modbus_serial: rtu_tmr_handler: uart_rx_buf
                                        01 03 02 00 04 b9 87                             |.......
Processed in Work Queue:
[00:00:57.294,000] <err> modbus_serial: modbus_rtu_rx_adu

Start of Transmission:
[00:00:57.320,000] <dbg> modbus_serial: rtu_tx_adu: uart_buf
                                        01 03 84 f9 00 01 7c cb                          |......|.
[00:00:57.320,000] <dbg> modbus_serial: rtu_tx_adu: Start frame transmission

Reception of Response:
[00:00:57.333,000] <dbg> modbus_serial: rtu_tmr_handler: uart_rx_buf
                                        01 03 02 00 00 b8 44                             |......D

No Processing in Work Queue

Start of Transmission:
[00:00:57.495,000] <dbg> modbus_serial: rtu_tx_adu: uart_buf
                                        01 03 84 fa 00 01 8c cb                          |........
[00:00:57.495,000] <dbg> modbus_serial: rtu_tx_adu: Start frame transmission

Reception of Response:
[00:00:57.508,000] <dbg> modbus_serial: rtu_tmr_handler: uart_rx_buf
                                        01 03 02 00 00 b8 44                             |......D

No Processing in Work Queue

Repeat adnausium....

Start of Transmission:
[00:01:00.165,000] <dbg> modbus_serial: rtu_tx_adu: uart_buf
                                        01 03 85 0f 00 01 9d 05                          |........
[00:01:00.165,000] <dbg> modbus_serial: rtu_tx_adu: Start frame transmission

Reception of Response:
[00:01:00.179,000] <dbg> modbus_serial: rtu_tmr_handler: uart_rx_buf
                                        01 03 02 1a 0c b3 21                             |......!

Finally Processed in Work Queue. Delayed for almost 3 seconds. By this point, the last request has already timed out, but the semaphore is incremented anyway.
[00:01:00.326,000] <err> modbus_serial: modbus_rtu_rx_adu

Start of Transmission:
[00:01:00.340,000] <dbg> modbus_serial: rtu_tx_adu: uart_buf
                                        02 03 85 0f 00 01 9d 36                          |.......6
[00:01:00.340,000] <dbg> modbus_serial: rtu_tx_adu: Start frame transmission

Immediate "Response" since the semaphore is incremented, but with the response to the previous entry.
[00:01:00.340,000] <err> modbus_c: Failed to validate unit ID or function code

Reception of Response:
[00:01:00.353,000] <dbg> modbus_serial: rtu_tmr_handler: uart_rx_buf
                                        02 03 02 1a 29 36 fa                             |....)6.

Processed in Work Queue, but already aborted earlier, so it causes the problem again:
[00:01:00.353,000] <err> modbus_serial: modbus_rtu_rx_adu

Repeat adnausium....

Environment:

jfischer-no commented 1 month ago

Not sure the fix needs to be backported to v3.5(EOL)

legoabram commented 1 month ago

I'd prefer if we could since it makes our bug fix path much easier, but I'd understand if you decide against it.

nordicjm commented 1 month ago

3.5 is not supported, no PRs allowed against that branch