zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.64k stars 6.52k forks source link

MCUmgr: HardFault due to unaligned uint16_t access at `mcumgr_serial_process_flag` call to `sys_be16_to_cpu` #78873

Open doodek opened 2 weeks ago

doodek commented 2 weeks ago

Describe the bug Hard-Fault arising from unaligned access at mgmt/mcumgr/serial_util.c:87

The (see below) workaround, ensuring aligned access, has fixed this for me.

Workaround Replace op = sys_be16_to_cpu(*(uint16_t *)frag); at the line specified, with the following:

    if ((uintptr_t)frag % 2 != 0) {
        // Handle unaligned access 
        uint16_t temp;
        memcpy(&temp, frag, sizeof(temp));
        op = sys_be16_to_cpu(temp);
    } else {
        op = sys_be16_to_cpu(*(uint16_t *)frag);
    }

Reproduce Honestly no idea how to make this misaligned manually. Possibly try to use MCUmgr update utility over shell over UART, and then inject an unaligned pointer in frag argument with a debugger?

Expected behavior No hard faults

Impact Showstopper - can't deploy a device without a working update mechanism

Logs and console output Suspicious mcumgr comm dump

ALQCAACqAAEHAaRjbGVuGgABQyBjc2hhWCCHWXcoIB2fCRhQFXkytBgER7XzK9mibt3uwmSkHZYKdmNvZmYAZGRhdGFYbj2485YAAAAAAAIAANA/AQAAAAAAAwUA
\04\14AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAARY8=
\06 ABILAAAIAAEHAb9jb2ZmGG7/OBs=
E: *\06 ALQKAACqAAEIAaJjb2ZmGG5kZGF0YVicAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
\04\14AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAej0=
*�&'�(�UHY\�\16\1eW��\0b��\0bj\14U\15\12�H�J\02\1a��ɕ�сѡɕ��遂������ʲ�\02B�����ݹ�5R*�\02B��ѥ������ѕ�5*�\06    ALQKAACqAAEIAaJjb2ZmGG5kZGF0YVicAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
\04\14AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAej0=
\06

Stack trace:

arch_system_halt@0x0801896e (/workspaces/zephyr-ci/zephyr/kernel/fatal.c:30)
k_sys_fatal_error_handler@0x08014042 (/workspaces/zephyr-ci/zephyr/kernel/fatal.c:44)
z_fatal_error@0x080140ac (/workspaces/zephyr-ci/zephyr/kernel/fatal.c:118)
z_arm_fatal_error@0x0800ec26 (/workspaces/zephyr-ci/zephyr/arch/arm/core/fatal.c:86)
z_arm_fault@0x0800ecf4 (/workspaces/zephyr-ci/zephyr/arch/arm/core/cortex_m/fault.c:1157)
z_arm_hard_fault@0x0800ed6c (/workspaces/zephyr-ci/zephyr/arch/arm/core/cortex_m/fault_s.S:102)
<signal handler called>@0xfffffffd (Unknown Source:0)
mcumgr_serial_process_frag@0x080105b2 (/workspaces/zephyr-ci/zephyr/subsys/mgmt/mcumgr/transport/src/serial_util.c:88)
smp_shell_process@0x08010550 (/workspaces/zephyr-ci/zephyr/subsys/mgmt/mcumgr/transport/src/smp_shell.c:195)
update@0x080165e2 (/workspaces/zephyr-ci/zephyr/subsys/shell/backends/shell_uart.c:503)
shell_thread@0x0800e8b8 (/workspaces/zephyr-ci/zephyr/subsys/shell/shell.c:1369)
shell_thread@0x0800e8b8 (/workspaces/zephyr-ci/zephyr/subsys/shell/shell.c:1324)
z_thread_entry@0x08015da6 (/workspaces/zephyr-ci/zephyr/lib/os/thread_entry.c:48)
z_thread_entry@0x08015da6 (/workspaces/zephyr-ci/zephyr/lib/os/thread_entry.c:48)
??@0x377f5088 (Unknown Source:0)

ESF:

basic:
{...}
a1:
0x2000789c
r0:
0x2000789c
a2:
0x1c
r1:
0x1c
a3:
0x200
r2:
0x200
a4:
0x2000709c
r3:
0x2000709c
ip:
0x20006ebb
r12:
0x20006ebb
lr:
0x80116eb
r14:
0x80116eb
pc:
0x80105b2
r15:
0x80105b2
xpsr:
0x2100000

Environment (please complete the following information):

Additional context MCUmgr over Shell over UART, 9600 baudrate prj.conf extract:

CONFIG_SIZE_OPTIMIZATIONS=y
CONFIG_HEAP_MEM_POOL_SIZE=8192
CONFIG_MCUMGR_GRP_IMG_ALLOW_ERASE_PENDING=y
CONFIG_MCUMGR_GRP_IMG_DIRECT_UPLOAD=y
CONFIG_MCUMGR_GRP_OS_BOOTLOADER_INFO=y
CONFIG_MCUMGR_SMP_VERBOSE_ERR_RESPONSE=n
CONFIG_MCUMGR_GRP_IMG_VERBOSE_ERR=y
CONFIG_MCUMGR_GRP_IMG_UPDATABLE_IMAGE_NUMBER=2
CONFIG_UPDATEABLE_IMAGE_NUMBER=2
CONFIG_MCUBOOT_BOOTUTIL_LIB=y
CONFIG_MCUMGR_TRANSPORT_SHELL=y
CONFIG_MCUMGR_TRANSPORT_NETBUF_SIZE=512
CONFIG_MCUMGR_TRANSPORT_NETBUF_COUNT=4
CONFIG_MCUMGR_TRANSPORT_SHELL_RX_BUF_COUNT=4
CONFIG_MCUMGR_TRANSPORT_SHELL_MTU=384
CONFIG_MCUMGR_GRP_OS=y
CONFIG_MCUBOOT_IMG_MANAGER=y
CONFIG_MCUBOOT_UTIL_LOG_LEVEL_ERR=y
CONFIG_IMG_MANAGER=y
CONFIG_MCUMGR_GRP_IMG=y
doodek commented 2 weeks ago

possibly related #49066

nordicjm commented 2 weeks ago

Doesn't make sense what you are getting, that is either called from https://github.com/zephyrproject-rtos/zephyr/blob/4c9b30836770cba93a0212bc3fae5de20ba6e596/subsys/mgmt/mcumgr/smp_shell.c#L146 or https://github.com/zephyrproject-rtos/zephyr/blob/4c9b30836770cba93a0212bc3fae5de20ba6e596/subsys/mgmt/mcumgr/smp_uart.c#L41 both of which should be word aligned, it then directly uses the buf object, which again as can be seen from https://github.com/zephyrproject-rtos/zephyr/blob/main/include/zephyr/drivers/console/uart_mcumgr.h#L26 is word aligned since the void * object will be 4 bytes, will try this on an nrf51 board later to check but I'm pretty sure I've tested that recently over UART and it's worked fine

nordicjm commented 2 weeks ago

If you could put a breakpoint on those calls and see what address the buf and buf->data fields have it would be helpful

nordicjm commented 2 weeks ago

Cannot reproduce on a cortex m0, the address is word-aligned:

Breakpoint 1, mcumgr_serial_process_frag (rx_ctxt=rx_ctxt@entry=0x200019a8 <smp_shell_rx_ctxt>, frag=0x20005270 <net_buf_data_smp_shell_rx_pool> "\006\tABUKAAALAAkBAKFkYXJndoFjbG9sUQ8=\nnVwdGltZemC\n\377\263\177\001`\367-\001\004\337\357\023\004\337\377", frag_len=35) at /tmp/aa/zephyr/subsys/mgmt/mcumgr/transport/src/serial_util.c:77
77              if (rx_ctxt->nb == NULL) {
(gdb) p/x frag
$1 = 0x20005270
jhedberg commented 2 weeks ago

If the pointer is not always guaranteed to be correctly aligned, Zephyr already has a helper function sys_get_be16() that you should probably use.

nordicjm commented 2 weeks ago

If the pointer is not always guaranteed to be correctly aligned, Zephyr already has a helper function sys_get_be16() that you should probably use.

Network buffers should be properly aligned though? https://docs.zephyrproject.org/latest/kconfig.html#CONFIG_NET_BUF_ALIGNMENT

Alignment restriction for network buffers. This is useful for some hardware IP with DMA that requires the buffers to be aligned to a certain byte boundary, or dealing with cache line restrictions. Default value of 0 means the alignment will be the size of a void pointer, any other value will force the alignment of a net buffer in bytes.

jhedberg commented 2 weeks ago

Network buffers should be properly aligned though? https://docs.zephyrproject.org/latest/kconfig.html#CONFIG_NET_BUF_ALIGNMENT

That just guarantees you that the beginning of the net_buf payload buffer is aligned. Depending on what else is in the buffer or if there's e.g. some reserved headroom, buf->data can point at any position within the payload buffer. I'm not familiar with this specific subsystem or its use of buffers, but in general if you're parsing binary data received from a remote device, making alignment assumptions is a recipe for fragile code.

nordicjm commented 2 weeks ago

Network buffers should be properly aligned though? https://docs.zephyrproject.org/latest/kconfig.html#CONFIG_NET_BUF_ALIGNMENT

That just guarantees you that the beginning of the net_buf payload buffer is aligned. Depending on what else is in the buffer or if there's e.g. some reserved headroom, buf->data can be point at any position within the payload buffer. I'm not familiar with this specific subsystem or its use of buffers, but in general if you're parsing binary data received from a remote device, making alignment assumptions is a recipe for fragile code.

This is using the start of the buffer, the first byte is what is being looked at here. Shell -> network buffer -> buffers up until full length payload is received -> base64 decode of data to new network buffer -> check op-code (first byte) crash here on the OP's device, no crash on mine

nordicjm commented 2 weeks ago

Code:

void smp_shell_process(struct smp_shell_data *data)
{
    struct net_buf *buf;
    struct net_buf *nb;

    while (true) {
        buf = net_buf_get(&data->buf_ready, K_NO_WAIT);
        if (!buf) {
            break;
        }

        nb = mcumgr_serial_process_frag(&smp_shell_rx_ctxt,
                        buf->data,
                        buf->len);
        if (nb != NULL) {
            zephyr_smp_rx_req(&smp_shell_transport, nb);
        }
...

struct net_buf *mcumgr_serial_process_frag(
    struct mcumgr_serial_rx_ctxt *rx_ctxt,
    const uint8_t *frag, int frag_len)
{
...

    op = sys_be16_to_cpu(*(uint16_t *)frag); <--- here
jhedberg commented 2 weeks ago

@nordicjm thanks, I can see that. An equally important place to consider is this: https://github.com/zephyrproject-rtos/zephyr/blob/99e6280d7e22552de9a94992b626acdcbde00fee/subsys/mgmt/mcumgr/transport/src/smp_shell.c#L149-L167

There we can see that between allocation and inserting the buffer into a fifo all the code is doing is net_buf_add_u8() which will not modify the buf->data pointer, i.e. it should still point at the beginning of the payload buffer.

Either way, the code looks IMO fragile since mcumgr_serial_process_frag() receives a uint8_t * and not a uint16_t *, i.e. at the very least I'd place an assert in there to indicate the assumptions it makes about the alignment of the pointer.

jhedberg commented 2 weeks ago

Btw, the references in this report seem to be against some older version of Zephyr, since the c-files in question are found in a different location in the current main branch. Maybe that's one potential reason for the inability to reproduce the issue?

Edit: the report does say "Zephyr OS 3.7.0"