zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.62k stars 6.5k forks source link

Usage faults on semaphore usage in driver (stm32l1) #46049

Closed soerlemans closed 2 years ago

soerlemans commented 2 years ago

Me and my team of students are working on a driver for an Ublox SARA N310. We are planning on merging this driver into Zephyr when the driver is complete.

We currently have a critical issue where a semaphore causes a usage fault deep into Zephyr itself. When we use a response semaphore for the modem_cmd_send() function it gives a usage fault. We call this function via a wrapper function called send_at_command() its does nothing but forward the arguments to the modem_cmd_send() function. The send_at_command() function gets used by a number of our public functions defined in drivers header file.

The catch is that the function does not fail on the first few calls. It fails after a set amount of calls to the function in rapid succession. The amount can vary when I added a new public function that used send_at_command() function the problem shifted a few functions down (my guess is that something is going wrong in the stack). On some functions it always fails on the semaphore and gives a usage fault. Sometimes when a function is called twice it will fail. With a usage fault on the second call. Different public functions give different usage faults.

The problem does not occur when slowly stepping through the code with a debugger. I included the result of a thread apply all bt in a file for review. The problem happens in modem_cmd_send_ext() on the second call to k_sem_take() (line 527). The usage fault is invoked.

thread apply all bt:

(gdb) thread apply all bt
Info : Getting thread 536874392 reg list

Thread 6 (Thread 536874392):
#0  arch_swap (key=key@entry=0) at /home/hackerman/Zephyr/zephyrproject/zephyr/arch/arm/core/aarch32/swap.c:53
#1  0x08009874 in z_swap_irqlock (key=0) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/include/kswap.h:196
#2  z_swap (key=..., lock=0x200015ec <pending_cancels>) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/include/kswap.h:196
#3  z_pend_curr (lock=lock@entry=0x200015ec <pending_cancels>, key=..., wait_q=wait_q@entry=0x20000b28 <mdata+832>, timeout=...) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/sched.c:815
#4  0x08009140 in z_impl_k_sem_take (sem=sem@entry=0x20000b28 <mdata+832>, timeout=...) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/sem.c:143
#5  0x0800f238 in k_sem_take (timeout=..., sem=0x20000b28 <mdata+832>) at /home/hackerman/Zephyr/zephyrproject/zephyr/build/zephyr/include/generated/syscalls/kernel.h:1035
#6  modem_cmd_send_ext (iface=iface@entry=0x20000814 <mdata+44>, handler=handler@entry=0x20000824 <mdata+60>, handler_cmds=handler_cmds@entry=0x0, handler_cmds_len=handler_cmds_len@entry=0, buf=buf@entry=0x20002164 <z_main_stack+124> "AT+ULGASP=2,last gasp,0,,17,192.168.0.10:5050,,1,1,1", sem=sem@entry=0x20000b28 <mdata+832>, timeout=timeout@entry=..., flags=flags@entry=0) at /home/hackerman/Zephyr/zephyrproject/zephyr/drivers/modem/modem_cmd_handler.c:527
#7  0x0800f730 in modem_cmd_send (timeout=..., sem=0x20000b28 <mdata+832>, buf=0x20002164 <z_main_stack+124> "AT+ULGASP=2,last gasp,0,,17,192.168.0.10:5050,,1,1,1", handler_cmds_len=0, handler_cmds=0x0, handler=0x20000824 <mdata+60>, iface=0x20000814 <mdata+44>) at /home/hackerman/Zephyr/zephyrproject/zephyr/drivers/modem/modem_cmd_handler.h:226
#8  send_at_command (iface=iface@entry=0x20000814 <mdata+44>, handler=handler@entry=0x20000824 <mdata+60>, handler_cmds=handler_cmds@entry=0x0, handler_cmds_len=handler_cmds_len@entry=0, buf=buf@entry=0x20002164 <z_main_stack+124> "AT+ULGASP=2,last gasp,0,,17,192.168.0.10:5050,,1,1,1", sem=sem@entry=0x20000b28 <mdata+832>, timeout=..., tx_lock=tx_lock@entry=true) at /home/hackerman/Zephyr/zephyrproject/zephyr/drivers/modem/ublox-sara-n310.c:190
#9  0x080078d6 in n310_set_last_gasp (gpio_pin=gpio_pin@entry=16, text=text@entry=0x8010ce4 "Hello there!", ip_proto=ip_proto@entry=17, ip_addr=ip_addr@entry=0x8010cd0 "192.168.0.10:5050", tx_cnt=tx_cnt@entry=1, shutdown=shutdown@entry=1) at /home/hackerman/Zephyr/zephyrproject/zephyr/include/sys/time_units.h:405
#10 0x08000632 in main () at /home/hackerman/Zephyr/zephyrproject/zephyr/samples/drivers/ublox_sara_n310/src/main.c:59
Info : Getting thread 536874224 reg list

Thread 5 (Thread 536874224):
#0  arch_cpu_idle () at /home/hackerman/Zephyr/zephyrproject/zephyr/arch/arm/core/aarch32/cpu_idle.S:108
#1  0x0800fdc0 in k_cpu_idle () at /home/hackerman/Zephyr/zephyrproject/zephyr/include/kernel.h:5659
#2  idle (unused1=<optimized out>, unused2=<optimized out>, unused3=<optimized out>) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/idle.c:86
#3  0x0800a8b0 in z_thread_entry (entry=0x800fdab <idle>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /home/hackerman/Zephyr/zephyrproject/zephyr/lib/os/thread_entry.c:36
#4  0x74726174 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Info : Getting thread 536872520 reg list

Thread 4 (Thread 536872520):
#0  cbvprintf (out=out@entry=0x800b53b <out_func>, ctx=0x8010f2c <log_output_uart>, fp=0x8010e77 "> ", fp@entry=0x1 "$", ap=...) at /home/hackerman/Zephyr/zephyrproject/zephyr/lib/os/cbprintf_complete.c:1717
#1  0x08001f1a in print_formatted (output=<optimized out>, fmt=0x8010e74 "<%s> ") at /home/hackerman/Zephyr/zephyrproject/zephyr/subsys/logging/log_output.c:144
#2  0x080020fa in ids_print (output=output@entry=0x8010f2c <log_output_uart>, level_on=level_on@entry=true, func_on=func_on@entry=false, domain_id=domain_id@entry=0, source_id=source_id@entry=2, level=level@entry=3) at /home/hackerman/Zephyr/zephyrproject/zephyr/subsys/logging/log_output.c:265
#3  0x08002188 in prefix_print (output=output@entry=0x8010f2c <log_output_uart>, flags=flags@entry=15, func_on=func_on@entry=false, timestamp=17911, level=level@entry=3 '\003', domain_id=0 '\000', source_id=2) at /home/hackerman/Zephyr/zephyrproject/zephyr/subsys/logging/log_output.c:541
#4  0x080021e6 in log_output_msg2_process (output=0x8010f2c <log_output_uart>, msg=0x20000510 <buf32+768>, flags=15) at /home/hackerman/Zephyr/zephyrproject/zephyr/subsys/logging/log_output.c:610
#5  0x08002362 in process (backend=<optimized out>, msg=0x20000510 <buf32+768>) at /home/hackerman/Zephyr/zephyrproject/zephyr/subsys/logging/log_backend_uart.c:109
#6  0x08001d10 in log_backend_msg2_process (msg=0x20000510 <buf32+768>, backend=0x8010b38 <log_backend_uart>) at /home/hackerman/Zephyr/zephyrproject/zephyr/include/logging/log_backend.h:135
#7  msg_process (msg=..., bypass=bypass@entry=false) at /home/hackerman/Zephyr/zephyrproject/zephyr/subsys/logging/log_core.c:776
#8  0x08001d6a in z_impl_log_process (bypass=bypass@entry=false) at /home/hackerman/Zephyr/zephyrproject/zephyr/subsys/logging/log_core.c:857
#9  0x08001ea8 in log_process (bypass=false) at /home/hackerman/Zephyr/zephyrproject/zephyr/build/zephyr/include/generated/syscalls/log_ctrl.h:56
--Type <RET> for more, q to quit, c to continue without paging--
#10 log_process_thread_func (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>) at /home/hackerman/Zephyr/zephyrproject/zephyr/subsys/logging/log_core.c:1340
#11 0x0800a8b0 in z_thread_entry (entry=0x8001e95 <log_process_thread_func>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /home/hackerman/Zephyr/zephyrproject/zephyr/lib/os/thread_entry.c:36
#12 0x08010bb0 in __devicehdl_DT_N_S_soc_S_pin_controller_40020000_S_gpio_40020000 ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Info : Getting thread 536874568 reg list

Thread 3 (Thread 536874568):
#0  arch_swap (key=key@entry=0) at /home/hackerman/Zephyr/zephyrproject/zephyr/arch/arm/core/aarch32/swap.c:53
#1  0x08009874 in z_swap_irqlock (key=0) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/include/kswap.h:196
#2  z_swap (key=..., lock=0x200015ec <pending_cancels>) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/include/kswap.h:196
#3  z_pend_curr (lock=lock@entry=0x200015ec <pending_cancels>, key=..., wait_q=wait_q@entry=0x20000ef8 <k_sys_work_q+176>, timeout=timeout@entry=...) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/sched.c:815
#4  0x08009ca8 in z_sched_wait (lock=lock@entry=0x200015ec <pending_cancels>, key=..., key@entry=..., wait_q=wait_q@entry=0x20000ef8 <k_sys_work_q+176>, timeout=..., data=data@entry=0x0) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/sched.c:1795
#5  0x0800935a in work_queue_main (workq_ptr=0x20000e48 <k_sys_work_q>, p2=<optimized out>, p3=<optimized out>) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/work.c:641
#6  0x0800a8b0 in z_thread_entry (entry=0x8009325 <work_queue_main>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /home/hackerman/Zephyr/zephyrproject/zephyr/lib/os/thread_entry.c:36
#7  0x00000046 in ?? ()
#8  0x00000126 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Info : Getting thread 536872760 reg list

Thread 2 (Thread 536872760):
#0  arch_swap (key=key@entry=0) at /home/hackerman/Zephyr/zephyrproject/zephyr/arch/arm/core/aarch32/swap.c:53
#1  0x08009874 in z_swap_irqlock (key=0) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/include/kswap.h:196
#2  z_swap (key=..., lock=0x20000730 <rx_classes+16>) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/include/kswap.h:196
#3  z_pend_curr (lock=lock@entry=0x20000728 <rx_classes+8>, key=key@entry=..., wait_q=wait_q@entry=0x20000728 <rx_classes+8>, timeout=...) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/sched.c:815
#4  0x08009096 in z_impl_k_queue_get (queue=queue@entry=0x20000720 <rx_classes>, timeout=...) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/queue.c:343
#5  0x0800de7a in k_queue_get (timeout=..., queue=0x20000720 <rx_classes>) at /home/hackerman/Zephyr/zephyrproject/zephyr/build/zephyr/include/generated/syscalls/kernel.h:659
#6  tc_rx_handler (fifo=0x20000720 <rx_classes>) at /home/hackerman/Zephyr/zephyrproject/zephyr/subsys/net/ip/net_tc.c:245
#7  0x0800a8b0 in z_thread_entry (entry=0x800de63 <tc_rx_handler>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /home/hackerman/Zephyr/zephyrproject/zephyr/lib/os/thread_entry.c:36
#8  0x3b315b1a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 536874056):
#0  arch_system_halt (reason=reason@entry=0) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/fatal.c:32
#1  0x0800895a in k_sys_fatal_error_handler (reason=reason@entry=0, esf=esf@entry=0x20002d50 <z_interrupt_stacks+1896>) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/fatal.c:44
#2  0x080089d2 in z_fatal_error (reason=reason@entry=0, esf=esf@entry=0x20002d50 <z_interrupt_stacks+1896>) at /home/hackerman/Zephyr/zephyrproject/zephyr/kernel/fatal.c:127
#3  0x0800c296 in z_arm_fatal_error (reason=reason@entry=0, esf=esf@entry=0x20002d50 <z_interrupt_stacks+1896>) at /home/hackerman/Zephyr/zephyrproject/zephyr/arch/arm/core/aarch32/fatal.c:63
#4  0x0800396c in z_arm_fault (msp=536882664, psp=<optimized out>, exc_return=4294967293, callee_regs=0x20002db8 <z_interrupt_stacks+2000>) at /home/hackerman/Zephyr/zephyrproject/zephyr/arch/arm/core/aarch32/cortex_m/fault.c:1049
--Type <RET> for more, q to quit, c to continue without paging--
#5  0x080039a4 in z_arm_usage_fault () at /home/hackerman/Zephyr/zephyrproject/zephyr/arch/arm/core/aarch32/cortex_m/fault_s.S:102
#6  <signal handler called>
#7  0x735f646c in ?? ()
#8  0x635f6e6e in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Im at a point where im starting to think where this might not be a fault in my code anymore. Im running zephyr on a STM32l1.

The program shows the following output:

...
[00:00:17.484,000] <dbg> modem_ublox_sara_n310: on_cmd_uantr: UANTR:0, -1
[00:00:17.485,000] <dbg> modem_cmd_handler: cmd_handler_process_rx_buf: match cmd [OK] (len:2)
[00:00:17.486,000] <inf> main: Antenna load: -1
[00:00:17.512,000] <dbg> modem_cmd_handler: cmd_handler_process_rx_buf: match cmd [+CSQ: ] (len:10)
[00:00:17.512,000] <dbg> modem_ublox_sara_n310: on_cmd_csq: CSQ:30, 0
[00:00:17.513,000] <dbg> modem_cmd_handler: cmd_handler_process_rx_buf: match cmd [OK] (len:2)
[00:00:17.513,000] <inf> main: Signal quality: 30, 0
[00:00:17.545,000] <err> os: ***** USAGE FAULT *****
[00:00:17.545,000] <err> os:   Illegal load of EXC_RETURN into PC
[00:00:17.545,000] <err> os: r0/a1:  0x00000000  r1/a2:  0x0000ea60  r2/a3:  0x00000000
[00:00:17.545,000] <err> os: r3/a4:  0x080094b9 r12/ip:  0x00000000 r14/lr:  0x0801038b
[00:00:17.546,000] <err> os:  xpsr:  0xe000ec00
[00:00:17.546,000] <err> os: r4/v1:  0x00000000  r5/v2:  0x00000000  r6/v3:  0x00000000
[00:00:17.546,000] <err> os: r7/v4:  0x00000000  r8/v5:  0x00000000  r9/v6:  0x00000000
[00:00:17.546,000] <err> os: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x200020a0
[00:00:17.546,000] <err> os: EXC_RETURN: 0xfffffffd
[00:00:17.546,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:17.546,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:17.546,000] <err> os: Current thread: 0x20000c48 (unknown)
[00:00:17.701,000] <err> os: Halting system

the main.c:

/*
 * Copyright (c) 2021 Abel Sensors
 *
 * SPDX-License-Identifier: Apache-2.0
 */

#include <zephyr.h>
#include <net/socket.h>
#include <drivers/modem/ublox-sara-n310.h>

#define LOG_LEVEL CONFIG_LOG_DEFAULT_LEVEL
#include <logging/log.h>
LOG_MODULE_REGISTER(main);

//#define N310_DEVICE_LABEL DT_LABEL(DT_INST(0, ublox_sara_n310))
#define N310_DEVICE_LABEL "SARAN310"
#define MAX_BUF 512

/* PSM TIME SETTINGS */
#define PERIODIC_TAU "01011111"
#define ACTIVE_TIME "00000000"

void main(void)
{
    static const struct device *dev_ubloxn3;
    struct sockaddr_in addr;
    int n = 0, sockfd = 0;

    /* check if device exists in device tree */
    dev_ubloxn3 = device_get_binding(N310_DEVICE_LABEL);

    if (dev_ubloxn3 == NULL) {
        LOG_ERR("Failed to get device binding.");
        return;
    }

    LOG_INF("Manufacturer: %s", log_strdup(n310_get_manufacturer()));
    LOG_INF("Model: %s", log_strdup(n310_get_model()));
    LOG_INF("Revision: %s", log_strdup(n310_get_revision()));
    LOG_INF("ICCID: %s", log_strdup(n310_get_iccid()));
    LOG_INF("IMEI: %s", log_strdup(n310_get_imei()));
    LOG_INF("IP: %s", log_strdup(n310_get_ip()));
    LOG_INF("Network state: %d", n310_get_network_state());
    LOG_INF("Jamming state: %d", n310_get_jamming_state());
    LOG_INF("Antenna load: %d", n310_get_antenna_load());

    struct signal_quality buffer;
    n310_get_signal_quality(&buffer);
    LOG_INF("Signal quality: %d, %d", buffer.rssi, buffer.quality);

    // Sending an AT command from the main.c breaks the program
        // These functions all fail on the semaphore
    //n310_psm_set_csclk(2);
    //n310_psm_config(PSM_ENABLE, PERIODIC_TAU, ACTIVE_TIME);
    //n310_psm_set_mode(PM2);

    n310_set_last_gasp(16, "Hello there!", 17, "192.168.0.10:5050", 1, 1); // USAGE fault here
        ...

prj.conf:

### required for driver but selected by board/SoC already ###
CONFIG_GPIO=y
CONFIG_SERIAL=y

### required for driver but not selected by board/SoC ###
CONFIG_MODEM=y
CONFIG_MODEM_UBLOX_SARA_N310=y

CONFIG_NETWORKING=y
CONFIG_NET_SOCKETS=y
CONFIG_NET_SOCKETS_OFFLOAD=y # Must be enabled for socket offloading
CONFIG_NET_SOCKETS_POSIX_NAMES=y
CONFIG_NET_IPV4=y
CONFIG_NET_IPV6=n

### debugging ###
CONFIG_LOG=y
CONFIG_LOG_STRDUP_BUF_COUNT=6

### needed to suppress "sys_rand_get" compiling error ###
CONFIG_TEST_RANDOM_GENERATOR=y

###### DEBUGGING TOOLS ######
CONFIG_DEBUG_THREAD_INFO=y
CONFIG_THREAD_NAME=y
CONFIG_DEBUG=y
#CONFIG_ASSERT=y
CONFIG_LOG=y
CONFIG_LOG_DEFAULT_LEVEL=4
CONFIG_GPIO_LOG_LEVEL_DBG=y
CONFIG_I2C_LOG_LEVEL_DBG=y
CONFIG_SENSOR_LOG_LEVEL_DBG=y
CONFIG_MODEM_LOG_LEVEL_DBG=y

# Try this
CONFIG_MODEM_IFACE_UART=y
CONFIG_EXTRA_EXCEPTION_INFO=y

On request I will send the driver code (or else this post might be a little to lengthy (its 2000 LoC), to properly read). Zephyr version is 3.0.99.

carlescufi commented 2 years ago

Can you please double all stack sizes and try again? You can find all stacks by looking for _STACK_SIZE in build/zephyr/.config

alissa-maria commented 2 years ago

@carlescufi Thank you, this fixed our issue (am also part of the team). Is there a way to define the stack sizes in the prj.conf of a program?

soerlemans commented 2 years ago

After enabling some (KConfig) options for MPU protection. We quickly got the following output after testing again.

[00:00:16.167,000] <err> os: r0/a1:  0x00000002  r1/a2:  0x20000c68  r2/a3:  0xf0f0f0f0
[00:00:16.167,000] <err> os: r3/a4:  0x20001f48 r12/ip:  0x00000003 r14/lr:  0x08006239
[00:00:16.167,000] <err> os:  xpsr:  0x41000036
[00:00:16.167,000] <err> os: EXC_RETURN: 0x0
[00:00:16.167,000] <err> os: Faulting instruction address (r15/pc): 0x080136dc
[00:00:16.167,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:16.167,000] <err> os: Fault during interrupt handling

We are apparently getting a stack overflow somewhere. What would be the best way to deal with this stack overflow? We use a lot of strings for sending AT commands and those can get quite long. There is no way for us to easily optimize and save more memory.

Is just doubling the stack size an acceptable solution (as we are planning on merging the driver into Zephyr, so correctness matters).

soerlemans commented 2 years ago

We doubled the stack sizes in the prj.conf and accepting this as our solution as it overflows if we do not double the CONFIG_LOG_PROCESS_THREAD then we get a stack overflow. So its reasonable to double the stack sizes.