zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.93k stars 6.65k forks source link

shell_print crashes the board on `bt bonds` and `bt connections` #41796

Open MrOneTwo opened 2 years ago

MrOneTwo commented 2 years ago

My project is mostly (using the Just Works pairing method by not running bt_conn_auth_cb_register) the same as peripheral_hids. The problem is that:

Then I run following commands in the RTT shell:

rtt:~$ bt bonds
bt bonds
Bonded devices:
Total 0
rtt:~$ bt connections 
bt connections
Connected devices:
[00:00:19.206,909] <err> os: ***** USAGE FAULT *****
[00:00:19.206,909] <err> os:   Unaligned memory access
[00:00:19.206,909] <err> os: r0/a1:  0x00028d8f  r1/a2:  0x00001000  r2/a3:  0x00032f6f
[00:00:19.206,939] <err> os: r3/a4:  0x00028d8f r12/ip:  0x0000000a r14/lr:  0x0002626f
[00:00:19.206,939] <err> os:  xpsr:  0x21000000
[00:00:19.206,939] <err> os: Faulting instruction address (r15/pc): 0x00026076
[00:00:19.206,939] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:19.206,970] <err> os: Current thread: 0x200010a0 (shell_rtt)
[00:00:19.531,158] <err> os: Halting system

Running addr2line -e zephyr/zephyr.elf 0x00026076 puts me here /home/mc/gits/magknob-zephyr-ble/zephyr/include/sys/atomic_builtin.h:243

Assume I also paired the board with an iPhone. Then running bt bonds crashes the board also.

rtt:~$ Connected 6C:6D:7D:A0:83:C4 (random)
Security changed: 6C:6D:7D:A0:83:C4 (random) level 2
bt bonds
bt bonds
Bonded devices:
[00:00:54.425,994] <err> os: ***** USAGE FAULT *****
[00:00:54.425,994] <err> os:   Unaligned memory access
[00:00:54.425,994] <err> os: r0/a1:  0x00028d8f  r1/a2:  0x00001000  r2/a3:  0x00032f3f
[00:00:54.425,994] <err> os: r3/a4:  0x00028d8f r12/ip:  0x0000000a r14/lr:  0x0002626f
[00:00:54.426,025] <err> os:  xpsr:  0x21000000
[00:00:54.426,025] <err> os: Faulting instruction address (r15/pc): 0x00026076
[00:00:54.426,025] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:54.426,025] <err> os: Current thread: 0x200010a0 (shell_rtt)
[00:00:54.747,894] <err> os: Halting system

The address points me to /home/mc/gits/magknob-zephyr-ble/zephyr/include/sys/atomic_builtin.h:243 again.

The crash doesn't happen if I comment out the shell_print in this function (which gets run by cmd_bonds):

static void bond_info(const struct bt_bond_info *info, void *user_data)
{
    char addr[BT_ADDR_LE_STR_LEN];
    int *bond_count = user_data;

    bt_addr_le_to_str(&info->addr, addr, sizeof(addr));
    shell_print(ctx_shell, "Remote Identity: %s", addr);  // < if I comment this out, it works
    (*bond_count)++;
}

In general I can pair with iPhone for the first time (when prompted for pairing) but after disconnecting and trying to reconnect I get:

rtt:~$ Connected 78:7F:AE:13:34:85 (random)
Security failed: 78:7F:AE:13:34:85 (random) level 1 err 2
Disconnected from 78:7F:AE:13:34:85 (random) (reason 0x13)

That's for Just Works. When using setup reconnecting works fine. I need to pair without any pin though. This might be a separate issue but maybe connected...?

MrOneTwo commented 2 years ago

Trying to debug this.

I have added CONFIG_SHELL_BACKEND_SERIAL=n since I don't need it.

I see that the problem stems from the subsys/shell/shell.c:shel_vfprintf doing

void shell_vfprintf(const struct shell *sh, enum shell_vt100_color color,
           const char *fmt, va_list args)
{
    __ASSERT_NO_MSG(sh);
    __ASSERT(!k_is_in_isr(), "Thread context required.");
    __ASSERT_NO_MSG(sh->ctx);
    __ASSERT_NO_MSG(z_flag_cmd_ctx_get(sh) ||
            (k_current_get() != sh->ctx->tid));
    __ASSERT_NO_MSG(sh->fprintf_ctx);
    __ASSERT_NO_MSG(fmt);

    /* Sending a message to a non-active shell leads to a dead lock. */
    if (state_get(sh) != SHELL_STATE_ACTIVE) {
        z_flag_print_noinit_set(sh, true);
        return;
    }
   ...
}

The sh->ctx->state has value of 255 and not the SHELL_STATE_ACTIVE (2). That means it execute z_flag_print_noinit_set(sh, true) to set the sh->ctx->print_noinit flag. It ends in atomic_or. The argument in atomic_or, atomic_t *target has address of 0x28a85. The ldrex op tries to read from that address into r0 register. That's where my firmware crashes!

0x25d66 <atomic_or>     mov     r3, r0                                                                                                                                                    
0x25d68 <atomic_or+2>   dmb     ish                                                                                                                                                       
0x25d6c <atomic_or+6>   ldrex   r0, [r3]          .................. CRASH!                                                                                                             
0x25d70 <atomic_or+10>  orr.w   r2, r0, r1                                                                                                                                                
0x25d74 <atomic_or+14>  strex   r12, r2, [r3]                                                                                                                                             
0x25d78 <atomic_or+18>  cmp.w   r12, #0                                                                                                                                                   
0x25d7c <atomic_or+22>  bne.n   0x25d6c <atomic_or+6>                                                                                                                                     
0x25d7e <atomic_or+24>  dmb     ish                                                                                                                                                       
0x25d82 <atomic_or+28>  bx      lr                                                                                                                                                        
0x25d84 <atomic_and>    mov     r3, r0                                                                                                                                                    
0x25d86 <atomic_and+2>  dmb     ish                                                                                                                                                       
0x25d8a <atomic_and+6>  ldrex   r0, [r3]                                                                                                                                                  
0x25d8e <atomic_and+10> and.w   r2, r0, r1                                                                                                                                                
0x25d92 <atomic_and+14> strex   r12, r2, [r3]                                                                                                                                             
0x25d96 <atomic_and+18> cmp.w   r12, #0                                                                                                                                                   
0x25d9a <atomic_and+22> bne.n   0x25d8a <atomic_and+6>
MrOneTwo commented 2 years ago

In cmd_bonds printing to shell is done with sh pointer which points to shell_rtt (and this printing works) but in the bond_info the shell is:

(gdb) p shell
$24 = (const struct shell *) 0x0 <_vector_table>

???

MrOneTwo commented 2 years ago

ctx_shell in shell/bt.c is invalid because the cmd_init never runs! Running bt init manually actually runs that function which binds ctx_shell.

MrOneTwo commented 2 years ago

I've created a PR with a simple solution https://github.com/zephyrproject-rtos/zephyr/pull/41844

jhedberg commented 2 years ago

I downgraded this to an Enhancelent/Feature request, since the current Bluetooth shell has been intentionally designed with the assumption that the Bluetooth stack is always initialised through the shell. Please also see my comments in the linked PR.

arbrauns commented 1 year ago

I also just hit this (again).

the current Bluetooth shell has been intentionally designed with the assumption that the Bluetooth stack is always initialised through the shell

Not disagreeing that that's how it is, but it's a really weird design, right? Shell applications are first and foremost debugging tools, if I'm debugging some bluetooth issue I'm not gonna start rewriting my application to no longer initialize the bluetooth stack by itself. bt is the only shell command I'm aware of that works like this.

Thalley commented 4 months ago

I downgraded this to an Enhancelent/Feature request, since the current Bluetooth shell has been intentionally designed with the assumption that the Bluetooth stack is always initialised through the shell. Please also see my comments in the linked PR.

Just had a quick look at this.

Does it even make sense that bt_foreach_bond calls the callback before BT has been enabled?

The main issue here (in the shell) is that callbacks always assume that ctx_shell has been, which clearly is not the case.

Rather than trying to fix all the different places where ctx_shell can be incorrectly used, we should really just get rid of it as requested by https://github.com/zephyrproject-rtos/zephyr/issues/70945