zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.36k stars 6.35k forks source link

Zephyr excecption/error when calling setjmp on esp32 #68524

Closed tenllado closed 3 months ago

tenllado commented 6 months ago

An excepction is triggered when setjmp is called in esp32. This issue was found with the esp_wrover_kit board. Short after the exception the system stops working (this is not noticeable in the minimal example shown below, but can be easily shown if you add some extra code after the setjmp).

A minimal example that shows the problem is the following:

src/main.c:

#include <setjmp.h>
#include <stdio.h>

int main(void)
{
    jmp_buf env;

    printf("setjmp test:\n");

    if (setjmp(env) == 0) {
        printf("long jump set\n");
        longjmp(env, 1);
    } else {
        printf("long done\n");
    }

    return 0;
}

prj.conf:

CONFIG_STDOUT_CONSOLE=y
CONFIG_PRINTK=y
CONFIG_INIT_STACKS=y
CONFIG_MAIN_STACK_SIZE=16384
CONFIG_LOG=y
CONFIG_LOG_MODE_MINIMAL=y

CMakeLists.txt:

cmake_minimum_required(VERSION 3.20.0)
set( CMAKE_EXPORT_COMPILE_COMMANDS ON )
find_package(Zephyr REQUIRED HINTS $ENV{ZEPHYR_BASE})
project(setjmptest)
target_sources(app PRIVATE src/main.c)

Expected behavior: no exception/error when calling setjmp. The same test was performed on an arm based board (disco_l475_iot1) and no exception was observed. The same behavior is expected on the esp32 board.

Impact: showstopper. My project relies on using Lua, which relies on setjmp/longjmp to implement exceptions. Not supporting setjmp on esp32 discards all boards with that soc for the project.

Logs and console output These where obtained for the minimal example described above. I compiled the code with the zephyr-sdk-0.16.3 toolchain and created a workspace for the test using main branch for zephyr. The output corresponds to the command: west espressif monitor, that was run from a Linux host, and the board was a esp_wrover_kit.

Serial port /dev/ttyUSB1 Connecting..... Detecting chip type... Unsupported detection protocol, switching and trying again... Connecting.... Detecting chip type... ESP32 --- idf_monitor on /dev/ttyUSB1 115200 --- --- Quit: Ctrl+] | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H --- ets Jun 8 2016 00:22:57 rst:0x1 (POWERON_RESET),boot:0x1e (SPI_FAST_FLASH_BOOT) configsip: 0, SPIWP:0xee clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00 mode:DIO, clock div:2 load:0x3fff0030,len:6604 ho 0 tail 12 room 4 load:0x40078000,len:15224 ho 0 tail 12 room 4 load:0x40080400,len:3992 entry 0x40080634 0x40080634: __esp_platform_start at WORKSPACE_PATH/zephyr/soc/xtensa/espressif_esp32/esp32/soc.c:109

I (30) boot: ESP-IDF 67fa60bdff 2nd stage bootloader I (30) boot: compile time 22:14:02 I (30) boot: chip revision: 1 I (33) boot_comm: chip revision: 1, min. bootloader chip revision: 0 I (40) boot.esp32: SPI Speed : 40MHz I (45) boot.esp32: SPI Mode : DIO I (50) boot.esp32: SPI Flash Size : 4MB I (54) boot: Enabling RNG early entropy source... I (60) boot: Partition Table: I (63) boot: ## Label Usage Type ST Offset Length I (70) boot: 0 nvs WiFi data 01 02 00009000 00006000 I (78) boot: 1 phy_init RF data 01 01 0000f000 00001000 I (85) boot: 2 factory factory app 00 00 00010000 00100000 I (93) boot: End of partition table I (97) boot_comm: chip revision: 1, min. application chip revision: 0 I (104) esp_image: segment 0: paddr=00010020 vaddr=00000020 size=0001ch ( 28) I (113) esp_image: segment 1: paddr=00010044 vaddr=3ffb0000 size=006a8h ( 1704) load I (122) esp_image: segment 2: paddr=000106f4 vaddr=3ffb06a8 size=00154h ( 340) load I (130) esp_image: segment 3: paddr=00010850 vaddr=40080000 size=03e50h ( 15952) load I (145) esp_image: segment 4: paddr=000146a8 vaddr=00000000 size=0b990h ( 47504) I (163) esp_image: segment 5: paddr=00020040 vaddr=3f400040 size=0158ch ( 5516) map I (166) esp_image: segment 6: paddr=000215d4 vaddr=00000000 size=0ea44h ( 59972) I (191) esp_image: segment 7: paddr=00030020 vaddr=400d0020 size=039dch ( 14812) map I (198) boot: Loaded app from partition at offset 0x10000 I (198) boot: Disabling RNG early entropy source... Booting Zephyr OS build https://github.com/zephyrproject-rtos/zephyr/commit/60c58fe91801e69aa1d443b5196b934596f6f3fd setjmp test: E: SYSCALL PS 0x60020 PC 0x4005626f E: A0 0x800d03c9 SP 0x3ffe9d80 A2 0 A3 0x3f400d5c E: A4 0x3ffe9d90 A5 0xffffffff A6 0 A7 0xff0fffff E: A8 0x8008237c A9 0x3ffe9d50 A10 0 A11 0x3ffb0008 E: A12 0x3ffb1a24 A13 0x4 A14 0x60020 A15 0x14 E: LBEG 0x4000c46c LEND 0x4000c477 LCOUNT 0 E: ** SAR 0x1b long jump set long done

rftafas commented 6 months ago

Repeating the comment @ https://github.com/zephyrproject-rtos/zephyr/discussions/68478

Chip Revision 1 is not supported. Although this may or may not be related to that. Please, retry it on REV3 or other Xtensa device to check if it still happens.

To further inform: we are removing IDFboot in the upcoming days and new ways to start ESP32 won't allow for REV1.

tenllado commented 6 months ago

I tested it on a new version of the wrover_kit that ships with the REV3 and the problem persists:

$ west espressif monitor Serial port /dev/ttyUSB1 Connecting.... Detecting chip type... Unsupported detection protocol, switching and trying again... Connecting.... Detecting chip type... ESP32 --- idf_monitor on /dev/ttyUSB1 115200 --- --- Quit: Ctrl+] | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H --- ets Jul 29 2019 12:21:46 rst:0x1 (POWERON_RESET),boot:0x1e (SPI_FAST_FLASH_BOOT) configsip: 0, SPIWP:0xee clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00 mode:DIO, clock div:2 load:0x3fff0030,len:6604 ho 0 tail 12 room 4 load:0x40078000,len:15224 ho 0 tail 12 room 4 load:0x40080400,len:3992 entry 0x40080634 I (31) boot: ESP-IDF 67fa60bdff 2nd stage bootloader I (31) boot: compile time 22:14:02 I (31) boot: chip revision: 3 I (34) boot_comm: chip revision: 3, min. bootloader chip revision: 0 I (41) boot.esp32: SPI Speed : 40MHz I (46) boot.esp32: SPI Mode : DIO I (50) boot.esp32: SPI Flash Size : 4MB I (55) boot: Enabling RNG early entropy source... I (60) boot: Partition Table: I (64) boot: ## Label Usage Type ST Offset Length I (71) boot: 0 nvs WiFi data 01 02 00009000 00006000 I (79) boot: 1 phy_init RF data 01 01 0000f000 00001000 I (86) boot: 2 factory factory app 00 00 00010000 00100000 I (94) boot: End of partition table I (98) boot_comm: chip revision: 3, min. application chip revision: 0 I (105) esp_image: segment 0: paddr=00010020 vaddr=00000020 size=0001ch ( 28) I (113) esp_image: segment 1: paddr=00010044 vaddr=3ffb0000 size=00704h ( 1796) load I (122) esp_image: segment 2: paddr=00010750 vaddr=3ffb0704 size=001b4h ( 436) load I (130) esp_image: segment 3: paddr=0001090c vaddr=40080000 size=04460h ( 17504) load I (146) esp_image: segment 4: paddr=00014d74 vaddr=00000000 size=0b2c4h ( 45764) I (163) esp_image: segment 5: paddr=00020040 vaddr=3f400040 size=013c4h ( 5060) map I (166) esp_image: segment 6: paddr=0002140c vaddr=00000000 size=0ec0ch ( 60428) I (191) esp_image: segment 7: paddr=00030020 vaddr=400d0020 size=04554h ( 17748) map I (200) boot: Loaded app from partition at offset 0x10000 I (200) boot: Disabling RNG early entropy source... I (235) psram: This chip is ESP32-D0WD I (235) spiram: Found 64MBit SPI RAM device I (235) spiram: SPI RAM mode: flash 80m sram 40m 2� Booting Zephyr OS build https://github.com/zephyrproject-rtos/zephyr/commit/60c58fe91801e69aa1d443b5196b934596f6f3fd normal (1-core) mode. setjmp test: E: SYSCALL PS 0x60d20 PC 0x4005626f E: A0 0x800d0491 SP 0x3ffe9d60 A2 0 A3 0 E: ** A4 0x3ffe9d70 A5 0x40083a40 A6 0x60120 A7 0x14 0x40083a40: picolibc_put at WORKSPACE_PATH/zephyr/lib/libc/picolibc/libc-hooks.c:46

E: A8 0x80083aa9 A9 0x3ffe9d30 A10 0 A11 0x3ffb002c E: A12 0x25 A13 0x3ffe9da0 A14 0x4 A15 0x1 E: LBEG 0x4000c46c LEND 0x4000c477 LCOUNT 0 E: SAR 0x1b long jump set long done

rftafas commented 6 months ago

Ok, thanks for the update. It seems related to similar problem on ESP-IDF. From this link:

The solution is to disable interrupts during this code. It is only 6 instructions long, the impact shouldn't be significant.

Either you will need to wait for a bugfix or get hands on using ESP-IDF implementation as base for a fix. First thing is checking if the interrupts are properly disabled.

tenllado commented 6 months ago

Thank you @rftafas, I could try to work on this but I don't know where the implementation of that function is defined. I cannot find it in the zephyr source code. Is it inside the xtensa toolchain? I would need some aditional information to be of any help for you. At least something that helps me get started, where do I have to touch, or at least where to find that information.

rftafas commented 6 months ago

Well, I'm investigating and it seems it has to be included for Xtensa. Maybe others can illuminate you better. I assume the work would be similar to what the NuttX fellows did:

-Create a config option to enable that for ESP32 (or Xtensas in general) -Once enabled, define those functions. -Implement the solution

BTW, some I did googling around showed that this is/was a problem on other architectures as well. Maybe different problems, but around the same functions. There might be some more info looking into those.

tenllado commented 6 months ago

I have been investigating this issue, I want to share my findings, thoughts and questions with you to see if somebody with more knowledge than me on these topic can help to clarify things.

Both setjmp and longjmp seem to be provided by the toolchain. I found very similar implementations for picolib (picolib-setjmp.S) and for newlib (newlib-setgmp.S).

As you may see the implementation of the setjmp in picolib performs a syscall at the begining (SYS_nop is 0):

entry   sp, 16

    /* Flush registers.  */
    mov a4, a2          // save a2 (jmp_buf)
    movi    a2, SYS_nop
    syscall
    mov a2, a4          // restore a2
        ...

while the implementation of newlib calls a function:

setjmp:
    leaf_entry sp

    /* Flush registers.  */
    call4   __window_spill
        ...

The rest is the same.

I tested it with picolib first, the handling for the syscall exception ends with the execution of the function xtensa_excint1_c (in arch/xtensa/core/vector_handlers.c), which checks the cause of the exception and in case of a syscall logs an error message and dumps the stack of the function that performed the syscall:

void *xtensa_excint1_c(int *interrupted_stack)
{
    int cause;
    _xtensa_irq_bsa_t *bsa = (void *)*(int **)interrupted_stack;
...
    void *pc, *print_stack = (void *)interrupted_stack;
    uint32_t depc = 0;

    __asm__ volatile("rsr.exccause %0" : "=r"(cause));
...
    switch (cause) {
...
    case EXCCAUSE_SYSCALL:
        /* Just report it to the console for now */
        LOG_ERR(" ** SYSCALL PS %p PC %p",
            (void *)bsa->ps, (void *)bsa->pc);
        xtensa_dump_stack(interrupted_stack);

        /* Xtensa exceptions don't automatically advance PC,
         * have to skip the SYSCALL instruction manually or
         * else it will just loop forever
         */
        bsa->pc += 3;
        break;
    }
...
#endif /* CONFIG_XTENSA_MMU */

    if (is_dblexc || is_fatal_error) {
        uint32_t ignore;
...
    }

...
    return return_to(interrupted_stack);
}

The varibales is_fatal_error and is_dblexc are both false.

I do not understand if it is just a log message performed for each syscall or if there was some kind of error. In case of the latter, how can I check what went wrong? (Any clue appreciated). If nothing went wrong, what is the purpose of the syscall?

tenllado commented 6 months ago

Regarding the longjmp, the code seems to be compatible with the solution proposed for esp-idf, in which a wrap function is used to replace the begining of the longjmp just to disable exceptions while invalidating the windows on the AR. I tested this with my small example described at the begining of the issue by doing the following changes:

CMakeLists.txt:

cmake_minimum_required(VERSION 3.20.0)
set( CMAKE_EXPORT_COMPILE_COMMANDS ON )
find_package(Zephyr REQUIRED HINTS $ENV{ZEPHYR_BASE})
project(picotest)
target_sources(app PRIVATE src/main.c src/longjmp_patch.S)
zephyr_link_libraries(gcc "-Wl,--wrap=longjmp")

src/longjmp_patch.S:

/*
    Copyright (c) 2001-2006 by Tensilica Inc.

    Permission is hereby granted, free of charge, to any person obtaining
    a copy of this software and associated documentation files (the
    "Software"), to deal in the Software without restriction, including
    without limitation the rights to use, copy, modify, merge, publish,
    distribute, sublicense, and/or sell copies of the Software, and to
    permit persons to whom the Software is furnished to do so, subject to
    the following conditions:

    The above copyright notice and this permission notice shall be included
    in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
    EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
    MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
    IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
    CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
    TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
    SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

/*
  This file contains a modified version of the original Xtensa longjmp implementation.
  In this modified version, setting WINDOWSTART = 1 << WINDOWBASE is done inside a critical section.
  This is necessary because after a FreeRTOS context switch in IDF, the values of WINDOWBASE and WINDOWSTART
  are not guaranteed to be the same as before the context switch.
*/

#include <xtensa/corebits.h>

/*
    Replacement of the first instructions of void longjmp (jmp_buf env, int val)
*/

    .align  4
    .literal_position
    .global __wrap_longjmp
    .type   __wrap_longjmp, @function
__wrap_longjmp:
    entry   sp, 16

    /* Deactivate interrupts in order to modify WINDOWBASE and WINDOWSTART. */
    rsr     a7, PS                     /* to be restored after SPILL_ALL_WINDOWS */
    movi    a5, PS_EXCM                /* PS_INTLEVEL_MASK */
    or      a5, a7, a5                 /* get the current INTLEVEL */
    wsr     a5, PS

    /* Invalidate all but the current window;
       set WindowStart to (1 << WindowBase).  */
    rsr a5, WINDOWBASE
    movi    a4, 1
    ssl a5
    sll a4, a4
    wsr a4, WINDOWSTART
    rsync

    /* Activate interrupts again after modifying WINDOWBASE and WINDOWSTART. */
    wsr     a7, PS

    /* Jump back to original longjmp implementation.
        The jump target is the instrucion
            l32i    a0, a2, 64
        of the original code. Hence, the original code's entry instruction and windowstart modification are left
        out.
     */
    movi a0, __real_longjmp + 20
    jx a0

    .size   __wrap_l/*
    Copyright (c) 2001-2006 by Tensilica Inc.

    Permission is hereby granted, free of charge, to any person obtaining
    a copy of this software and associated documentation files (the
    "Software"), to deal in the Software without restriction, including
    without limitation the rights to use, copy, modify, merge, publish,
    distribute, sublicense, and/or sell copies of the Software, and to
    permit persons to whom the Software is furnished to do so, subject to
    the following conditions:

    The above copyright notice and this permission notice shall be included
    in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
    EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
    MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
    IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
    CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
    TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
    SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

/*
  This file contains a modified version of the original Xtensa longjmp implementation.
  In this modified version, setting WINDOWSTART = 1 << WINDOWBASE is done inside a critical section.
  This is necessary because after a FreeRTOS context switch in IDF, the values of WINDOWBASE and WINDOWSTART
  are not guaranteed to be the same as before the context switch.
*/

#include <xtensa/corebits.h>

/*
    Replacement of the first instructions of void longjmp (jmp_buf env, int val)
*/

    .align  4
    .literal_position
    .global __wrap_longjmp
    .type   __wrap_longjmp, @function
__wrap_longjmp:
    entry   sp, 16

    /* Deactivate interrupts in order to modify WINDOWBASE and WINDOWSTART. */
    rsr     a7, PS                     /* to be restored after SPILL_ALL_WINDOWS */
    movi    a5, PS_EXCM                /* PS_INTLEVEL_MASK */
    or      a5, a7, a5                 /* get the current INTLEVEL */
    wsr     a5, PS

    /* Invalidate all but the current window;
       set WindowStart to (1 << WindowBase).  */
    rsr a5, WINDOWBASE
    movi    a4, 1
    ssl a5
    sll a4, a4
    wsr a4, WINDOWSTART
    rsync

    /* Activate interrupts again after modifying WINDOWBASE and WINDOWSTART. */
    wsr     a7, PS

    /* Jump back to original longjmp implementation.
        The jump target is the instrucion
            l32i    a0, a2, 64
        of the original code. Hence, the original code's entry instruction and windowstart modification are left
        out.
     */
    movi a0, __real_longjmp + 20
    jx a0

    .size   __wrap_longjmp, . - __wrap_longjmpongjmp, . - __wrap_longjmp

The error log_error message of the setjmp is still printed on the output (as was expected), the longjmp works as before, but in this case the exceptions where blocked. A better test might be required to see if there is any problem with this solution, probably similar to the one proposed for the test in Nuttx (test application).

If this solution is valid, I asume that it should be posted as an issue on the github of the picolibc project, as a patch for the file setjmp.S, right? How long would it take to have that patch included in the zephyr toolchain?

tenllado commented 6 months ago

Finally, regarding newlib. I did the same test replacing picolib with newlib, but I had to comment the printfs in my example because calling printf generates an exception (should this be probably a new issue?). Regarding setjmp and longjmp, the code seems to have linked using the picolib code, because I see the same code as before, and of course the same things happen. I only replaced the CONFIG_PICOLIBC=y by CONFIG_NEWLIB_LIBC=y and commented out the line CONFIG_PICOLIBC_USE_MODULE=y. Do I have to do something else to make sure that the setjmp and longjmp implementations from newlib are used?

sylvioalves commented 6 months ago

@tenllado, It sounds it is working as expected. Anyway, I have added proper longjmp patched version into hal_espressif: https://github.com/zephyrproject-rtos/hal_espressif/pull/270

Let me check about NEWLIB as well.. EDIT1: NEWLIBC and printf is failing. It is not related to setjmp. EDIT2: Fixed, will submit soon.

tenllado commented 6 months ago

@sylvioalves Very interesting, thank you! I am interested in testing it, but not really sure on how to use your bugfix.

Meanwhile I have been trying tu further test this on the wrover_kit, and found another problematic combination. I post here the code example:

#include <setjmp.h>
#include <stdio.h>
#include <zephyr/shell/shell.h>

void setjmp_test(void)
{
    jmp_buf env;

    if (setjmp(env) == 0) {
    //  printf("long jump set\n");
        longjmp(env, 1);
    } else {
    //  printf("long done\n");
    }
}

static int cmd_hello(const struct shell *shell, size_t argc, char **argv)
{
    if (argc < 2) {
        shell_fprintf(shell, SHELL_NORMAL, "usage %s name\n", argv[0]);
        return -1;
    }

    shell_fprintf(shell, SHELL_NORMAL, "hello %s\n", argv[1]);
    return 0;
}

int main(void)
{
    setjmp_test();
    return 0;
}

SHELL_CMD_REGISTER(hello, NULL, "greetings command", cmd_hello);

with prj.conf:

CONFIG_STDOUT_CONSOLE=y
CONFIG_PRINTK=y
CONFIG_INIT_STACKS=y
CONFIG_MAIN_STACK_SIZE=16384
# CONFIG_LOG=y
# CONFIG_LOG_MODE_MINIMAL=y

CONFIG_POSIX_API=y
CONFIG_PICOLIBC=y
CONFIG_PICOLIBC_USE_MODULE=y

CONFIG_DEBUG=y
CONFIG_NO_OPTIMIZATIONS=y
CONFIG_DEBUG_OPTIMIZATIONS=y

CONFIG_SHELL=y
CONFIG_SHELL_BACKENDS=y
CONFIG_SHELL_STACK_SIZE=4096

CMakeLists.txt:

cmake_minimum_required(VERSION 3.20.0)

set( CMAKE_EXPORT_COMPILE_COMMANDS ON )

find_package(Zephyr REQUIRED HINTS $ENV{ZEPHYR_BASE})

project(picotest)

if (CONFIG_PICOLIBC)
    target_sources(app PRIVATE src/main.c src/longjmp_patch.S)
    zephyr_link_libraries(gcc "-Wl,--wrap=longjmp")
else()
    target_sources(app PRIVATE src/main.c)
endif()

In this example the shell command is not working, neither with the regular nor the wrap version of the longjmp. I debbuged it and gets hanged in the xtensa_asm2_util.S file, in line 480, signaling an unhandled double expression: 1: break 1, 4

Not sure why for the moment.

sylvioalves commented 6 months ago

I guess it is not working because the below is commented?

# CONFIG_LOG=y
# CONFIG_LOG_MODE_MINIMAL=y
tenllado commented 6 months ago

If you uncomment the CONFIG_LOG lines in the prj.conf you get the error log from setjmp, but the program does not hang, and the shell command works.

I do not understand why with the LOG service affect the program behavior this way. It is very strange, isn't it?

dcpleung commented 6 months ago

This needs to be filed as a bug in picolibc. SYS_nop is not a syscall number. It is for simulator as it is defined in simcall.h in the Xtensa HAL.

andyross commented 6 months ago

I found my way here from the "squash the exception logging" patch in #69096 too. So... this is actually deep. Bear with me:

Actually the SYS_nop is a syscall number. It's a linux syscall number defined for a particular Xtensa subdevice: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/xtensa/platforms/iss/include/platform/simcall-iss.h?h=v6.8-rc5#n11

Picolibc and newlib both just picked this up from the original source, surely, which had been validated on that device. It means nothing here in Zephyr, and the only reason it ever built is that the symbol name collides with the enumerant Daniel is pointing at in the simcall simulation layer.

So what it is trying to do making a noop syscall? It's clear from the comments that it's trying to get the windows registers spilled to the stack for the benefit of the setjmp context, so that it can be pickled in a known state before the hand-coded context switch in longjmp(). Presumably, and unsurprisingly, in Linux making a system call has the effect of spilling all your register windows.

But that isn't even true with Zephyr! We have this crazy cross-stack call gadget that allows shallow ISRs and system calls to re-use the windows register state of the calling user process, or failing that to spill them naturally as part of the call tree of the called code. So this wouldn't work even if it was doing what it expected to do, which it doesn't.

But there's actually code in the kernel to handle this. There's a hook to the existing window spill code available as a xtensa_spill_reg_windows symbol, which must be called with CALL0 (because you can't muck with window state by making windowed calls). It doesn't require any scratch registers beyond the fact that the call clobbers A0 with the return address, so reading above, it's probably OK to replace the syscall sequence with:

mov a2, a0
call0 xtensa_spill_reg_windows
mov a0, a2

Maybe. Untested, and even writing such a test would be pretty hairy. As I understand it, we don't even have a failing case (i.e. the current syscall handler ends up spilling enough of the user process anyway)?

tenllado commented 6 months ago

@andyross, I do not have a specifi test case, but have a lua interpreter that fails on the xtensa boards. Probably a test for this should be independent of the lua library, something similar to what they used in Nutx?

andyross commented 6 months ago

Yeah, the requirement would be arch-specific validation that sets known register state, does a synchronous setjmp without any intervening function calls (to avoid spilling windows), then longjmps back in and validates that everything worked. Not awful but xtensa-specific and nontrivial. But the fix really shouldn't be any harder than those three instructions.

Basically this bug should be renamed to "{new,pico}libc setjmp/longjmp doesn't work on xtensa"

sylvioalves commented 6 months ago

It also happens when minimal libc is used btw.

dcpleung commented 6 months ago

??? Minimal libc does not provide setjmp AFAICT.

sylvioalves commented 6 months ago

@dcpleung, you are correct. It worked in here because it was using rom setjmp call instead.

dcpleung commented 6 months ago

Does the ROM one uses syscall also?

sylvioalves commented 6 months ago

Does the ROM one uses syscall also?

Yes, it does.

sylvioalves commented 6 months ago

@andyross, I read exc-syscall-handler.S after your comment..

 * The SYSCALL instruction is typically used to implement system calls.
 * By convention, register a2 identifies the requested system call.
 * Typically, other parameters are passed in registers a3 and up,
 * and results are returned in a2.
 *
 * The Xtensa windowed ABI reserves the value zero of register a2
 * as a request to force register windows to the stack.  The call0 ABI,
 * which has no equivalent operation, reserves this value as a no-op.
 *
 * Generally, only code that traverses the stack in unusual ways needs
 * to force (spill) register windows to the stack.  In generic C or C++,
 * there are four cases, and they all use the standard SYSCALL mechanism:
 *
 * 1. C++ exceptions
 * 2. setjmp and longjmp
 * 3. functions using the GNU extension "__builtin_return_address"
 * 4. functions using the GNU extension "nonlocal goto"
andyross commented 6 months ago

@sylvioalves: Right, but that's describing a Linux system call behavior (or arguably a Cadence XTOS thing, I guess). We don't work like that. FWIW: note that you don't need to do any OS trampolining anyway (one of my big issues with the HAL is how wasteful it is). Just making a series of nested function calls and touching A15 in each is probably faster and more portable. Or you can to an even smaller list of ROTW instructions, which is what the SPILL_ALL_WINDOWS code does I mentioned earlier, see notes at https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/xtensa/include/xtensa_asm2_s.h#L19

github-actions[bot] commented 4 months ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

luispimo commented 1 month ago

Any news about this issue? It's a pity that setjmp doesn't work on this type of architecture.