Crash in OSIntExit - Githubissues

OmicronB commented 3 years ago

I see a crash in OSIntExit on line 372 in os_core.c:

OSTCBHighRdyPtr->CtxSwCtr++;

The reason being that OSTCBHighRdyPtr is 0. It seems this happens when an interrupt occurs while the ready list is being manipulated. For example I had this stack:

-001|OSIntExit() | cpu_sr = 64 -002|BSP_IntHandler(:int_id = 38) | int_id = 38 | cpu_sr = 64 -003|BSP_IntHandlerUSART2() -->|exception -004|OS_PrioRemove(prio = ?) | prio = ? -005|OS_RdyListRemove(p_tcb = 0x20007CA8) | p_tcb = 0x20007CA8 -006|OSTimeDly(:dly = 100, :opt = 0, :p_err = 0x20008B2E) | dly = 100 | opt = 0 | p_err = 0x20008B2E | cpu_sr = 0 -007|OS_StatTask(p_arg = ?) | p_arg = ? | err = OS_ERR_NONE | dly = 100 -008|OS_TaskReturn() | err = 2056 ---|end of frame

The variable OSPrioHighRdy contained 5 in my case, the contents of the ready list was this:

OSRdyList = ( (HeadPtr = 0x0, TailPtr = 0x0, NbrEntries = 0), (HeadPtr = 0x0, TailPtr = 0x0, NbrEntries = 0), (HeadPtr = 0x0, TailPtr = 0x0, NbrEntries = 0), (HeadPtr = 0x0, TailPtr = 0x0, NbrEntries = 0), (HeadPtr = 0x0, TailPtr = 0x0, NbrEntries = 0), (HeadPtr = 0x0, TailPtr = 0x0, NbrEntries = 0), (HeadPtr = 0x0, TailPtr = 0x0, NbrEntries = 0), (HeadPtr = 0x2000B53C, TailPtr = 0x2000B53C, NbrEntries = 1), (HeadPtr = 0x0, TailPtr = 0x0, NbrEntries = 0), (HeadPtr = 0x20007AF8, TailPtr = 0x20007AF8, NbrEntries = 1))

Entry 5 is all zeros. I'm guessing that it had just removed this entry when the interrupt occurred?

mtm0183 commented 3 years ago

Any chance the USART2 interrupt is configured as a non-kernel aware interrupt? The call to OS_RdyListRemove() in os_time.c is called inside a CPU_CRITICAL_ENTER/EXIT block which means the USART2 interrupt shouldn't have occurred until after CPU_CRITICAL_EXIT(). Unless the CPU critical macros are configured incorrectly.

OmicronB commented 3 years ago

It's possible. I'm new to ucos. I used the BSP for the STM32F105 that came with the TCP/IP book for STM32. I had to tweak it here and there to make it work with the current code. I didn't change anything interrupt related though. I am also seeing this crash with the ethernet rx interrupt, so it's not specific to the UART. Any pointers where I need to look to check the configuration of the CPU critical macros or where I can check that the interrupts are correctly set up as kernel aware?

mtm0183 commented 3 years ago

For an ST port, the CPU macros should be ok. cpu_cfg.h sets the CPU_CFG_KA_IPL_BOUNDARY. Make sure any RTOS-aware interrupts are in the KA boundary.

OmicronB commented 3 years ago

Curiously the interrupt priorities are not set in the BSP code. I see this for example:

BSP_IntVectSet(BSP_INT_ID_USART2, BSP_Ser_ISR_Handler);
BSP_IntEn(BSP_INT_ID_USART2);

Seems the priorities are left at their defaults. If I add a line to set the priority to something that is in the KA boundary:

BSP_IntVectSet(BSP_INT_ID_USART2, BSP_Ser_ISR_Handler);
BSP_IntPrioSet(BSP_INT_ID_USART2, 15, CPU_INT_KA);
BSP_IntEn(BSP_INT_ID_USART2);

I run into another issue. At startup I'm writing debug output to USART2. At some point it is pending on a semaphore, I assume it is waiting for the TX queue to drain. When it does this the code hangs in the function "OS_PendListInsertPrio" in os_core.c. It loops forever on line 1335. The element pointed to by p_tcb_next has this contents:

p_tcb_next = 0x2000B534 -> (
    StkPtr = 0x2000B40C,
    ExtPtr = 0x0,
    StkLimitPtr = 0x2000A6CC,
    NamePtr = 0x080179C0,
    NextPtr = 0x0,
    PrevPtr = 0x0,
    TickNextPtr = 0x20007CA8,
    TickPrevPtr = 0x0,
    StkBasePtr = 0x2000A534,
    TaskEntryAddr = 0x08014EFD,
    TaskEntryArg = 0x0,
    PendNextPtr = 0x2000B534,
    PendPrevPtr = 0x2000B534,
    PendObjPtr = 0x20000250,
    PendOn = 6,
    PendStatus = 0,
    TaskState = 2,
    Prio = 7,
    BasePrio = 7,
    MutexGrpHeadPtr = 0x0,
    StkSize = 1024,
    Opt = 3,
    TS = 26460899,
    SemCtr = 0,
    TickRemain = 0,
    TickCtrPrev = 0,
    TimeQuanta = 0,
    TimeQuantaCtr = 100,
    MsgPtr = 0x0,
    MsgSize = 0,
    MsgQ = (InPtr = 0x0, OutPtr = 0x0, NbrEntriesSize = 0, NbrEntries = 0, NbrEntriesMax = 0),
    MsgQPendTime = 0,
    MsgQPendTimeMax = 0,
    RegTbl = (0),
    FlagsPend = 0,
    FlagsRdy = 0,
    FlagsOpt = 0,
    SuspendCtr = 0,
    CPUUsage = 662,
    CPUUsageMax = 662,
    CtxSwCtr = 426,
    CyclesDelta = 2361,
    CyclesStart = 26461766,
    CyclesTotal = 195551,
    CyclesTotalPrev = 477117,
    SemPendTime = 0,
    SemPendTimeMax = 0,
    StkUsed = 103,
    StkFree = 921,
    IntDisTimeMax = 304,
    SchedLockTimeMax = 0,
    DbgPrevPtr = 0x200009CC,
    DbgNextPtr = 0x20007E0C,
    DbgNamePtr = 0x08016F34)

The issue here is that the member field PendNextPtr is pointing to itself hence the loop never terminates.

The stack looks like this at this point:

-000|OS_PendListInsertPrio(:p_pend_list = 0x20000258, :p_tcb = 0x2000B534)
    |  p_pend_list = 0x20000258
    |  p_tcb = 0x2000B534
    |  prio = 7
    |  p_tcb_next = 0x2000B534
-001|OS_Pend(:p_obj = 0x20000250, p_tcb = 0x2000B534, :pending_on = 6, :timeout = 0)
    |  p_obj = 0x20000250
    |  p_tcb = 0x2000B534
    |  pending_on = 6
    |  timeout = 0
-002|OSSemPend(p_sem = 0x20000250, timeout = 0, opt = ?, p_ts = 0x0, p_err = 0x2000B486)
    |  p_sem = 0x20000250
    |  timeout = 0
    |  opt = ?
    |  p_ts = 0x0
    |  p_err = 0x2000B486
    |  cpu_sr = 64
-003|BSP_OS_SemWait(:p_sem = 0x20000250, :dly_ms = 0)
    |  p_sem = 0x20000250
    |  dly_ms = 0
    |  err = 2048
    |  dly_ticks = 0
-004|BSP_Ser_WrByteUnlocked(:c = 83)
    |  c = 83
-005|BSP_Ser_WrByte(c = ?)
    |  c = ?
-006|putchar(ch = ?)
    |  ch = ?
-007|log_info_str(:str = 0x0801707C)
    |  str = 0x0801707F
-008|Mem_OutputUsage(out_fnct = 0x08013055, :p_err = 0x2000B524)
    |  out_fnct = 0x08013055
    |  p_err = 0x2000B524
    |  str = (16, 16, 16, 16, 17, 17, 17, 17, 181, 78)
    |  p_seg = 0x20007AC0
    |  cpu_sr = 0
    |  rem_size = 7128
-009|main_task(p_arg = ?)
    |  p_arg = ?
    |  err_os = OS_ERR_NONE
    |  err_lib = LIB_MEM_ERR_NONE

OmicronB commented 3 years ago

I've debugged this new issue a bit further. The BSP has this function:

void  BSP_Ser_WrByteUnlocked (CPU_INT08U c)
{
    USART_ITConfig(USART2, USART_IT_TC, ENABLE);
    USART_SendData(USART2, c);
    BSP_OS_SemWait(&BSP_SerTxWait, 0);
    USART_ITConfig(USART2, USART_IT_TC, DISABLE);
}

The function BSP_OS_SemWait is just a wrapper for OSSemPend:

CPU_BOOLEAN  BSP_OS_SemWait (BSP_OS_SEM  *p_sem,
                             CPU_INT32U   dly_ms)
{
    OS_ERR      err;
    CPU_INT32U  dly_ticks;

    dly_ticks  = ((dly_ms * DEF_TIME_NBR_mS_PER_SEC) / OSCfg_TickRate_Hz);

    OSSemPend((OS_SEM *)p_sem,
              (OS_TICK )dly_ticks,
              (OS_OPT  )OS_OPT_PEND_BLOCKING,
              (CPU_TS  )0,
              (OS_ERR *)&err);

    if (err != OS_ERR_NONE) {
       return (DEF_FAIL);
    }

    return (DEF_OK);
}

What I see is that at some point the wait returns but the current task is still in the pending list of the semaphore. The next time the function is called the current task is added to the list again and I think this causes the circular reference.

But the question is: how can OSSemPend return with a result of OS_ERR_NONE without removing the current task from the pending list of the semaphore? Curiously this always happens at the exact same point in the execution so it seems to be deterministic.

OmicronB commented 3 years ago

I believe I found the issue. I was using the function Mem_OutputUsage to print the memory usage. This function internally calls CPU_CRITICAL_ENTER() and from within this critical region calls the output function which in turn results in the BSP_Ser_WrByteUnlocked function to eventually be called.

I feel this is a bug in Mem_OutputUsage. It should not call the output function from within a critical region.

weston-embedded / uC-OS3

Crash in OSIntExit #21