entropy_stm32_rng_get_entropy sometimes blocks forever in Zephyr 3.6.0

kevinior commented 8 months ago

Target platform: STM32H743 Zephyr 3.6.0

Our application uses the Zephyr MQTT client. Since upgrading to 3.6.0 the MQTT thread sometimes hangs on calling mqtt_connect. Using a debugger I can see that the thread is stuck waiting on entropy_stm32_rng_data.sem_sync.

The problem happens quite reliably, the application retries mqtt_connect every 500 ms and almost always gets stuck in the first 10 attempts. It seems to happen the first time sys_rand32_get is called when the entropy pool is empty (usually in zsock_connect called from find_available_port, tcp_derive_rto or tcp_init_isn).

The problem is sensitive to timing, adding debug output can make it go away.

A manual git bisect (of just drivers/entropy/entropy_stm32.c) showed that the problem appeared in commit 78af1c1.

Commenting out the bodies of the entropy_stm32_suspend and entropy_stm32_resume functions makes the problem go away.

Using the debugger it seems like the stm32_rng_isr isn't being called once the problem happens. I can see that the IRQ is enabled and that the RNG CR register has the IE bit set and the SR register has DRDY set.

Impact

This breaks networking in our application. Once mqtt_connect is stuck we can't connect with telnet or send commands with mcumgr (UDP). The only way to recover is to reboot the MCU.

github-actions[bot] commented 8 months ago

Hi @kevinior! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

erwango commented 8 months ago

@niedzwiecki-dawid would you be able to have a look ?

niedzwiecki-dawid commented 8 months ago

Is there any way I can reproduce the issue? Is entropy_stm32_rng_pm_action ever called? How are CONFIG_ENTROPY_STM32_ISR_POOL_SIZE and CONFIG_ENTROPY_STM32_ISR_THRESHOLD set?

kevinior commented 8 months ago

Is there any way I can reproduce the issue?

It seems to be very dependent on timing, so I haven't been able to make a minimal test application.

Is entropy_stm32_rng_pm_action ever called?

We don't have CONFIG_PM_DEVICE enabled, so entropy_stm32_rng_pm_action doesn't exist.

How are CONFIG_ENTROPY_STM32_ISR_POOL_SIZE and CONFIG_ENTROPY_STM32_ISR_THRESHOLD set?

Both were left at defaults, so 16 and 12 respectively. Here's the relevant part of build/zephyr/.config:

CONFIG_ENTROPY_GENERATOR=y
# CONFIG_ENTROPY_LOG_LEVEL_OFF is not set
# CONFIG_ENTROPY_LOG_LEVEL_ERR is not set
# CONFIG_ENTROPY_LOG_LEVEL_WRN is not set
# CONFIG_ENTROPY_LOG_LEVEL_INF is not set
# CONFIG_ENTROPY_LOG_LEVEL_DBG is not set
CONFIG_ENTROPY_LOG_LEVEL_DEFAULT=y
CONFIG_ENTROPY_LOG_LEVEL=3
CONFIG_ENTROPY_INIT_PRIORITY=50
CONFIG_ENTROPY_STM32_RNG=y
CONFIG_ENTROPY_STM32_THR_POOL_SIZE=8
CONFIG_ENTROPY_STM32_THR_THRESHOLD=4
CONFIG_ENTROPY_STM32_ISR_POOL_SIZE=16
CONFIG_ENTROPY_STM32_ISR_THRESHOLD=12
CONFIG_ENTROPY_HAS_DRIVER=y

niedzwiecki-dawid commented 8 months ago

Thanks for the info.

Using the debugger it seems like the stm32_rng_isr isn't being called once the problem happens. I can see that the IRQ is enabled and that the RNG CR register has the IE bit set and the SR register has DRDY set.

That looks good. Would you be able to check status of this IRQ in NVIC? I mean output from functions:

NVIC_GetEnableIRQ(80)
NVIC_GetPendingIRQ(80)
NVIC_GetActive(80)
NVIC_GetPriority(80)

kevinior commented 8 months ago

Here's the whole NVIC contents from gdb:

(gdb) set $nvic = ((NVIC_Type *) ((0xE000E000UL) + 0x0100UL) )
(gdb) set $irqn = 80
(gdb) set $nvic_prio_bits = 4
(gdb) p/z *$nvic
$5 = {
  ISER = {0x0087fd00, 0x20008140, 0x00010000, 0x80300000, 0x00000000, 0x00000000, 0x00000000, 
    0x00000000},
  RESERVED0 = {0x00000000 <repeats 24 times>},
  ICER = {0x0087fd00, 0x20008140, 0x00010000, 0x80300000, 0x00000000, 0x00000000, 0x00000000, 
    0x00000000},
  RESERVED1 = {0x00000000 <repeats 24 times>},
  ISPR = {0x00040000, 0x20000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 
    0x00000000},
  RESERVED2 = {0x00000000 <repeats 24 times>},
  ICPR = {0x00040000, 0x20000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 
    0x00000000},
  RESERVED3 = {0x00000000 <repeats 24 times>},
  IABR = {0x00040000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 
    0x00000000},
  RESERVED4 = {0x00000000 <repeats 56 times>},
  IP = {0x20 <repeats 18 times>, 0x00, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x00, 
    0x20 <repeats 18 times>, 0x00, 0x20 <repeats 105 times>, 0x00 <repeats 90 times>},
  RESERVED5 = {0x00000000 <repeats 516 times>, 0x411fc271, 0x00422822, 0x08000000, 0xfa050000, 
    0x00000010, 0x00070210, 0x00000000, 0x10000000, 0x20f00000, 0x00070000, 0x00000000, 0x00000000, 
    0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000030, 0x00000200, 0x00100000, 0x00000000, 
    0x00110030, 0x00000000, 0x01000000, 0x00000000, 0x01101110, 0x02112000, 0x20232231, 0x01111131, 
    0x01310132, 0x00000000, 0x09000003, 0x8303c003, 0xf00fe019, 0x00000000, 0x00500000, 0x00000000, 
    0x00001000, 0x00000005, 0x0000000f, 0x0000000f, 0x00000000, 0x0000000f, 0x00000000, 0x0000000f, 
    0x00000000, 0x0000000f, 0x00000000 <repeats 14 times>, 0x00030003, 0x00000000, 0xffffffff, 
    0x01000401, 0x00000000 <repeats 64 times>},
  STIR = 0x00000000
}

gdb can't see the NVIC_ macros or the __NVIC_ functions they call but I used the source:

(gdb) # __NVIC_GetEnableIRQ
(gdb) p/z ((uint32_t)((($nvic->ISER[(((uint32_t)$irqn) >> 5UL)] & (1UL << (((uint32_t)$irqn) & 0x1FUL))) != 0UL) ? 1UL : 0UL))
$7 = 0x00000001

(gdb) # __NVIC_GetPendingIRQ
(gdb) p/z ((uint32_t)((($nvic->ISPR[(((uint32_t)$irqn) >> 5UL)] & (1UL << (((uint32_t)$irqn) & 0x1FUL))) != 0UL) ? 1UL : 0UL))
$8 = 0x00000000

(gdb) # __NVIC_GetActive(80)
(gdb) p ((($nvic->IABR[(((uint32_t)$irqn) >> 5UL)] & (1UL << (((uint32_t)$irqn) & 0x1FUL))) != 0UL) ? 1UL : 0UL)
$9 = 0

(gdb) # __NVIC_GetPriority(80)
(gdb) p/z (((uint32_t)$nvic->IP[((uint32_t)$irqn)]                >> (8U - $nvic_prio_bits)))
$10 = 0x00000002

niedzwiecki-dawid commented 8 months ago

Great, looks like the IRQ is not pending in NVIC, so me guess is that it is somehow connected with NVIC_ClearPendingIRQ calls in the driver. I'm going to take a closer look at that.

niedzwiecki-dawid commented 8 months ago

It is STM32H743, so CONFIG_STM32H7_DUAL_CORE=n, right?

kevinior commented 8 months ago

It is STM32H743, so CONFIG_STM32H7_DUAL_CORE=n, right?

Yes, CONFIG_STM32H7_DUAL_CORE=n

niedzwiecki-dawid commented 8 months ago

Would you be able to check if this function generate_from_isr is ever called?

kevinior commented 8 months ago

Would you be able to check if this function generate_from_isr is ever called?

I just ran a quick test and I never see that function being called.

niedzwiecki-dawid commented 8 months ago

Can you check if you hit this line https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/entropy/entropy_stm32.c#L539 ? I still don't see the bug, it's hard to find it this way.

kevinior commented 8 months ago

Can you check if you hit this line https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/entropy/entropy_stm32.c#L539 ? I still don't see the bug, it's hard to find it this way.

I won't be able to test until tomorrow morning.

I'll also have another try at finding a test case that reproduces the problem without our whole application and hardware.

We have some frequent (94 kHz) and high-priority interrupts, so maybe that's relevant Although they don't do anything with the RNG.

niedzwiecki-dawid commented 8 months ago

I'll also have another try at finding a test case that reproduces the problem without our whole application and hardware.

Would be great.

Another tests:

Check if commenting out only clock_control_off/on in entropy_stm32_suspend/resume makes difference
Using the debugger it seems like the stm32_rng_isr isn't being called once the problem happens. I can see that the IRQ is enabled and that the RNG CR register has the IE bit set and the SR register has DRDY set.

Make sure that RNGEN bit is also set RNG CR register.

kevinior commented 8 months ago

Can you check if you hit this line https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/entropy/entropy_stm32.c#L539 ? I still don't see the bug, it's hard to find it this way.

It's not hitting that line.

Make sure that RNGEN bit is also set RNG CR register.

(gdb) p/z *((RNG_TypeDef *)0x48021800)
$6 = {
  CR = 0x00000008,
  SR = 0x00000001,
  DR = 0x00070cfa
}

So it looks like RNGEN is not set.

Check if commenting out only clock_control_off/on in entropy_stm32_suspend/resume makes difference

If I comment those out then the problem doesn't seem to happen. Or at least I've not seen it after several runs of a few minutes each when it normally occurs within the first 10 seconds.

niedzwiecki-dawid commented 8 months ago

So it looks like RNGEN is not set.

Oooo! I wonder if there is a need to wait a little bit after enabling clock .

res = clock_control_on(dev_data->clock, (clock_control_subsys_t)&dev_cfg->pclken[0]); LL_RNG_Enable(rng);

clock_control_on enables clock which is used to access the RNG registers. clock_control_on it is a blocking call, so I think it should be fine. @erwango do you know something more here?

Two additional testes:

Please add ~1ms sleep after clock_control_on and check if that helps.
Add additional print if res != 0.

niedzwiecki-dawid commented 8 months ago

Check if the RNG clock AHB2 0x04 is enabled once the bug occurs

kevinior commented 8 months ago

Please add ~1ms sleep after clock_control_on and check if that helps.

Adding:

        res = clock_control_on(dev_data->clock,
                        (clock_control_subsys_t)&dev_cfg->pclken[0]);
+       if (!k_is_pre_kernel())
+       {
+               k_busy_wait(1000);
+       }
        LL_RNG_Enable(rng);

it seems to avoid the problem. I tried a few different delays and it seems like even k_busy_wait(1) is enough.

Add additional print if res != 0.

I don't see the additional print even when the problem occurs.

Check if the RNG clock AHB2 0x04 is enabled once the bug occurs

RCC_AHB2ENR is 0x00000040 after the problem has occurred, so it looks like RNGEN is set and the RNG clock is enabled.

niedzwiecki-dawid commented 8 months ago

RCC_AHB2ENR is 0x00000040 after the problem has occurred, so it looks like RNGEN is set and the RNG clock is enabled.

Interesting, I really feel like we are accessing the RNG registers too fast. If the clock is enabled, that means LL_RNG_Enable was also called.

kevinior commented 8 months ago

Interesting, I really feel like we are accessing the RNG registers too fast.

Yes, it looks as if we try to set RNGEN too soon after enabling the clock then the register write silently fails. There's information in section 8.5.10 of the STM32H743 reference manual (RM0433 rev 7) about delays when enabling peripheral clocks. There's a recommendation to read back the RCC_xxxxENR register after writing, to check that the write is not still pending.

niedzwiecki-dawid commented 8 months ago

There's information in section 8.5.10 of the STM32H743 reference manual (RM0433 rev 7) about delays when enabling peripheral clocks. There's a recommendation to read back the RCC_xxxxENR register after writing, to check that the write is not still pending.

I haven't checked, but such operation should be done in the clock control driver. Not sure how to fix that properly right now.

niedzwiecki-dawid commented 8 months ago

You can also try barrier_dsync_fence_full between clock_on and RNG_Enable. There is a chance that instructions are reordered.

kevinior commented 8 months ago

You can also try barrier_dsync_fence_full between clock_on and RNG_Enable. There is a chance that instructions are reordered.

That works, I don't know if it's because of instruction reordering or just that it introduces enough of a delay to allow the write to complete.

niedzwiecki-dawid commented 8 months ago

That works, I don't know if it's because of instruction reordering or just that it introduces enough of a delay to allow the write to complete.

I agree, but IMO the barrier makes sense here. What do you think?

kevinior commented 8 months ago

That works, I don't know if it's because of instruction reordering or just that it introduces enough of a delay to allow the write to complete.

I agree, but IMO the barrier makes sense here. What do you think?

I don't know enough to have an opinion. Looking at the clock_control_on documentation it seems like it's not supposed to return before the clock is enabled and ready, so maybe this is really a bug in clock_control?

erwango commented 8 months ago

I agree, but IMO the barrier makes sense here. What do you think?

I'd suggest to update clock_control driver to behave similarly as HAL. For instance:

#define __HAL_RCC_GPIOA_CLK_ENABLE()   do { \
                                        __IO uint32_t tmpreg; \
                                        SET_BIT(RCC->AHBENR, RCC_AHBENR_GPIOAEN);\
                                        /* Delay after an RCC peripheral clock enabling */\
                                        tmpreg = READ_BIT(RCC->AHBENR, RCC_AHBENR_GPIOAEN);\
                                        UNUSED(tmpreg); \
                                      } while(0U)

So read back the register after write in clock_control_on/off functions.

kevinior commented 8 months ago

Thanks for the suggestion @erwango .

@niedzwiecki-dawid do you want to make the change in clock_control or should I?

niedzwiecki-dawid commented 8 months ago

You would need to test it anyway, so I think it makes sense that you send a PR. Is that okay?

kevinior commented 8 months ago

You would need to test it anyway, so I think it makes sense that you send a PR. Is that okay?

No problem and thanks for all the help with this.

kevinior commented 8 months ago

The fix is now merged to main.

niedzwiecki-dawid commented 8 months ago

Good job :-)

zephyrproject-rtos / zephyr

entropy_stm32_rng_get_entropy sometimes blocks forever in Zephyr 3.6.0 #70376