Closed kevinior closed 8 months ago
Hi @kevinior! We appreciate you submitting your first issue for our open-source project. 🌟
Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙
@niedzwiecki-dawid would you be able to have a look ?
Is there any way I can reproduce the issue?
Is entropy_stm32_rng_pm_action
ever called?
How are CONFIG_ENTROPY_STM32_ISR_POOL_SIZE
and CONFIG_ENTROPY_STM32_ISR_THRESHOLD
set?
Is there any way I can reproduce the issue?
It seems to be very dependent on timing, so I haven't been able to make a minimal test application.
Is
entropy_stm32_rng_pm_action
ever called?
We don't have CONFIG_PM_DEVICE
enabled, so entropy_stm32_rng_pm_action
doesn't exist.
How are
CONFIG_ENTROPY_STM32_ISR_POOL_SIZE
andCONFIG_ENTROPY_STM32_ISR_THRESHOLD
set?
Both were left at defaults, so 16 and 12 respectively. Here's the relevant part of build/zephyr/.config:
CONFIG_ENTROPY_GENERATOR=y
# CONFIG_ENTROPY_LOG_LEVEL_OFF is not set
# CONFIG_ENTROPY_LOG_LEVEL_ERR is not set
# CONFIG_ENTROPY_LOG_LEVEL_WRN is not set
# CONFIG_ENTROPY_LOG_LEVEL_INF is not set
# CONFIG_ENTROPY_LOG_LEVEL_DBG is not set
CONFIG_ENTROPY_LOG_LEVEL_DEFAULT=y
CONFIG_ENTROPY_LOG_LEVEL=3
CONFIG_ENTROPY_INIT_PRIORITY=50
CONFIG_ENTROPY_STM32_RNG=y
CONFIG_ENTROPY_STM32_THR_POOL_SIZE=8
CONFIG_ENTROPY_STM32_THR_THRESHOLD=4
CONFIG_ENTROPY_STM32_ISR_POOL_SIZE=16
CONFIG_ENTROPY_STM32_ISR_THRESHOLD=12
CONFIG_ENTROPY_HAS_DRIVER=y
Thanks for the info.
Using the debugger it seems like the stm32_rng_isr isn't being called once the problem happens. I can see that the IRQ is enabled and that the RNG CR register has the IE bit set and the SR register has DRDY set.
That looks good. Would you be able to check status of this IRQ in NVIC? I mean output from functions:
NVIC_GetEnableIRQ(80)
NVIC_GetPendingIRQ(80)
NVIC_GetActive(80)
NVIC_GetPriority(80)
Here's the whole NVIC contents from gdb:
(gdb) set $nvic = ((NVIC_Type *) ((0xE000E000UL) + 0x0100UL) )
(gdb) set $irqn = 80
(gdb) set $nvic_prio_bits = 4
(gdb) p/z *$nvic
$5 = {
ISER = {0x0087fd00, 0x20008140, 0x00010000, 0x80300000, 0x00000000, 0x00000000, 0x00000000,
0x00000000},
RESERVED0 = {0x00000000 <repeats 24 times>},
ICER = {0x0087fd00, 0x20008140, 0x00010000, 0x80300000, 0x00000000, 0x00000000, 0x00000000,
0x00000000},
RESERVED1 = {0x00000000 <repeats 24 times>},
ISPR = {0x00040000, 0x20000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000},
RESERVED2 = {0x00000000 <repeats 24 times>},
ICPR = {0x00040000, 0x20000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000},
RESERVED3 = {0x00000000 <repeats 24 times>},
IABR = {0x00040000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000},
RESERVED4 = {0x00000000 <repeats 56 times>},
IP = {0x20 <repeats 18 times>, 0x00, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x00,
0x20 <repeats 18 times>, 0x00, 0x20 <repeats 105 times>, 0x00 <repeats 90 times>},
RESERVED5 = {0x00000000 <repeats 516 times>, 0x411fc271, 0x00422822, 0x08000000, 0xfa050000,
0x00000010, 0x00070210, 0x00000000, 0x10000000, 0x20f00000, 0x00070000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000030, 0x00000200, 0x00100000, 0x00000000,
0x00110030, 0x00000000, 0x01000000, 0x00000000, 0x01101110, 0x02112000, 0x20232231, 0x01111131,
0x01310132, 0x00000000, 0x09000003, 0x8303c003, 0xf00fe019, 0x00000000, 0x00500000, 0x00000000,
0x00001000, 0x00000005, 0x0000000f, 0x0000000f, 0x00000000, 0x0000000f, 0x00000000, 0x0000000f,
0x00000000, 0x0000000f, 0x00000000 <repeats 14 times>, 0x00030003, 0x00000000, 0xffffffff,
0x01000401, 0x00000000 <repeats 64 times>},
STIR = 0x00000000
}
gdb can't see the NVIC_
macros or the __NVIC_
functions they call but I used the source:
(gdb) # __NVIC_GetEnableIRQ
(gdb) p/z ((uint32_t)((($nvic->ISER[(((uint32_t)$irqn) >> 5UL)] & (1UL << (((uint32_t)$irqn) & 0x1FUL))) != 0UL) ? 1UL : 0UL))
$7 = 0x00000001
(gdb) # __NVIC_GetPendingIRQ
(gdb) p/z ((uint32_t)((($nvic->ISPR[(((uint32_t)$irqn) >> 5UL)] & (1UL << (((uint32_t)$irqn) & 0x1FUL))) != 0UL) ? 1UL : 0UL))
$8 = 0x00000000
(gdb) # __NVIC_GetActive(80)
(gdb) p ((($nvic->IABR[(((uint32_t)$irqn) >> 5UL)] & (1UL << (((uint32_t)$irqn) & 0x1FUL))) != 0UL) ? 1UL : 0UL)
$9 = 0
(gdb) # __NVIC_GetPriority(80)
(gdb) p/z (((uint32_t)$nvic->IP[((uint32_t)$irqn)] >> (8U - $nvic_prio_bits)))
$10 = 0x00000002
Great, looks like the IRQ is not pending in NVIC, so me guess is that it is somehow connected with NVIC_ClearPendingIRQ
calls in the driver. I'm going to take a closer look at that.
It is STM32H743
, so CONFIG_STM32H7_DUAL_CORE=n
, right?
It is
STM32H743
, soCONFIG_STM32H7_DUAL_CORE=n
, right?
Yes, CONFIG_STM32H7_DUAL_CORE=n
Would you be able to check if this function generate_from_isr
is ever called?
Would you be able to check if this function
generate_from_isr
is ever called?
I just ran a quick test and I never see that function being called.
Can you check if you hit this line https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/entropy/entropy_stm32.c#L539 ? I still don't see the bug, it's hard to find it this way.
Can you check if you hit this line https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/entropy/entropy_stm32.c#L539 ? I still don't see the bug, it's hard to find it this way.
I won't be able to test until tomorrow morning.
I'll also have another try at finding a test case that reproduces the problem without our whole application and hardware.
We have some frequent (94 kHz) and high-priority interrupts, so maybe that's relevant Although they don't do anything with the RNG.
I'll also have another try at finding a test case that reproduces the problem without our whole application and hardware.
Would be great.
Another tests:
Check if commenting out only clock_control_off/on
in entropy_stm32_suspend/resume
makes difference
Using the debugger it seems like the stm32_rng_isr isn't being called once the problem happens. I can see that the IRQ is enabled and that the RNG CR register has the IE bit set and the SR register has DRDY set.
Make sure that RNGEN
bit is also set RNG CR register.
Can you check if you hit this line https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/entropy/entropy_stm32.c#L539 ? I still don't see the bug, it's hard to find it this way.
It's not hitting that line.
Make sure that RNGEN bit is also set RNG CR register.
(gdb) p/z *((RNG_TypeDef *)0x48021800)
$6 = {
CR = 0x00000008,
SR = 0x00000001,
DR = 0x00070cfa
}
So it looks like RNGEN
is not set.
Check if commenting out only clock_control_off/on in entropy_stm32_suspend/resume makes difference
If I comment those out then the problem doesn't seem to happen. Or at least I've not seen it after several runs of a few minutes each when it normally occurs within the first 10 seconds.
So it looks like RNGEN is not set.
Oooo! I wonder if there is a need to wait a little bit after enabling clock .
res = clock_control_on(dev_data->clock, (clock_control_subsys_t)&dev_cfg->pclken[0]); LL_RNG_Enable(rng);
clock_control_on
enables clock which is used to access the RNG registers. clock_control_on
it is a blocking call, so I think it should be fine. @erwango do you know something more here?
Two additional testes:
clock_control_on
and check if that helps.res
!= 0.
- Please add ~1ms sleep after clock_control_on and check if that helps.
Adding:
res = clock_control_on(dev_data->clock,
(clock_control_subsys_t)&dev_cfg->pclken[0]);
+ if (!k_is_pre_kernel())
+ {
+ k_busy_wait(1000);
+ }
LL_RNG_Enable(rng);
it seems to avoid the problem. I tried a few different delays and it seems like even k_busy_wait(1)
is enough.
- Add additional print if res != 0.
I don't see the additional print even when the problem occurs.
- Check if the RNG clock AHB2 0x04 is enabled once the bug occurs
RCC_AHB2ENR
is 0x00000040 after the problem has occurred, so it looks like RNGEN
is set and the RNG clock is enabled.
RCC_AHB2ENR is 0x00000040 after the problem has occurred, so it looks like RNGEN is set and the RNG clock is enabled.
Interesting, I really feel like we are accessing the RNG registers too fast. If the clock is enabled, that means LL_RNG_Enable
was also called.
Interesting, I really feel like we are accessing the RNG registers too fast.
Yes, it looks as if we try to set RNGEN
too soon after enabling the clock then the register write silently fails. There's information in section 8.5.10 of the STM32H743 reference manual (RM0433 rev 7) about delays when enabling peripheral clocks. There's a recommendation to read back the RCC_xxxxENR
register after writing, to check that the write is not still pending.
There's information in section 8.5.10 of the STM32H743 reference manual (RM0433 rev 7) about delays when enabling peripheral clocks. There's a recommendation to read back the RCC_xxxxENR register after writing, to check that the write is not still pending.
I haven't checked, but such operation should be done in the clock control driver. Not sure how to fix that properly right now.
You can also try barrier_dsync_fence_full
between clock_on and RNG_Enable. There is a chance that instructions are reordered.
You can also try
barrier_dsync_fence_full
between clock_on and RNG_Enable. There is a chance that instructions are reordered.
That works, I don't know if it's because of instruction reordering or just that it introduces enough of a delay to allow the write to complete.
That works, I don't know if it's because of instruction reordering or just that it introduces enough of a delay to allow the write to complete.
I agree, but IMO the barrier makes sense here. What do you think?
That works, I don't know if it's because of instruction reordering or just that it introduces enough of a delay to allow the write to complete.
I agree, but IMO the barrier makes sense here. What do you think?
I don't know enough to have an opinion. Looking at the clock_control_on
documentation it seems like it's not supposed to return before the clock is enabled and ready, so maybe this is really a bug in clock_control?
I agree, but IMO the barrier makes sense here. What do you think?
I'd suggest to update clock_control driver to behave similarly as HAL. For instance:
#define __HAL_RCC_GPIOA_CLK_ENABLE() do { \
__IO uint32_t tmpreg; \
SET_BIT(RCC->AHBENR, RCC_AHBENR_GPIOAEN);\
/* Delay after an RCC peripheral clock enabling */\
tmpreg = READ_BIT(RCC->AHBENR, RCC_AHBENR_GPIOAEN);\
UNUSED(tmpreg); \
} while(0U)
So read back the register after write in clock_control_on/off functions.
Thanks for the suggestion @erwango .
@niedzwiecki-dawid do you want to make the change in clock_control or should I?
You would need to test it anyway, so I think it makes sense that you send a PR. Is that okay?
You would need to test it anyway, so I think it makes sense that you send a PR. Is that okay?
No problem and thanks for all the help with this.
The fix is now merged to main.
Good job :-)
Target platform: STM32H743 Zephyr 3.6.0
Our application uses the Zephyr MQTT client. Since upgrading to 3.6.0 the MQTT thread sometimes hangs on calling
mqtt_connect
. Using a debugger I can see that the thread is stuck waiting onentropy_stm32_rng_data.sem_sync
.The problem happens quite reliably, the application retries
mqtt_connect
every 500 ms and almost always gets stuck in the first 10 attempts. It seems to happen the first timesys_rand32_get
is called when the entropy pool is empty (usually inzsock_connect
called fromfind_available_port
,tcp_derive_rto
ortcp_init_isn
).The problem is sensitive to timing, adding debug output can make it go away.
A manual git bisect (of just drivers/entropy/entropy_stm32.c) showed that the problem appeared in commit 78af1c1.
Commenting out the bodies of the
entropy_stm32_suspend
andentropy_stm32_resume
functions makes the problem go away.Using the debugger it seems like the
stm32_rng_isr
isn't being called once the problem happens. I can see that the IRQ is enabled and that the RNG CR register has the IE bit set and the SR register has DRDY set.Impact
This breaks networking in our application. Once
mqtt_connect
is stuck we can't connect with telnet or send commands with mcumgr (UDP). The only way to recover is to reboot the MCU.