kernel.scheduler.multiq: Failed since #35276 ("cooperative scheduling only" special cases removal)

erwango commented 3 years ago

Describe the bug Test tests/kernel/sched/schedule_api/kernel.scheduler.multiq is failed on disco_l475_iot1 since commit 851d14afc8941313a6f3faeb74f84ed73a33429a

Test is failed only when run with twister but is passed when run manually. This is the only board impacted, as far as I know

To Reproduce Steps to reproduce the behavior: twister --test tests/kernel/sched/schedule_api/kernel.scheduler.multiq -p disco_l475_iot1 --device-testing --device-serial /dev/ttyACM0

Expected behavior Test should be passed

Logs and console output

START - test_slice_scheduling
ABCDEFGHI
AB
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false
C
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false
D
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false
E
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false
F
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false
G
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false
H
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false
I
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false
A
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false
FAIL - test_slice_scheduling in 2.598 seconds
===================================================================

Environment (please complete the following information):

OS: (e.g. Linux, MacOS, Windows): Ubuntu
Toolchain (e.g Zephyr SDK, ...): SDK 0.12.2
Commit SHA or Version used: 851d14afc8941313a6f3faeb74f84ed73a33429a

Additional context Add any other context about the problem here.

erwango commented 3 years ago

@andyross FYI. It's hard for me to make a direct link with your change but it seems that the schedule for this test is a bit tight on this board (cf different status when run with west vs twister) and #35276 may had some small timing changes that impacts the final result.

nashif commented 3 years ago

@andyross Can you please take a look here

andyross commented 3 years ago

Only under twister?! How would it even know? Surely there's something different about the generated binary?

As far as what's going on, this shouldn't have affected much with that test. The only spot in the code where we were hitting the oddball "configure zero preemptible priorities" situation was in one of the philosophers sub-samples. I'll take a look and diff the resulting .config files, maybe that has a clue. I do at least have one of these boards. Do any others fail too, or is it just this one?

andyross commented 3 years ago

Also, maybe it's just a phantom bisect result? Did you try reverting this one change in isolation (which should work) to see if that was the issue?

galak commented 3 years ago

Test is failed only when run with twister but is passed when run manually. This is the only board impacted, as far as I know

@nashif commented that the difference here maybe due to ASSERTs being enabled or not.

andyross commented 3 years ago

CONFIG_ASSERT=y always in tests. I know twister will enable some non-default warnings, it's not supposed to affect generated code AFAIK.

erwango commented 3 years ago

I'll have a new round of test in case this was a friday evening bug. But I confirm that I tried just before and after the commit a certain number of time to validate the bisect result. I'll also compare twister and west binaries, while I'm at it.

erwango commented 3 years ago

One mystery solved: kernel.scheduler.multiq requires use of -DCONF_FILE=prj_multiq.conf when run with west. Using this, I get the same status in twister and manual.

Then I confirm that test is failing on commit 851d14a and failed on the SHA1 before. I've been able to confirm this a number for times.

BUT, I'm not able to reproduce the issue anymore on top of master, and I can see same status on our bench. So I'm closing this point (anyone interested in digging into this a bit more is free to reopen).

FRASTM commented 3 years ago

issues appears again on the SHA1 a02704b SHA1 95e359d

FRASTM commented 3 years ago

It should be noticed that forcing the LL_FLASH_SetLatency(LL_FLASH_LATENCY_2); in the drivers/clock_control/clock_stm32_ll_common.c (end of the stm32_clock_control_init function) is a workaround to pass the test.

affrinpinhero-2356 commented 3 years ago

Hi All,

I have done some debugs, and here are my findings. As per the analysis, I can confirm there was some synchronization issue. We observed int64_t tdelta = k_uptime_delta(&elapsed_slice); was going above 200 for say 201. and at next thread int64_t tdelta = k_uptime_delta(&elapsed_slice); was getting less than 200 for say 199. And after that, all goes to assert. And error occurs.

Currently, this issue was not observed with the latest main.

github-actions[bot] commented 3 years ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

ABOSTM commented 3 years ago

Analysis: Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:61: thread_tslice: ((tdelta >= expected_slice_min) && (tdelta <= expected_slice_max) && (idx == thread_idx)) is false

Time slice is programmed at 200ms #define SLICE_SIZE 200 Expected time slice is computed

    int64_t expected_slice_min = k_ticks_to_ms_floor64(k_ms_to_ticks_ceil32(SLICE_SIZE));
    int64_t expected_slice_max = k_ticks_to_ms_floor64(k_ms_to_ticks_ceil32(SLICE_SIZE) + 1);

expected_slice_min = 200 expected_slice_max = 200 (CONFIG_SYS_CLOCK_TICKS_PER_SEC=10000)

but measurements show sometimes tdelta =199 so test is failed because 199 (measured time slice) is lower than 200 (expected_slice_min) Similarly measurements show sometimes tdelta =201

Note: compiling with debug option or adding some printf() may lead to test passed, so it seems to me like a timing issue. which could explain the instability observed from one commit to another.

I think this -1 compare to theoretical time slice could be explain because of : test imperfection (measuring time slice thanks to k_uptime_delta() from different threads) or conversion from ms to tick and reverse, or kernel TICK clamping to lower value (https://github.com/zephyrproject-rtos/zephyr/blob/0e1ecb373a1b5b639ea09ebc1cba7a1c44ea5101/drivers/timer/cortex_m_systick.c#L187)

Thus my proposal would be to enlarge time slice criterion PR #38967. With proposed changes, (with disco_l475_iot1) expected_slice_min = 199 expected_slice_max = 201

zephyrproject-rtos / zephyr

kernel.scheduler.multiq: Failed since #35276 ("cooperative scheduling only" special cases removal) #35793