zephyriot / zep-jira14

0 stars 0 forks source link

schedule_api test fails on ARM with tickless kernel enabled #2053

Closed nashif closed 7 years ago

nashif commented 7 years ago

Reported by Andrew Boie:

Steps to reproduce:

Add the following lines to prj.conf:

CONFIG_DEBUG=y
CONFIG_SYS_POWER_MANAGEMENT=y
CONFIG_TICKLESS_IDLE=y
CONFIG_TICKLESS_KERNEL=y

Run the test for qemu_cortex_m3:

tc_start() - test_slice_reset
thread[0] elapsed slice 1392, expected 100

    Assertion failed at /projects/zephyr/tests/kernel/threads_scheduling/schedule_api/src/test_sched_timeslice_reset.c:38: thread_tslice:  t <= expected_slice[thread_idx] is false
thread[0] elapsed slice 1, expected 100
thread[1] elapsed slice 199, expected 200

(Imported from Jira ZEP-2217)

nashif commented 7 years ago

by Youvedeep Singh:

Hi Andrew Boie Can you please help me to test this quickly to see if this is same issue as GH-2054 or not. Currently I do not have access to ARM Board.

What need to be done :- File Name : "kernel/sched.c" In function : - _update_time_slice_before_swap" Just comment out following condition ;- "// if (!remaining || (_time_slice_duration < remaining)) {" (I am writing this in this format as formatting of patch is getting messed up) :)

After commenting out can you please retest this once,

nashif commented 7 years ago

by Youvedeep Singh:

With Power Management Disabled ==> Test Pass With Tickless Idle Enabled ==> Test Pass With Tickless Kernel Enabled ==> Test Fail

With default TICKLESS_KERNEL enbaled this is getitng Crash @ test_time_slicing_preemptible Here is the log i got:- Running test suite test_threads_scheduling tc_start() - test_priority_cooperative

PASS - test_priority_cooperative. tc_start() - test_priority_preemptible

PASS - test_priority_preemptible. tc_start() - test_yield_cooperative

PASS - test_yield_cooperative. tc_start() - test_sleep_cooperative

PASS - test_sleep_cooperative. tc_start() - test_sleep_wakeup_preemptible

PASS - test_sleep_wakeup_preemptible. tc_start() - test_time_slicing_preemptible

Assertion failed at /home/ysingh/IOT/Zephyr/Source_Code/github/test2/zephyr/tests/kernel/threads_scheduling/schedule_api/src/test_sched_timeslice_and_lock.c:15e

=================================================================== FAIL - test_time_slicing_preemptible. tc_start() - test_time_slicing_disable_preemptible

PASS - test_time_slicing_disable_preemptible. tc_start() - test_lock_preemptible USAGE FAULT Executing thread ID (thread): 0x20070220 Faulting instruction address: 0x82680 No coprocessor instructions Fatal fault in thread 0x20070220! Aborting. USAGE FAULT Executing thread ID (thread): 0x20070220 Faulting instruction address: 0x82680 No coprocessor instructions Fatal fault in thread 0x20070220! Aborting. USAGE FAULT Executing thread ID (thread): 0x20070220 Faulting instruction address: 0x82680 No coprocessor instructions Fatal fault in thread 0x20070220! Aborting. USAGE FAULT Executing thread ID (thread): 0x20070220 Faulting instruction address: 0x82680 No coprocessor instructions Fatal fault in thread 0x20070220! Aborting. USAGE FAULT Executing thread ID (thread): 0x20070220 Faulting instruction address: 0x82680 No coprocessor instructions Fatal fault in thread 0x20070220! Aborting. USAGE FAULT Executing thread ID (thread): 0x20070220 Faulting instruction address: 0x82680 No coprocessor instructions Fatal fault in thread 0x20070220! Aborting.

I am debugging his issue further

nashif commented 7 years ago

by Youvedeep Singh:

Further i put some debug prints, looks like crash seems to be comming from void " test_time_slicing_preemptible" @ zassert_true(tdata[1].executed == 1, NULL).

I put 2 prints one just before this and another just after this. The print ahead of this is getting printed but print after this is not getting printed when TICKLESS_KERNEL is enabled. Not sure if this is the issue, I will dig this further.

nashif commented 7 years ago

by Youvedeep Singh:

Further Debug on this issue. Here are 2 issues :-

  1. test_time_slicing_preemptible :- If i comment "zassert_true(tdata[1].executed == 1, NULL);", I am not hitting the issue. As of now I am not focusing this issue.

  2. Failure at "test_slice_reset". In this case time slice given to each thread of quual priority is not same, While debugging identified that time slide is not set properly, even though time slide set should be 200, but inside "_nano_sys_clock_tick_announce" number of ticks passed is 199 due to which condition is not getting satisfied for setting the time (remaining and next_to both are staying 0). Reason for next_to become 0 ? Inside "handle_time_slicing"
    _time_slice_elapsed = 199 and _time_slice_duration = 200 due to which , _time_slice_elapsed >= _time_slice_duration is not getting satisfied and and next_ts becomes 0. This looks like issue inside "cortex_m_systick.c", i will continue debug this tomorrow.

nashif commented 7 years ago

by Youvedeep Singh:

Further It looks like "_sys_clock_tick_count" is not getting updated correctly, which is leading to may crazy issues. I dumped the value of systick to look

Loaded Value of ticks 199 (Max ticks), So we were expecting that value of ticks will upltimately on timer expire should be at least 199, but it is coming out to be ~32. I am suspecting that "Loaded Value f" is not getting set when timer is rolling back and again stats counting from Load value. So when we are reaching at interrupt, difference between load value and Current value is not significant and it is leading to low value of clock count.

This is evident from following dump.

**7 Loaded Value ff10e0 Current 7a807

***7 Loaded Value f ff10e0 Current 4ccc2

**7 Loaded Value f ff10e0### _sys_clock_tick_count 32 // TIll now ~195 ticks have expired. _nano_sys_clock_tick_announce*** //Timer got expired here and it generated and interrupt _set_time====> idle_original_ticks 1 time 1 //request to prog for one tick. =======_get_elapsed_clock_time===

***4 Loaded Value ff10e0 Current ff10c0 //It can be seen that timer has expired and Current value is Loaded with previous and it counts downward 1f182

_sys_clock_tick_count 32 //Total number of systicks

nashif commented 7 years ago

by Andrew Boie:

When we resolve this issue, we need to be sure to add an additional test case to schedule_api's testcase.yaml that does this test with tickless kernel enabled.

nashif commented 7 years ago

by Youvedeep Singh:

Further I was able to root cause why "_sys_clock_tick_count" was not getting updated. The reason was "SysTick_CTRL_COUNTFLAG_Msk" is clear on read, So this was getting cleared when it was read while reading it from sched.c (while updating tmp_sys_clock_tick_count) _set_time====> idle_original_ticks 199 time 200 overflow 0 =======updated 0===_sys_clock_tick_count 0 =======updated 0===_sys_clock_tick_count 199 =======updated 0===_sys_clock_tick_count 397

After this, still I am facing another issues related to clock looks like k_cycle_get32 is getting expired sooner ztest is getting failed. I will continue looking into this.

nashif commented 7 years ago

by Youvedeep Singh:

sure Andrew i will add this test in "testcase.yaml " on getting proper fix for this test.

nashif commented 7 years ago

by Youvedeep Singh:

There were few issues around tickless kernel related to systick Cortex m timer and a scheduler issue which was impacting all the timers.

Timer Systick : Handle Systick timer rollback in tickless kernel Case **Issues related to cortex_m_systick **

  1. System time keeping was not happening properly. When systick was getting expired it's overflow bit was get. This bit was clear on read and it was getting cleared in multiple places like sysTickstop, get current time etc. Due to which while updating _sys_clock_tick_count, it was getting updated with SysTick->LOAD - SysTick->VAL instead of SysTick->LOAD. Due to which kernel time keeping was not happening properly. This patch tracks overflow occurrence and update _sys_clock_tick_count corretly.

Before :- Running test suite test_threads_scheduling tc_start() - test_priority_cooperative

PASS - test_priority_cooperative. tc_start() - test_priority_preemptible

PASS - test_priority_preemptible. tc_start() - test_yield_cooperative

PASS - test_yield_cooperative. tc_start() - test_sleep_cooperative

PASS - test_sleep_cooperative. tc_start() - test_sleep_wakeup_preemptible

PASS - test_sleep_wakeup_preemptible. tc_start() - test_time_slicing_preemptible

Assertion failed at /home/ysingh/IOT/Zephyr/Source_Code/github/test2/zephyr/tests/kernel/threads_scheduling/schedule_api/src/test_sched_timeslice_and_lock.c:15e

After :- Running test suite test_threads_scheduling tc_start() - test_priority_cooperative

PASS - test_priority_cooperative. tc_start() - test_priority_preemptible

PASS - test_priority_preemptible. tc_start() - test_yield_cooperative

PASS - test_yield_cooperative. tc_start() - test_sleep_cooperative

PASS - test_sleep_cooperative. tc_start() - test_sleep_wakeup_preemptible

PASS - test_sleep_wakeup_preemptible. tc_start() - test_time_slicing_preemptible

PASS - test_time_slicing_preemptible. tc_start() - test_time_slicing_disable_preemptible

PASS - test_time_slicing_disable_preemptible. tc_start() - test_lock_preemptible

PASS - test_lock_preemptible. tc_start() - test_unlock_preemptible

PASS - test_unlock_preemptible.

Patch : Scheduler : Account kernel time keeping when thread releases cpu befo *Issue related to cortex_m_systick /local apic/ hpet timers ***** When multiple pre-emptive threads of same priority are queued together, then these threads should execute in round robin fashion with equal priority, but in certain scnerios this scheduling is not fair and one thread gets lower time slide than expected. For example if threads A, B and C are of equal priority and are queued in order A -> B ->C. Let us assume Thread A is running with time slide as 200mSec. So timer is programmed for 200 ticks and thread A starts running. Let us assume thread A has run for 100 ticks and it gets semtake and not at this stage A is put at end of queue and Thread B is in top (B -> C ->A). Now thread B starts running and on timer expire after 100 ticks. At the end of 200 ticks (100 ticks spent by A + 100 ticks by B), timer expired. now elapsed ticks updated = 200. and which results in elapsed_timeslice for B too to 200 (but actually B has taken only 100 ticks).

So this patch sets the time out to remaining time again and accounts the spend time. So that after 100 ticks expired for A it accounts only 100 ticks and B gets fair chance.

Before Case :- PASS - test_sched_is_preempt_thread. tc_start() - test_slice_reset thread[0] elapsed slice 100, expected 100 thread[1] elapsed slice 65, expected 200 thread[2] elapsed slice 199, expected 200 thread[0] elapsed slice 100, expected 100 thread[1] elapsed slice 77, expected 200 thread[2] elapsed slice 199, expected 200

After Case :- tc_start() - test_slice_reset thread[0] elapsed slice 100, expected 100 thread[1] elapsed slice 200, expected 200 thread[2] elapsed slice 200, expected 200 thread[0] elapsed slice 100, expected 100 thread[1] elapsed slice 200, expected 200 thread[2] elapsed slice 200, expected 200

PASS - test_slice_reset.

Also in certain corner cases still i observed that we are hitting the issue (at low frequency) but the above patch fix most of issues, I will continue looking into corner cases when we are getting issue and will update.

nashif commented 7 years ago

by Youvedeep Singh:

Attached is the pull request.

nashif commented 7 years ago

by Youvedeep Singh:

This patch got merged.

nashif commented 7 years ago

by Youvedeep Singh:

Patch merged so closing the issue.