zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.61k stars 6.5k forks source link

acrn_ehl_crb: test case of arch.interrupt.prevent_interruption failed #32603

Closed enjiamai closed 3 years ago

enjiamai commented 3 years ago

Describe the bug Test case arch.interrupt.prevent_interruption failed due it hang up. (It hangs up at calling k_busy_wait() while IRQ is locked.)

To Reproduce Steps to reproduce the behavior:

  1. Make sure BIOS configuration is correct for ACRN, and include/arch/x86/memory.ld line 44, LOCORE_BASE modified from 0x1000 to 0x8000.
  2. Run this command from the zephyrproject: twister -p acrn_ehl_crb --device-testing --device-serial-pty="/opt/remotehw/acrn-test-pty.exp,ehlsku11" --west-flash="/opt/remotehw/remotehw-x86-acrn.sh,ehlsku11" -vv -T tests/kernel/interrupt
  3. Check handler.log in twister.out in directory to see the error log.

Expected behavior The test case is passed as expected.

Impact ACRN on ehl_crb platform has little lower pass rate.

Logs and console output

The error log is like this:

*** Booting Zephyr OS build zephyr-v2.5.0-484-g23f699bd907f  ***
Running test suite interrupt_feature
===================================================================
START - test_isr_dynamic
SKIP - test_isr_dynamic
===================================================================
START - test_nested_isr
SKIP - test_nested_isr
===================================================================
START - test_prevent_interruption
locking interrupts

It hangs up after printing out "locking interrupts".

UPDATE - The test progresses with the use of newer APIC timer driver (APIC TSC Deadline timer driver):

32603.

DEBUG   - DEVICE: START - test_prevent_interruption
DEBUG   - DEVICE: locking interrupts
DEBUG   - DEVICE: unlocking interrupts
DEBUG   - DEVICE: 
DEBUG   - DEVICE: Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/interrupt/src/prevent_irq.c:58: test_prevent_interruption: (handler_result not equal to HANDLER_TOKEN)
DEBUG   - DEVICE: timer should have fired
DEBUG   - DEVICE: FAIL - test_prevent_interruption in 2.2 seconds

And the expected should be:

===================================================================
START - test_prevent_interruption
locking interrupts
unlocking interrupts
timer fired
 PASS - test_prevent_interruption
===================================================================
Test suite interrupt_feature succeeded
===================================================================
PROJECT EXECUTION SUCCESSFUL

Environment (please complete the following information):

UPDATE -

aasthagr commented 3 years ago

I was able to reproduce this issue, but need to look at the root cause.

jenmwms commented 3 years ago

Hi @enjiamai can you please post the log for the failed case? The log in the description shows success. I am working to reproduce this now that it is assigned to me. Previous assignee cannot work on it at this time so assigned to me.

enjiamai commented 3 years ago

Hi @enjiamai can you please post the log for the failed case? The log in the description shows success. I am working to reproduce this now that it is assigned to me. Previous assignee cannot work on it at this time so assigned to me.

Hi @jenmwms , I am so sorry, I paste the wrong log that is correct one, now I update the error log for you reference, thank you!

jenmwms commented 3 years ago

Is the issue that it appears to hang after locking interrupts? I get this output which looks like the test is getting stuck:

 Booting Zephyr OS build zephyr-v2.5.0-611-g48321ef2525c  ***
Running test suite interrupt_feature
===================================================================
START - test_isr_dynamic
SKIP - test_isr_dynamic
===================================================================
START - test_nested_isr
SKIP - test_nested_isr
===================================================================
START - test_prevent_interruption
locking interrupts

Do we have a bisect yet? I will start working on that next, shouldn't be too far back tho.

enjiamai commented 3 years ago

Hi, yes, this is hang up after locking interrupts. Actually, it is hang up in k_busy_wait() when interrupt locked. And I did not have a bisect yet, I can do that~

jenmwms commented 3 years ago

@enjiamai np, was asking jic. Thanks for your reply! I can work on bisecting this.

LeiW000 commented 3 years ago

I don't think the issue should be "low". It seems to me that there is something wrong with k_busy_wait(), which is an important API.

enjiamai commented 3 years ago

I don't think the issue should be "low". It seems to me that there is something wrong with k_busy_wait(), which is an important API.

Make sense, priorities to medium.

jenmwms commented 3 years ago

It seems to me that there is something wrong with k_busy_wait(),

I understand why we should inspect k_busy_wait(), thanks for noting this. I am curious, if there is an issue with that service, we would anticipate it failing in other tests as well. Do we see indication of this?

enjiamai commented 3 years ago

I understand why we should inspect k_busy_wait(), thanks for noting this. I am curious, if there is an issue with that service, we would anticipate it failing in other tests as well. Do we see indication of this?

No, we did not see this test case failed in other platform, even as ehl crb bare metal. So it's only on ACRN ehl crb.

jenmwms commented 3 years ago

. So it's only on ACRN ehl crb

Ok that helps to know it is this platform only. However, my comment was regarding the kernel service itself k_busy_wait(). If something is broken in that service, we would see that service fail in other tests where it is used. It is used extensively across tests, across platforms. So, even if it is only on acrn_ehl_crb, we would anticipate k_busy_wait() service to fail in other cases on this platform. We can keep an eye out.

This has been failing for some time. It would help to know if this has passed in the recent past, and roughly when that changed. I am walking it back - it still fails back at commit 6c56914 debug: coredump: add shell commands for logging backend Jan 21. More soon...

jenmwms commented 3 years ago

Hi @enjiamai - I cannot find a point when this test passed for this platform. Perhaps my setup is incorrect, but for me this fails all the way back at the platform introduction b644432. Can you please confirm?

Booting Zephyr OS build zephyr-v2.4.0-2925-gb64443272033  ***
Running test suite interrupt_feature
===================================================================
START - test_isr_dynamic
SKIP - test_isr_dynamic
===================================================================
START - test_nested_isr
SKIP - test_nested_isr
===================================================================
START - test_prevent_interruption
locking interrupts

Next step instead of bisecting at last-known-working commit, I will investigate the test case and k_busy_wait() usage.

enjiamai commented 3 years ago

Hi @jenmwms, Yes, just like you said, I check it and also found this test case actually failed in the commit you mention. There is some thing wrong with previous test report. This issue happened more earlier than I excepted. Please give me some time to check it, thanks so much for your feedback!

jenmwms commented 3 years ago

I also tried walking the SDK version back a few releases jic, but that did not have effect. Still investigating things.

jenmwms commented 3 years ago

just a note - this is not an ACPI device so the CONFIG_KERNEL_VM_SIZE should be ok as is.

jenmwms commented 3 years ago

Ok, the issue predates the acrn_ehl_crb definitions. Looking more closely at these changes for the root cause - fe516b93a9 (passes) to 56f84b60ed (fails). Thanks @enjiamai for your help tracking this down!

jenmwms commented 3 years ago

Bisect resulted in the following as first bad commit. Seems relevant, I'll keep looking into it- c2c6bee036bb33eabdabd0587d0c4392f4e239f6 drivers/timer: Remove legacy APIC driver _Edit: Testing at this point before the acrn ehl crb definitions, this APIC driver change also requires a local change of the CONFIG_SYS_CLOCK_HW_CYCLES_PERSEC=1900000000. Still, no success resolving the issue tho. Let's try some good old prints to learn more about current status

jenmwms commented 3 years ago

The test progresses slightly if I change the following:

- k_busy_wait(MS_TO_US(1000));
+ k_sleep(K_MSEC(1000));
START - test_prevent_interruption
locking interrupts
timer fired
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/interrupt/src/prevent_irq.c:54: test_prevent_interruption: (handler_result equal to HANDLER_TOKEN)
timer interrupt was serviced while interrupts are locked
FAIL - test_prevent_interruption
jenmwms commented 3 years ago

In the test_prevent_interruption( ), the key is defined as type int and then being used to hold the return of the irq_lock( ) call. However, irq_lock( ) returns an unsigned integer "lock-out key". Something to note, maybe not the root cause yet...

jenmwms commented 3 years ago

using k_sleep() like mentioned above, the timer_handler( ) is getting called, which prints timer fired and the handler_result is assigned value of HANDLER_TOKEN which is 0xDEADBEEF.

The test is written such that the interrupts are locked, the timer is init'd, timer is started, then we wait for a time longer than the timer expiration in order to exercise the test of prevent interrupts. The timer ending should not interrupt the sleep.

I'm studying why this assert is thrown when it should not be:

zassert_not_equal(handler_result, HANDLER_TOKEN, "timer interrupt was serviced while interrupts are locked");
andyross commented 3 years ago

Are assertions not enabled for this build? Something should have bailed when you tried to sleep with interrupts locked. That's just not legal, what is happening (illegally -- again, I'm all but certain this is an ASSERT somewhere) is that you're entering sleep and context switching away, which then enables interrupts when the next thread (the idle thread in this case) gets to run. So the timer fires, incorrectly. That's the reason the test chose to busy wait to verify that the interrupt did not fire -- there's no other option in this context.

If this is hung in k_busy_wait() it should be easy enough to figure out where. A quick build shows that this is using drivers/timer/apic_timer.c as the timer driver, with CONFIG_APIC_TIMER_TSC unset. So it looks like the it's reading the APIC CCR register to get the time. I'm guessing ACRN is failing to advance that register when interrupts are masked, or otherwise making a mistake.

jenmwms commented 3 years ago

Thanks @andyross ! I'm sticking with the k_busy_wait(), gtk why it was selected. And yes, this was built with asserts enabled (by default for this test):

CONFIG_ASSERT=y
CONFIG_ASSERT_LEVEL=2

Is it advised to have the CONFIG_APIC_TIMER_TSC set? I can look into it on this board.

Sounds like I need to look into ACRN some here. Did the move to use the new APIC timer driver introduce this use of the APIC CCR register ? But to me it's strange, wouldn't we expect to see it failing to advance the register in other places/tests for this board?

jenmwms commented 3 years ago

Inspecting the "good" configurations for using the previous APIC timer driver at the bisect point, the .config shows that CONFIG_TSC_CYCLES_PER_SEC=0 (which is default setting from the driver) and CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC=25000000.

The x86 implementation of LOAPIC k_cycle_get_32() relies on the x86 TSC.
      This runs at the CPU speed and not the bus speed. If set to 0, the
      value of CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC will be used instead;
      many MCUs these values are the same.

Currently, acrn_ehl_crb_defconfig configures SYS_CLOCK_HW_CYCLES_PER_SEC=1900000000.

jenmwms commented 3 years ago

No obvious impact to this test when I revert to previous configurations for acrn_ehl_crb:

CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC=   25000000
CONFIG_APIC_TIMER_IRQ_PRIORITY=2
jenmwms commented 3 years ago

I got feedback that the APIC Timer invariant TSC configurations should be the same as the native EHL configurations. However, the APIC Timer is not a selected configuration for EHL natively. Looking at the x86 SW Developer Guide (section 17.17) does not specify what this should be but does reiterate what is stated/summarized in the driver for the TSC values and CPUID instruction to lookup the values.

jenmwms commented 3 years ago

something to consider, when acrn boots up there is a value printed for tsc calibration:

DEBUG   - DEVICE: ACRN Hypervisor
DEBUG   - DEVICE: calibrate_tsc, tsc_khz=1881600
jenmwms commented 3 years ago

something to consider, when acrn boots up there is a value printed for tsc calibration:

DEBUG   - DEVICE: ACRN Hypervisor
DEBUG   - DEVICE: calibrate_tsc, tsc_khz=1881600

I saw no obvious impact to resolve this issue using that TSC value.

jenmwms commented 3 years ago

In the test_prevent_interruption( ), the key is defined as type int and then being used to hold the return of the irq_lock( ) call. However, irq_lock( ) returns an unsigned integer "lock-out key". Something to note, maybe not the root cause yet...

FYI - https://github.com/zephyrproject-rtos/zephyr/issues/34023

jenmwms commented 3 years ago

Issue still remains if we set this (recently discovered needed for acrn_ehl_crb) in /boards/x86/acrn/acrn_ehl_crb_defconfig:

CONFIG_APIC_TIMER_IRQ=48

jenmwms commented 3 years ago

Found this - we can try similar for acrn_ehl_crb since our feedback was to use the same as native (ehl) tsc: https://github.com/zephyrproject-rtos/zephyr/blob/55e8807b08149a0a672a6faaa550500d8af2c994/soc/x86/elkhart_lake/Kconfig.defconfig#L24

config APIC_TIMER_TSC_M
    default 3

config APIC_TIMER_TSC_N
    default 249
jenmwms commented 3 years ago

^^ did not seem to make a difference. I tested implemented at the soc level, and also tried at the board level: CONFIG_APIC_TIMER_TSC_M=3 CONFIG_APIC_TIMER_TSC_N=249

jenmwms commented 3 years ago

^^ did not seem to make a difference. I tested implemented at the soc level, and also tried at the board level: CONFIG_APIC_TIMER_TSC_M=3 CONFIG_APIC_TIMER_TSC_N=249

Submitted https://github.com/zephyrproject-rtos/zephyr/pull/34790. But does not resolve this issue, so keep debugging...

jenmwms commented 3 years ago

FWIW - The test progresses with the use of newer APIC timer driver: https://github.com/zephyrproject-rtos/zephyr/issues/32603.

DEBUG   - DEVICE: START - test_prevent_interruption
DEBUG   - DEVICE: locking interrupts
DEBUG   - DEVICE: unlocking interrupts
DEBUG   - DEVICE: 
DEBUG   - DEVICE: Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/interrupt/src/prevent_irq.c:58: test_prevent_interruption: (handler_result not equal to HANDLER_TOKEN)
DEBUG   - DEVICE: timer should have fired
DEBUG   - DEVICE: FAIL - test_prevent_interruption in 2.2 seconds
enjiamai commented 3 years ago

Hi, @jenmwms it seems like the error became different, the timer did not interrupt after 1000ms. maybe we can try a 1100ms, 1010ms, or even 1001, 1002, to see if this error still happened.

jenmwms commented 3 years ago

Hi, @jenmwms it seems like the error became different, the timer did not interrupt after 1000ms. maybe we can try a 1100ms, 1010ms, or even 1001, 1002, to see if this error still happened.

Thanks for the suggestion! However, no impact when I increased the time (even at very large like 2000) https://github.com/zephyrproject-rtos/zephyr/blob/fd20ad28cea65e3e286e8075f065138d6acb8cdb/tests/kernel/interrupt/src/prevent_irq.c#L56

andyross commented 3 years ago

Drive by suggestion: by my read our APIC driver isn't setting the interrupt to the right mode. Leaving a zero in the trigger mode field of the LVT leaves the interrupt as edge triggered[1] and not level triggered[2]. Try the patch below. It might work, if not (e.g. if it hangs the system in an interrupt storm) we'll need to dig to make sure the EOI is being delivered correctly. See the IA SDM vol 3, chapter 10 for details.

[1] meaning it's delivered once on the notional rising edge of the input. Obviously the internal clock isn't a wire, but the idea persists. If one of these arrives while interrupts are masked, it gets lost.

[2] Latched by hardware when it goes high until such time as the interrupt is acknowledged to the generating hardware and it goes low again. These will be delivered after interrupts are unmasked, which is what we want.

diff --git a/drivers/interrupt_controller/intc_loapic.c b/drivers/interrupt_controller/intc_loapic.c
index 9e28669522..50d8533283 100644
--- a/drivers/interrupt_controller/intc_loapic.c
+++ b/drivers/interrupt_controller/intc_loapic.c
@@ -164,7 +164,7 @@ void z_loapic_enable(unsigned char cpu_number)

        /* lock the Local APIC interrupts */

-       x86_write_loapic(LOAPIC_TIMER, LOAPIC_LVT_MASKED);
+       x86_write_loapic(LOAPIC_TIMER, LOAPIC_LVT_MASKED|LOAPIC_LEVEL);
        x86_write_loapic(LOAPIC_ERROR, LOAPIC_LVT_MASKED);

        if (loApicMaxLvt >= LOAPIC_LVT_P6) {
jenmwms commented 3 years ago

Thanks @andyross . That patch for + LOAPIC_LEVEL did not resolve the issue, unfortunately.

jenmwms commented 3 years ago

Does this work better on ACRN v2.5 (unstable, currently)? We see some differences between v2.4 and the latest upstream.

enjiamai commented 3 years ago

Hi, @jenmwms , it seems like this doesn't work better on v2.5-unstable, it's sad.

ACRN:\>vm_console

----- Entering VM 0 Shell -----
*** Booting Zephyr OS build zephyr-v2.4.0-8720-g6c9945aafa00  ***
Running test suite interrupt_feature
===================================================================
START - test_isr_dynamic
vector(48)
 PASS - test_isr_dynamic in 0.1 seconds
===================================================================
START - test_nested_isr
 SKIP - test_nested_isr in 0.1 seconds
===================================================================
START - test_prevent_interruption
locking interrupts
unlocking interrupts

    Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/interrupt/src/prevent_irq.c:58: test_prevent_interruption: (handler_result not equal to HANDLER_TOKEN)
timer should have fired
 FAIL - test_prevent_interruption in 2.3 seconds
===================================================================
START - test_isr_regular
irq(27)=vector(64)
irq(28)=vector(65)
not testing irq enable/disable
 PASS - test_isr_regular in 0.1 seconds
===================================================================
START - test_isr_offload_job_multiple
jenmwms commented 3 years ago

Hi, @jenmwms , it seems like this doesn't work better on v2.5-unstable, it's sad.

Thanks for checking! Interesting that the time elapsed increased .1 seconds. We will keep this bug open and keep investigating.