zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
11.01k stars 6.7k forks source link

Kernel interrupt test hangs when running with IRQ lock intolerant timer #23519

Open stephanosio opened 4 years ago

stephanosio commented 4 years ago

Describe the bug The tests/kernel/interrupt test (in particular, test_prevent_interruption) hangs when running on a platform that uses a system timer that requires the timer interrupt to be serviced in order for z_timer_cycle_get_32 to return the correct value (i.e. "IRQ lock intolerant timer").

These platforms include:

To Reproduce Build and run tests/kernel/interrupt for any of the aforementioned platforms.

Expected behavior test_prevent_interruption successfully completes.

Screenshots or console output

*** Booting Zephyr OS build zephyr-v2.2.0-381-g1c6487474842  ***
Running test suite interrupt_feature
===================================================================
starting test - test_isr_dynamic
PASS - test_isr_dynamic
===================================================================
starting test - test_nested_isr
SKIP - test_nested_isr
===================================================================
starting test - test_prevent_interruption
locking interrupts

Environment (please complete the following information):

Analysis This happens because test_prevent_interruption calls k_busy_wait, which uses z_timer_cycle_get_32 for time reference, after locking (disabling) interrupts.

Since the z_timer_cycle_get_32 implementation for the aforementioned platforms returns the value of an internal counter that is updated by the timer ISR, the k_busy_wait function, which relies on that function, hangs when interrupt is locked.

Proposed solution ~Use a wait mechanism that does not rely on the z_timer_cycle_get_32 function in the test.~

Since it is indeed expected of z_timer_cycle_get_32 (and therefore k_busy_wait) to work when IRQ is locked, we must fix the system timer drivers for the aforementioned platforms instead.

stephanosio commented 4 years ago

cc @andrewboie

stephanosio commented 4 years ago

Related to #22956

andrewboie commented 4 years ago

@stephanosio the problem here is that when IRQs are locked, k_busy_wait() hangs forever. This is the bug, the issue is not the test, the test happens to expose it.

I would NACK any patch that changes the test to sweep this issue under the rug. If anything, a specific dedicated test that shows that k_busy_wait() works with IRQs locked needs to be added

andyross commented 4 years ago

Jumping in with a quick triage on those two drivers, just for postericty: yeah, what both of these do is return a static in-memory count from z_timer_cycle_get_32() which is updated only from the timer ISR. So anything that blocks the timer ISR from arriving will prevent this clock from advancing.

The intent of this API is that it should be a synchronous read from some sort of hardware counter, so this scheme (implementing the cycle counter in terms of the timer driver) really can't be correct. Though apparently we can get away with it for a lot of tests (most k_busy_wait() happens in drivers, which are platform-specific anyway).

I don't know the Altera and Xilinx hardware in question. Surely they have a register where we can read the current count, no?

andyross commented 4 years ago

This (unused) register in the Xilinx driver looks promising:

#define XTTCPS_COUNT_VALUE_OFFSET  0x00000018U  /**< Current Counter Value */

The Altera driver is on top of a HAL with thicker headers, so I gave up looking. I'm sure it has one too though.

andrewboie commented 4 years ago

@andyross Altera requires another timer source AFAIK. There is a comment that explains this. This is a virtual FPGA CPU so it's possible to configure this when defining the CPU itself. Nobody's had time or inclinationt to add support yet.

    /* Per the Altera Embedded IP Peripherals guide, you cannot
     * use a timer instance for both the system clock and timestamps
     * at the same time.
     *
     * Having this function return accumulated_cycle_count + get_snapshot()
     * does not work reliably. It's possible for the current countdown
     * to reset to the next interval before the timer interrupt is
     * delivered (and accumulated cycle count gets updated). The result
     * is an unlucky call to this function will appear to jump backward
     * in time.
     *
     * To properly obtain timestamps, the CPU must be configured with
     * a second timer peripheral instance that is configured to
     * count down from some large initial 64-bit value. This
     * is currently unimplemented.
     */
stephanosio commented 4 years ago

If anything, a specific dedicated test that shows that k_busy_wait() works with IRQs locked needs to be added

@andrewboie That would be helpful. There seem to be quite a few (at least 3 IIRC) timer drivers that do not conform to this.

stephanosio commented 4 years ago

@andyross For Xilinx (and possibly Altera), we would need to do one of the following:

  1. Switch the timer mode of operation from "Interval" to "Match" (i.e. convert it to a "running timer") and return its register value.
  2. Use another timer as an always-running timer and return its register value.
stephanosio commented 4 years ago

It seems like I might already have a fix here (I was going to implement tickless support for TTC, but decided to drop it because QEMU didn't like it; #19869): https://github.com/stephanosio/zephyr/commit/9da8fdef6db8d29e658cbafb3e69f9fca6dd6961

I will create a PR for this today (without tickless support).

stephanosio commented 4 years ago

As far as I can see, the litex_timer should have the same problem.

@mateusz-holenko @kgugala @pgielda Could you confirm?

mateusz-holenko commented 4 years ago

@stephanosio It seems so after reading the code, but we'll verify this on HW.

andyross commented 4 years ago

One note relative to the issues being discussed here: devices without access to a proper cycle counter can solve this alternatively with CONFIG_ARCH_HAS_CUSTOM_BUSY_WAIT and an appropriate amount of magic (e.g. a hand-tuned assembly loop). The use of z_timer_get_cycle_32() is just the default implementation.

mateusz-holenko commented 4 years ago

@stephanosio I confirm that the test fails on the HW:

(venv) grey@grey:~/localantmicro/zephyr/sanity-out/litex_vexriscv/arch.interrupt$ sudo flterm --port /dev/ttyUSB1 --kernel zephyr/zephyr.bin --speed 115200
[sudo] password for grey: 
[FLTERM] v2.4-31-g8898c78 Starting...
        __   _ __      _  __
       / /  (_) /____ | |/_/
      / /__/ / __/ -_)>  <
     /____/_/\__/\__/_/|_|
 (c) Copyright 2012-2019 Enjoy-Digital
 BIOS built on Mar 16 2020 09:58:40
 BIOS CRC passed (74875417)
 Migen git sha1: d11565a
 LiteX git sha1: 02bfda5e
--=============== SoC ==================--
CPU:       VexRiscv @ 100MHz
ROM:       32KB
SRAM:      32KB
L2:        8KB
MAIN-RAM:  262144KB
--========== Initialization ============--
Initializing SDRAM...
SDRAM now under software control
Read leveling:
m0, b0: |11111111110000000000000000000000| delays: 05+-05
m0, b1: |00000000000000111111111111000000| delays: 20+-06
m0, b2: |00000000000000000000000000000001| delays: 31+-00
m0, b3: |00000000000000000000000000000000| delays: -
m0, b4: |00000000000000000000000000000000| delays: -
m0, b5: |00000000000000000000000000000000| delays: -
m0, b6: |00000000000000000000000000000000| delays: -
m0, b7: |00000000000000000000000000000000| delays: -
best: m0, b1 delays: 20+-06
m1, b0: |11111111110000000000000000000000| delays: 05+-05
m1, b1: |00000000000000111111111111000000| delays: 20+-06
m1, b2: |00000000000000000000000000000001| delays: 31+-00
m1, b3: |00000000000000000000000000000000| delays: -
m1, b4: |00000000000000000000000000000000| delays: -
m1, b5: |00000000000000000000000000000000| delays: -
m1, b6: |00000000000000000000000000000000| delays: -
m1, b7: |00000000000000000000000000000000| delays: -
best: m1, b1 delays: 20+-06
SDRAM now under hardware control
Memtest OK
--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
[FLTERM] Received firmware download request from the device.
[FLTERM] Uploading kernel (26828 bytes)...
[FLTERM] Upload complete (7.7KB/s).
[FLTERM] Booting the device.
[FLTERM] Done.
Executing booted program at 0x40000000
--============= Liftoff! ===============--
*** Booting Zephyr OS build zephyr-v2.2.0-313-g4fd38265224d  ***
Running test suite interrupt_feature
===================================================================
starting test - test_isr_dynamic
PASS - test_isr_dynamic
===================================================================
starting test - test_nested_isr
Triggering irq : 90
PASS - test_nested_isr
===================================================================
starting test - test_prevent_interruption
locking interrupts

We'll look into this.

stephanosio commented 4 years ago

@stephanosio I confirm that the test fails on the HW:

@mateusz-holenko Thanks for checking. I have created an issue to track this: #23622.