Implement Cortex-R floating-point support

stephanosio commented 5 years ago

At the time of writing, the ARM Cortex-R port does not support the hardware floating-point unit (VFP and NEON).

Considering the common application scenarios for Cortex-R (real-time processing), hardware floating-point unit support is imperative; without it, the practical usability of the Cortex-R port is questionable.

Overview

An overview of the hardware floating-point unit for Cortex-R is as follows:

Cortex-R4 and Cortex-R5
- VFPv3-D16 may be optionally instantiated.
- 16 double-precision registers are available (instead of 32 in full VFPv3).
- Single-precision and double-precision operations are supported.
- Half-precision operations are not supported.
- Vector operations are not supported (software emulation is possible).
- R5 may also optionally instantiate single-precision only VFPv3-D16.

Cortex-R7 and Cortex-R8
- VFPv3-D16-FP16 may be optionally instantiated.
- Two variants of VFPv3-D16-FP16 are available: Optimised and Full.
- Optimised variant implements only single-precision and half-precision operations.
- Full variant implements single-precision, half-precision and double-precision operations.
- Vector operations are not supported (software emulation is possible).

Cortex-R52
- VFPv3-FP16 or VFPv3-D16-FP16 must be instantiated.
- Two variants are available: SP-only and Full Advanced SIMD.
- SP-only variant
  - VFPv3-D16 (i.e. with 16 double-precision registers) is instantiated.
  - Half-precision and single-precision operations are supported.
- Full Advanced SIMD variant
  - VFPv3 (i.e. with 32 double-precision registers) is instantiated.
  - NEON is instantiated.
  - Half-precision, single-precision, double-precision and vector operations are supported (with some optional subset specified by MVFR).
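
Since the FPU configuration varies this much between cores, a port would probably want to read it from the MVFR registers at run time rather than hard-code it. The following is a minimal sketch of decoding the architecturally defined MVFR0 fields (register bank size in bits [3:0], single-precision support in bits [7:4], double-precision support in bits [11:8]). The `fpu_caps` structure and the function names are hypothetical and exist only for illustration; reading MVFR0 itself (a `vmrs` instruction on the target) is left out so that the code stays portable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of what an FPU instance provides. */
struct fpu_caps {
    unsigned int dregs;     /* number of 64-bit D registers: 16, 32, or 0 if unknown */
    bool single_precision;  /* single-precision operations implemented in hardware */
    bool double_precision;  /* double-precision operations implemented in hardware */
};

/* Extract the 4-bit MVFR0 field that starts at bit `shift`. */
static inline uint32_t mvfr0_field(uint32_t mvfr0, unsigned int shift)
{
    assert(shift <= 28u);
    return (mvfr0 >> shift) & 0xFu;
}

/* Decode a raw MVFR0 value into the hypothetical fpu_caps summary. */
static struct fpu_caps fpu_caps_from_mvfr0(uint32_t mvfr0)
{
    struct fpu_caps caps = { 0u, false, false };
    uint32_t regs = mvfr0_field(mvfr0, 0u);

    if (regs == 1u) {
        caps.dregs = 16u;   /* D16 variants */
    } else if (regs == 2u) {
        caps.dregs = 32u;   /* full register bank */
    }
    /* A non-zero field value means the precision is supported. */
    caps.single_precision = mvfr0_field(mvfr0, 4u) != 0u;
    caps.double_precision = mvfr0_field(mvfr0, 8u) != 0u;
    return caps;
}
```

With this decoding, a single-precision only VFPv3-D16 would report 16 registers with `double_precision` false, which is the case the optional instruction emulation would need to cover.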

Specifications

ARM Cortex-R floating-point support feature shall:

1. support Unshared FP registers mode and Shared FP registers mode.
   i. In both modes, the kernel shall initialise the FPU. In addition, all FP registers shall be initialised if DCLS (dual-redundant core lock-step) is configured.
   ii. In Unshared FP registers mode, the kernel shall configure the FPU and leave it in operational state after booting.
   iii. In Shared FP registers mode, all threads shall assume "FP disabled" state by default.
2. optionally support emulation of the VFP instructions that are unimplemented by hardware.
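
The boot-time difference between the two modes comes down to two registers: CPACR must grant access to coprocessors 10 and 11 in both modes, while FPEXC.EN is left set only in Unshared mode. The sketch below computes the register values only; the helper and macro names are hypothetical, and the actual register accesses (`mrc`/`mcr` for CPACR, `vmrs`/`vmsr` for FPEXC) are omitted.

```c
#include <stdint.h>

/* Architectural bit positions: CPACR cp10 = bits [21:20], cp11 = bits [23:22],
 * where 0b11 grants full access; FPEXC.EN = bit 30. */
#define CPACR_CP10_FULL_ACCESS (3u << 20)
#define CPACR_CP11_FULL_ACCESS (3u << 22)
#define FPEXC_EN               (1u << 30)

/* Value to write back to CPACR so that the VFP/NEON coprocessors are accessible. */
static inline uint32_t cpacr_with_fpu_access(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_FULL_ACCESS | CPACR_CP11_FULL_ACCESS;
}

/* Value to write back to FPEXC with the EN bit set or cleared. */
static inline uint32_t fpexc_set_enabled(uint32_t fpexc, int enable)
{
    return enable ? (fpexc | FPEXC_EN) : (fpexc & ~FPEXC_EN);
}
```

Under this sketch, Unshared mode would finish booting with `fpexc_set_enabled(fpexc, 1)`, and Shared mode with `fpexc_set_enabled(fpexc, 0)` so that the first FP instruction of each thread traps.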

ARM Cortex-R floating-point support feature, for Shared FP registers mode, shall:

3. manage FP enable status at thread level, in conformance with the kernel FP interface.
   i. K_FP_REGS option shall be used to specify whether thread-wide floating-point support is enabled.
   ii. K_FP_REGS option may be (re-)enabled only for the threads that were initially created with the same option.
4. disable FPU after a context switch and re-enable it upon exception.
   i. After a context switch occurs, FPU shall be disabled by setting FPEXC.EN=0.
   ii. FPU shall be re-enabled after the first FP instruction that caused undefined instruction exception if and only if K_FP_REGS option is set for the thread.
   iii. The first FP instruction that caused an undefined instruction exception shall be re-executed after setting FPEXC.EN=1.
5. store s0-s15 and FPSCR in exception stack frame and s16-s31 in thread context.
   i. s0-s15 and FPSCR shall be stored in the exception stack frame in order to allow the optional use of FP registers inside a nested interrupt handler.
   ii. s16-s31 shall be preserved only for a thread context switch (and not for an exception entry).
6. implement lazy stacking of FP context.
   i. All context switching-capable exception handler routines shall save only the basic stack frame upon entry.
   ii. s0-s15 shall be stored to the exception stack frame during a context switch if FPU is enabled; in which case, the thread being switched out shall be marked to indicate that FP context is saved.
   iii. s0-s15 shall be restored if the thread being switched in is marked to contain an FP context.
7. preserve s16-s31 during context switch only when FPU is enabled.
   i. If FPU is enabled at the time of context switching, at least one FP instruction had been executed after the previous context switch and the FP context must be preserved.
   ii. FP context preservation of s16-s31 shall be implied by that of s0-s15.
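
The interaction between the FPEXC.EN rule, the lazy save and the per-thread marking can be sketched as a small host-side model. This is not the port's implementation: all names are hypothetical, `MODEL_FP_REGS` merely stands in for K_FP_REGS, and the model keeps the whole saved context in the thread structure instead of splitting it between the exception stack frame (s0-s15, FPSCR) and the thread context (s16-s31).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_FP_REGS 0x1u /* hypothetical stand-in for the K_FP_REGS thread option */

struct model_thread {
    uint32_t options;      /* MODEL_FP_REGS set if the thread may use the FPU */
    bool fp_ctx_saved;     /* marks that saved_s/saved_fpscr hold a valid FP context */
    float saved_s[32];     /* s0-s31 (kept in one place in this model) */
    uint32_t saved_fpscr;
};

struct model_cpu {
    bool fpexc_en;         /* models FPEXC.EN */
    float s[32];
    uint32_t fpscr;
    struct model_thread *current;
};

/* Undefined-instruction path for an FP instruction executed with the FPU off.
 * Returns true if the instruction should be re-executed. */
static bool model_fp_trap(struct model_cpu *cpu)
{
    if ((cpu->current->options & MODEL_FP_REGS) == 0u) {
        return false;      /* thread may not use FP: genuine fault */
    }
    cpu->fpexc_en = true;  /* FPEXC.EN=1, then retry the instruction */
    return true;
}

/* Models one FP register write; returns false if the access faults. */
static bool model_fp_write(struct model_cpu *cpu, unsigned int idx, float value)
{
    assert(idx < 32u);
    if (!cpu->fpexc_en && !model_fp_trap(cpu)) {
        return false;
    }
    cpu->s[idx] = value;
    return true;
}

/* Context switch: save only if the FPU is on, restore only if marked. */
static void model_context_switch(struct model_cpu *cpu, struct model_thread *next)
{
    struct model_thread *prev = cpu->current;

    if (cpu->fpexc_en) {
        /* FPU on means prev executed at least one FP instruction. */
        memcpy(prev->saved_s, cpu->s, sizeof(cpu->s));
        prev->saved_fpscr = cpu->fpscr;
        prev->fp_ctx_saved = true;
    }
    if (next->fp_ctx_saved) {
        memcpy(cpu->s, next->saved_s, sizeof(cpu->s));
        cpu->fpscr = next->saved_fpscr;
    }
    cpu->fpexc_en = false; /* FPEXC.EN=0 after every switch */
    cpu->current = next;
}
```

In this model the mark is never cleared: if a thread does not touch the FPU during a time slice, nothing is saved when it is switched out, and its previously saved copy is still the correct context to restore later.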

Note

This feature will also apply to the Cortex-A port when it is added in the future, as the Cortex-R architecture is very similar to Cortex-A.
Specification 3-ii should really apply to Cortex-M as well.

bbolen commented 4 years ago

@stephanosio I found your https://github.com/stephanosio/zephyr/commits/aarch32_non_m_fp_alt branch and pulled those changes. I wanted to see if you have done any more work on this issue, specifically have you implemented anything mentioned in the "Specifications 3-7" section above?

stephanosio commented 4 years ago

> @stephanosio I found your https://github.com/stephanosio/zephyr/commits/aarch32_non_m_fp_alt branch and pulled those changes. I wanted to see if you have done any more work on this issue, specifically have you implemented anything mentioned in the "Specifications 3-7" section above?

@bbolen There is a Cortex-A (and -R) FP sharing implementation that sort of works here: https://github.com/ibirnbaum/zephyr/blob/armv7_cortex_a/arch/arm/core/aarch32/swap_helper.S

bbolen commented 4 years ago

@stephanosio thank you

bbolen commented 4 years ago

Can you elaborate on the description above with respect to items 5 and 6? I'm struggling to understand why the VFP registers would be saved on the exception stack and not the thread context in the normal case. I can see needing to temporarily put them on the exception stack in order for the exception handler to use the VFP unit, but I would assume they would be popped off that stack and pushed onto the thread context during the context switch.

bbolen commented 3 years ago

Here is a working implementation of floating-point support for Cortex-R. It does lazy context switching. It is based on v2.3.0. There are some conflicts with HEAD, but it will be a while before I can get around to looking at those. I'm putting this out there in case others need a starting point for FPU support before I can get this merge-worthy.

https://github.com/bbolen/zephyr/commits/cortex_r_fpu

legath commented 3 years ago

Some R4F SoCs have double precision. For example, a quote from the TI Hercules brochure:

Floating Point Unit (FPU)
- FPU is compliant to IEEE754
- 16 double-word (64 bits) registers
- 32 single-word (32 bits) registers
- Supports features:
  - Single-precision and double-precision add, subtract, multiply, divide, multiply and accumulate, and square root operations
  - Conversions between fixed-point and floating-point data formats, etc
  - Comparisons
  - Underflow
  - Exceptions

shrhrw commented 3 years ago

Hi @bbolen , I am working on supporting a cortex-r5f chip in zephyr, and I've spliced in your code from this post: https://github.com/zephyrproject-rtos/zephyr/issues/19979#issuecomment-758091058

I have it building; however, when trying to flash it onto the board I am encountering the following error:

Debug: 387 144 cortex_a.c:301 cortex_a_exec_opcode(): exec opcode 0xee000e15
Debug: 388 145 armv4_5.c:496 arm_set_cpsr(): set CPSR 0x000003db: Undefined instruction mode, ARM state

When looking at the ARM documentation here: https://developer.arm.com/documentation/ddi0406/b/System-Level-Architecture/The-System-Level-Programmers--Model/Exceptions/Undefined-Instruction-exception?lang=en

I found this section: The Undefined Instruction exception can be used for:

- software emulation of a coprocessor in a system that does not have the physical coprocessor hardware
- lazy context switching of coprocessor registers
- general-purpose instruction set extension by software emulation
- signaling an illegal instruction execution
- division by zero errors.

Do you know if my error is a coincidence, or related in the way described? If so, do you have a suggestion?

bbolen commented 3 years ago

It could be related. The FPU is usually disabled. When the code gets to a floating point instruction, an undefined instruction happens, the FPU gets enabled, and execution starts again on the floating point instruction that caused the fault. So one undefined instruction exception would be expected when using floating point, but it wouldn't crash anything.

I'm unavailable for the rest of the week, but I'll look closer at your details above on Monday.
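
For the trap-and-retry flow described above to work, the undefined-instruction handler has to decide whether the faulting instruction is an FP instruction at all. The following is a minimal sketch of one possible check for A32 opcodes, assuming that only the coprocessor-form VFP encodings need to be recognised: bits [27:25] equal to 0b110 or bits [27:24] equal to 0b1110, with the coprocessor number in bits [11:8] equal to 10 or 11. The function names are hypothetical, and Advanced SIMD (NEON) data-processing encodings, which use a different opcode space, are deliberately not covered.

```c
#include <stdbool.h>
#include <stdint.h>

/* Coprocessor number field, bits [11:8], of an A32 coprocessor instruction. */
static inline uint32_t a32_coproc_number(uint32_t insn)
{
    return (insn >> 8) & 0xFu;
}

/* True if `insn` is a conditional A32 coprocessor-form instruction
 * that targets cp10 or cp11 (the VFP coprocessors). */
static bool a32_is_vfp_coproc_insn(uint32_t insn)
{
    uint32_t cond = insn >> 28;
    uint32_t op = (insn >> 24) & 0xFu;  /* bits [27:24] */
    uint32_t coproc = a32_coproc_number(insn);

    if (cond == 0xFu) {
        return false;  /* unconditional space: not a VFP coprocessor-form encoding */
    }
    /* Load/store and register transfer forms: bits [27:25] == 0b110.
     * Data-processing and MCR/MRC forms: bits [27:24] == 0b1110. */
    if ((op & 0xEu) != 0xCu && op != 0xEu) {
        return false;
    }
    return coproc == 10u || coproc == 11u;
}
```

As an example of applying the check, the opcode 0xee000e15 quoted from the OpenOCD log earlier in this thread has 0xE (coprocessor 14) in bits [11:8], so this check would not classify it as a VFP instruction, whereas 0xee300a20 (`vadd.f32 s0, s0, s1`) would be.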

shrhrw commented 3 years ago

`Open On-Chip Debugger 0.11.0+dev-00242-g7036ed509-dirty (2021-08-03-17:04)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
Info : TI BE-32 quirks mode is enabled
Info : XDS110: connected
Info : XDS110: vid/pid = 0451/bef3
Info : XDS110: firmware version = 3.0.0.16
Info : XDS110: hardware version = 0x0029
Info : XDS110: connected to target via JTAG
Info : XDS110: TCK set to 2500 kHz
Info : clock speed 1500 kHz
Info : JTAG tap: tms570.jrc tap/device found: 0x1b95a02f (mfg: 0x017 (Texas Instruments), part: 0xb95a, ver: 0x1)
Info : JTAG tap: tms570.cpu enabled
Info : tms570.cpu: hardware has 8 breakpoints, 8 watchpoints
Info : starting gdb server for tms570.cpu on 3333
Info : Listening on port 3333 for gdb connections
TargetName Type Endian TapName State

0* tms570.cpu cortex_r4 big tms570.cpu running

Info : JTAG tap: tms570.jrc tap/device found: 0x1b95a02f (mfg: 0x017 (Texas Instruments), part: 0xb95a, ver: 0x1)
Info : JTAG tap: tms570.cpu enabled
Warn : tms570.cpu: ran after reset and before halt ...
Info : tms570.cpu: MPIDR level2 0, cluster 0, core 0, mono core, no SMT
target halted in ARM state due to debug-request, current mode: Undefined instruction
cpsr: 0x000003db pc: 0x00000004
D-Cache: disabled, I-Cache: disabled
flash
flash bank bank_id driver_name base_address size_bytes chip_width_bytes bus_width_bytes target [driver_options ...]
flash banks
flash init
flash list
gdb_flash_program ('enable'|'disable')
nand
program [address] [pre-verify] [verify] [reset] [exit]

Info : XDS110: disconnected
FATAL ERROR: command exited with status 1`

-- Application: /home/smith/zephyrproject/zephyr/samples/hello_world
-- Zephyr version: 2.7.0-rc1 (/home/smith/zephyrproject/zephyr), build: v1.12.0-34809-g29387287d9f7
-- Found Python3: /usr/bin/python3.8 (found suitable exact version "3.8.10") found components: Interpreter
-- Found west (found suitable version "0.11.1", minimum required is "0.7.1")
-- Board: hercules_tms570lc43x
-- Cache files will be written to: /home/smith/.cache/zephyr
-- Using toolchain: zephyr 0.13.0 (/home/smith/zephyr-sdk-0.13.0)
-- Open On-Chip Debugger 0.11.0+dev-00358-g6c1e1a212-dirty (2021-08-26-13:54)

My local zephyr repo was cloned from your repo here: https://github.com/bbolen/zephyr/commits/cortex_r_fpu and I updated it to the latest version.
