zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

Poor sinf/cosf performance compared to the Segger math libraries #23723

Closed too1 closed 3 years ago

too1 commented 4 years ago

The performance of the sinf/cosf functions in NEWLIB on Zephyr seems very poor compared to the Segger C libraries used in the nRF5 SDK for the nRF52 series. Preliminary numbers (until I have time to do a more controlled test) imply a runtime of around 15us on the nRF53 @ 128MHz, vs ~2.5us on the nRF52 @ 64MHz.

An interesting aspect is that replacing cosf with cos only leads to a 30-40% increase in runtime, which indicates that the cosf implementation doesn't properly utilize the floating point unit in the nRF53 (32-bit floating point with HW acceleration should be significantly faster than 64-bit float without HW acceleration).

Update: I did some more testing and, to my surprise, found that in a simpler test the difference between the SES libraries and the NEWLIB libraries was much smaller. The following code snippet ran in 2.65 ms in the SES build vs 2.95 ms using the current Zephyr master branch with NEWLIB.

#define ITERATIONS 1000
float PI = 3.141592653f;
float input = -2.0f*PI;
float epsilon = 4.0f*PI / (float)ITERATIONS;
float result;
// Start measurement
for(int i = 0; i < ITERATIONS; i++)
{
  result = cosf(input);
  input += epsilon;
}
// Stop measurement

The CONFIG_NEWLIB_LIBC_NANO parameter did not appear to have an impact.

Investigating my original algorithm I realized that the input values to the cosf/sinf functions were much larger, and when I changed my test above to initialize input to -200*PI instead of -2*PI the results were dramatically different. The Zephyr runtime increased to 39.6 ms (more than 13 times slower), while the SES implementation only increased runtime by about 1%.

So I think this explains why I saw very poor performance in the first place. Either the NEWLIB implementation is not well suited to handling large positive or negative input values to the cosf/sinf functions, or something is wrong in my project configuration.

stephanosio commented 4 years ago

The newlib libm implementation is in C and there is currently no ARM-optimised (asm) implementation available (hence, slow performance).

I expect these math performance issues to be mostly addressed by adding the CMSIS-DSP library support (I am currently working on this; there is a preliminary PR for it too: #21600).

stephanosio commented 4 years ago

If you can, please try benchmarking against arm_sin_f32 and arm_cos_f32 in the CMSIS-DSP PR (#21600); tests/lib/cmsis_dsp/fastmath would be a good base to use.

stephanosio commented 4 years ago

> which indicates that the cosf implementation doesn't properly utilize the floating point unit

Please check/post the asm listing of the sinf and cosf functions.

too1 commented 4 years ago

I tried using PR #21600, but I don't see improved performance (actually it is slightly slower, 3.0/40.6ms vs 2.96/39.6ms).

I use this prj.conf, with floating point enabled:

CONFIG_GPIO=y
CONFIG_NEWLIB_LIBC=y

.map file added for reference: zephyr_map.zip

Do I need to do something to enable the new math libraries?

stephanosio commented 4 years ago

@too1 When running performance tests, please make sure that all operational variables are declared volatile so that the compiler doesn't try anything funny (e.g. nuking them).

I have tested newlib cosf and CMSIS-DSP arm_cos_f32 functions and I see quite a significant performance difference between them:

input      | cosf (newlib) | arm_cos_f32 (CMSIS-DSP)
2 * pi     | 1,408,960     | 577,594
200 * pi   | 15,040,964    | 578,479

(in cycles, measured with DWT on Cortex-M7; ATSAME70 Rev. B)

Test code: https://gist.github.com/stephanosio/3a2de5e7f192acd5860d9ce866e1f2ba

You can also check out the following test branch and run the modified samples/hello_world: https://github.com/stephanosio/zephyr/tree/cmsis_dsp_perf_test/samples/hello_world

stephanosio commented 3 years ago

Closing this since the relatively slow performance of the trigonometric functions provided by newlib (libm) on ARM is really an upstream newlib issue and not a Zephyr issue.

A workaround is available through the CMSIS-DSP library, which provides optimised, high-performance implementations of the trigonometric functions for ARM targets.