Unsatisfactory kernel benchmark results on SAM E-70 Xplained

tgorochowik commented 6 years ago

Hello all

Note: this issue was first noticed by @jukkar in #7907

It seems that context switching and other operations on sam e-70 are slower than they should be. Below are the results of a benchmark running on sam e-70 vs frdm k64f.

sam e-70 xplained:

***** Booting Zephyr OS v1.12.0-rc2-75-g7a18b083e *****
starting test - Time Measurement
Timing Results: Clock Frequency: 300 MHz
Tick overhead                                : 730 cycles ,  2433 ns
Thread Creation                              :1340 cycles ,  4466 ns
Thread cancel                                : 634 cycles ,  2113 ns
Thread abort                                 :1906 cycles ,  6353 ns
Thread Suspend                               :2024 cycles ,  6746 ns
Thread Resume                                :1950 cycles ,  6500 ns
Thread Yield                                 :1487 cycles ,  4956 ns
Thread Sleep                                 :1810 cycles ,  6033 ns
Heap Malloc                                  :3476 cycles , 11586 ns
Heap Free                                    :3081 cycles , 10270 ns
Semaphore Take with context switch           :2224 cycles ,  7413 ns
Semaphore Give with context switch           :2740 cycles ,  9133 ns
Semaphore Take without context switch        : 318 cycles ,  1060 ns
Semaphore Give without context switch        : 872 cycles ,  2906 ns
Mutex lock                                   : 752 cycles ,  2506 ns
Mutex unlock                                 :1047 cycles ,  3490 ns
Message Queue Put with context switch        :3154 cycles , 10513 ns
Message Queue Put without context switch     :1172 cycles ,  3906 ns
Message Queue get with context switch        :3564 cycles , 11880 ns
Message Queue get without context switch     :1232 cycles ,  4106 ns
MailBox synchronous put                      :3916 cycles , 13053 ns
MailBox synchronous get                      :3724 cycles , 12413 ns
MailBox asynchronous put                     :1688 cycles ,  5626 ns
MailBox get without context switch           :3164 cycles , 10546 ns
Timing Measurement  finished
PASS - main
===================================================================
===================================================================
PROJECT EXECUTION SUCCESSFUL

frdm k64f:

***** Booting Zephyr OS v1.12.0-rc2-75-g7a18b083e *****
starting test - Time Measurement
Timing Results: Clock Frequency: 120 MHz
Tick overhead                                : 186 cycles ,  1550 ns
Thread Creation                              : 335 cycles ,  2791 ns
Thread cancel                                : 109 cycles ,   908 ns
Thread abort                                 : 435 cycles ,  3625 ns
Thread Suspend                               : 465 cycles ,  3875 ns
Thread Resume                                : 480 cycles ,  4000 ns
Thread Yield                                 : 407 cycles ,  3391 ns
Thread Sleep                                 : 450 cycles ,  3750 ns
Heap Malloc                                  : 846 cycles ,  7050 ns
Heap Free                                    : 767 cycles ,  6391 ns
Semaphore Take with context switch           : 570 cycles ,  4750 ns
Semaphore Give with context switch           : 585 cycles ,  4875 ns
Semaphore Take without context switch        :  59 cycles ,   491 ns
Semaphore Give without context switch        : 193 cycles ,  1608 ns
Mutex lock                                   : 235 cycles ,  1958 ns
Mutex unlock                                 : 291 cycles ,  2425 ns
Message Queue Put with context switch        : 675 cycles ,  5625 ns
Message Queue Put without context switch     : 268 cycles ,  2233 ns
Message Queue get with context switch        : 770 cycles ,  6416 ns
Message Queue get without context switch     : 270 cycles ,  2250 ns
MailBox synchronous put                      : 865 cycles ,  7208 ns
MailBox synchronous get                      : 825 cycles ,  6875 ns
MailBox asynchronous put                     : 435 cycles ,  3625 ns
MailBox get without context switch           : 730 cycles ,  6083 ns
Timing Measurement  finished
PASS - main
===================================================================
===================================================================
PROJECT EXECUTION SUCCESSFUL

What was checked so far:

There doesn't seem to be any defconfig differences that could add such overhead
Clock settings on same70 seem to be correct according to the datasheet

Is there any reason why sam-e70 would need much more time perform the same task despite having a faster clock? Is there any room for improvements? Any insights or explanations are welcome!

Thank you!

aurel32 commented 6 years ago

I don't have such a board nor chip, but I might have an idea about the issue. Looking at the code, it seems that the caches are not enabled during boot on the SAM E70 SoC. You might want something like this commit: https://github.com/aurel32/zephyr/commit/bb67dbc90d1ca86331cb3f0a794165fa0768fd64

The corresponding patch for the SAM E70 SoC is probably that one (untested):

diff --git a/arch/arm/soc/atmel_sam/same70/soc.c b/arch/arm/soc/atmel_sam/same70/soc.c
index 86ed0d10b..0a37d6683 100644
--- a/arch/arm/soc/atmel_sam/same70/soc.c
+++ b/arch/arm/soc/atmel_sam/same70/soc.c
@@ -227,6 +227,9 @@ static int atmel_same70_init(struct device *arg)

        key = irq_lock();

+       SCB_EnableICache();
+       SCB_EnableDCache();
+
        /* Clear all faults */
        _ClearFaults();

nashif commented 6 years ago

@aurel32 the above does not help on its own, I just noticed that disabling MPU on this SoC with your patch, helps significantly:

***** Booting Zephyr OS v1.12.0-rc2-96-gb5a3ddf7d7 *****
starting test - Time Measurement
Timing Results: Clock Frequency: 300 MHz
Tick overhead                                : 502 cycles ,  1673 ns
Thread Creation                              : 327 cycles ,  1090 ns
Thread cancel                                : 216 cycles ,   720 ns
Thread abort                                 : 515 cycles ,  1716 ns
Thread Suspend                               : 395 cycles ,  1316 ns
Thread Resume                                : 282 cycles ,   940 ns
Thread Yield                                 : 205 cycles ,   683 ns
Thread Sleep                                 : 471 cycles ,  1570 ns
Heap Malloc                                  : 573 cycles ,  1910 ns
Heap Free                                    : 511 cycles ,  1703 ns
Semaphore Take with context switch           : 353 cycles ,  1176 ns
Semaphore Give with context switch           : 398 cycles ,  1326 ns
Semaphore Take without context switch        :  55 cycles ,   183 ns
Semaphore Give without context switch        : 154 cycles ,   513 ns
Mutex lock                                   : 114 cycles ,   380 ns
Mutex unlock                                 : 162 cycles ,   540 ns
Message Queue Put with context switch        : 565 cycles ,  1883 ns
Message Queue Put without context switch     : 207 cycles ,   690 ns
Message Queue get with context switch        : 597 cycles ,  1990 ns
Message Queue get without context switch     : 192 cycles ,   640 ns
MailBox synchronous put                      : 924 cycles ,  3080 ns
MailBox synchronous get                      : 590 cycles ,  1966 ns
MailBox asynchronous put                     : 362 cycles ,  1206 ns
MailBox get without context switch           : 532 cycles ,  1773 ns
Timing Measurement  finished
PASS - main

nashif commented 6 years ago

@AdithyaBaglody you added MPU support to this board AFAIK, can you please take a look?

andrewboie commented 6 years ago

Note that the sam_e70_xplained and the frdm_k64f use different MPU parts (ARM vs NXP).

The test currently is running with both CONFIG_USERSPACE and CONFIG_HW_STACK_PROTECTION disabled. Not clear to me why this kills our performance so badly, we are not calling the context switch hooks like configure_mpu_stack_guard(), configure_mpu_mem_domain(), or configure_mpu_user_context().

Is there some kind of inherent performance penalty with simply enabling the ARM MPU hardware?

lpereira commented 6 years ago

Every time there's a privilege level change, an ISB instruction is executed. This flushes the pipeline. Executing this instruction is a recommended practice, to avoid half-executed instructions from being executed in the old privilege level, but there's a performance penalty. I wonder if it's related. Maybe you could comment out those (in userspace.S and swap_helper.S) and benchmark again?

andrewboie commented 6 years ago

Every time there's a privilege level change

This is tests/benchmarks/timing_info. Both user mode and HW stack protection are disabled. It's just the MPU turned on and that's it, AFAICT all the isb calls on context switch are ifdefed around CONFIG_USERSPACE.

MaureenHelm commented 6 years ago

Results on mimxrt1050_evk, which has the ARM MPU, not the NXP MPU:

***** Booting Zephyr OS v1.12.0-rc2-75-g7a18b08 *****
starting test - Time Measurement
Timing Results: Clock Frequency: 600 MHz
Tick overhead                                : 118 cycles ,   196 ns
Thread Creation                              : 189 cycles ,   315 ns
Thread cancel                                :  89 cycles ,   148 ns
Thread abort                                 : 288 cycles ,   480 ns
Thread Suspend                               : 312 cycles ,   520 ns
Thread Resume                                : 286 cycles ,   476 ns
Thread Yield                                 : 213 cycles ,   355 ns
Thread Sleep                                 : 277 cycles ,   461 ns
Heap Malloc                                  : 550 cycles ,   916 ns
Heap Free                                    : 501 cycles ,   835 ns
Semaphore Take with context switch           : 345 cycles ,   575 ns
Semaphore Give with context switch           : 383 cycles ,   638 ns
Semaphore Take without context switch        :  53 cycles ,    88 ns
Semaphore Give without context switch        : 148 cycles ,   246 ns
Mutex lock                                   : 115 cycles ,   191 ns
Mutex unlock                                 : 162 cycles ,   270 ns
Message Queue Put with context switch        : 489 cycles ,   815 ns
Message Queue Put without context switch     : 178 cycles ,   296 ns
Message Queue get with context switch        : 523 cycles ,   871 ns
Message Queue get without context switch     : 196 cycles ,   326 ns
MailBox synchronous put                      : 567 cycles ,   945 ns
MailBox synchronous get                      : 588 cycles ,   980 ns
MailBox asynchronous put                     : 255 cycles ,   425 ns
MailBox get without context switch           : 522 cycles ,   870 ns
Timing Measurement  finished
PASS - main
===================================================================
===================================================================
PROJECT EXECUTION SUCCESSFUL

aurel32 commented 6 years ago

I am able to reproduce the same behavior on a STM32F723 Discovery board. The support for this board is not yet available in master, but the pull request #7284 gives basic support for the STM32F7 SoC, and I added a few patches on top of it (see https://github.com/aurel32/zephyr/tree/stm32f7).

During the review of #7284, @erwango requested to not activate the MPU on this SoC yet due to possible performances issues. Therefore he might have some hints about what happens.

Sorry for late catch, but I'm not sure about activating MPU right now. MPU for Cortex M7 has specific options which might affect performance. Hence I prefer this is activated specific check/study rather than by default.

Here are the result on that board without the MPU:

***** Booting Zephyr OS v1.12.0-rc1-313-g7630f784f *****
starting test - Time Measurement
Timing Results: Clock Frequency: 216 MHz
Tick overhead                                :  94 cycles ,   435 ns
Thread Creation                              : 281 cycles ,  1300 ns
Thread cancel                                : 137 cycles ,   634 ns
Thread abort                                 : 426 cycles ,  1972 ns
Thread Suspend                               : 363 cycles ,  1680 ns
Thread Resume                                : 302 cycles ,  1398 ns
Thread Yield                                 : 231 cycles ,  1069 ns
Thread Sleep                                 : 353 cycles ,  1634 ns
Heap Malloc                                  : 541 cycles ,  2504 ns
Heap Free                                    : 506 cycles ,  2342 ns
Semaphore Take with context switch           : 338 cycles ,  1564 ns
Semaphore Give with context switch           : 429 cycles ,  1986 ns
Semaphore Take without context switch        :  53 cycles ,   245 ns
Semaphore Give without context switch        : 150 cycles ,   694 ns
Mutex lock                                   : 156 cycles ,   722 ns
Mutex unlock                                 : 170 cycles ,   787 ns
Message Queue Put with context switch        : 498 cycles ,  2305 ns
Message Queue Put without context switch     : 194 cycles ,   898 ns
Message Queue get with context switch        : 563 cycles ,  2606 ns
Message Queue get without context switch     : 228 cycles ,  1055 ns
MailBox synchronous put                      : 815 cycles ,  3773 ns
MailBox synchronous get                      : 613 cycles ,  2837 ns
MailBox asynchronous put                     : 341 cycles ,  1578 ns
MailBox get without context switch           : 554 cycles ,  2564 ns
Timing Measurement  finished
PASS - main
===================================================================
===================================================================
PROJECT EXECUTION SUCCESSFUL

And with the MPU:

***** Booting Zephyr OS v1.12.0-rc1-313-g7630f784f *****
starting test - Time Measurement
Timing Results: Clock Frequency: 216 MHz
Tick overhead                                : 264 cycles ,  1222 ns
Thread Creation                              :1773 cycles ,  8208 ns
Thread cancel                                : 437 cycles ,  2023 ns
Thread abort                                 :2665 cycles , 12337 ns
Thread Suspend                               :1817 cycles ,  8412 ns
Thread Resume                                :1783 cycles ,  8254 ns
Thread Yield                                 :1521 cycles ,  7041 ns
Thread Sleep                                 :1607 cycles ,  7439 ns
Heap Malloc                                  :1537 cycles ,  7115 ns
Heap Free                                    :1466 cycles ,  6787 ns
Semaphore Take with context switch           :1870 cycles ,  8657 ns
Semaphore Give with context switch           :2093 cycles ,  9689 ns
Semaphore Take without context switch        : 246 cycles ,  1138 ns
Semaphore Give without context switch        : 543 cycles ,  2513 ns
Mutex lock                                   : 444 cycles ,  2055 ns
Mutex unlock                                 : 604 cycles ,  2796 ns
Message Queue Put with context switch        :2259 cycles , 10458 ns
Message Queue Put without context switch     : 696 cycles ,  3222 ns
Message Queue get with context switch        :2418 cycles , 11194 ns
Message Queue get without context switch     : 724 cycles ,  3351 ns
MailBox synchronous put                      :2459 cycles , 11384 ns
MailBox synchronous get                      :2387 cycles , 11050 ns
MailBox asynchronous put                     : 747 cycles ,  3458 ns
MailBox get without context switch           :1577 cycles ,  7300 ns
Timing Measurement  finished
PASS - main
===================================================================
===================================================================
PROJECT EXECUTION SUCCESSFUL

andyross commented 6 years ago

FWIW, just to kibitz on debugging strategy: someone could zero in on one of the simple cases like "Semaphore Give with context switch". There's really not much code on that path, and given that the effect is HUGE (seriously like a thousand extra cycles) it should be easy enough to spot the big deltas with simple tools like k_cycle_get_32(). So if you want to track, say, _Swap() (seems like a context switch and the attendant PendSV exception is a likely spot), you could set a flag right before the call to _reschedule in the thread about to go to sleep, stuff the current cycle count right before _Swap(), then check the flag in _pend_current_thread() right after _Swap() returns in the thread that got woken up (then clear the flag, of course). That might pollute a benchmark result with the extra code, but is just noise in a thousand-cycle delta.

MaureenHelm commented 6 years ago

During the review of #7284, @erwango requested to not activate the MPU on this SoC yet due to possible performances issues. Therefore he might have some hints about what happens.

Sorry for late catch, but I'm not sure about activating MPU right now. MPU for Cortex M7 has specific options which might affect performance. Hence I prefer this is activated specific check/study rather than by default.

The mimxrt1050_evk I posted above is also a Cortex-M7.

nashif commented 6 years ago

95126d1af67ee34e2a3c58843e5b83f8454a6e23 enabled MPU into this board.

carlescufi commented 6 years ago

Just ran this on a Nordic IC:

CONFIG_* values are taken from the .config file generated. Results are very similar with and without MPU enabled, so no issues here.


CONFIG_CPU_HAS_MPU=y
CONFIG_ARM_CORE_MPU=y
CONFIG_ARM_MPU=y
CONFIG_ARM_MPU_NRF52X=y

***** Booting Zephyr OS v1.12.0-rc2-115-g59bf65f48 *****
starting test - Time Measurement
Timing Results: Clock Frequency: 64 MHz
Context switch                               : 144 cycles ,  2232 ns
Interrupt latency                            :  84 cycles ,  1302 ns
Tick overhead                                : 292 cycles ,  4526 ns
Thread Creation                              : 256 cycles ,  3968 ns
Thread cancel                                : 156 cycles ,  2418 ns
Thread abort                                 : 372 cycles ,  5766 ns
Thread Suspend                               : 400 cycles ,  6200 ns
Thread Resume                                : 404 cycles ,  6262 ns
Thread Yield                                 : 332 cycles ,  5146 ns
Thread Sleep                                 : 396 cycles ,  6138 ns
Heap Malloc                                  : 708 cycles , 10974 ns
Heap Free                                    : 660 cycles , 10230 ns
Semaphore Take with context switch           : 456 cycles ,  7068 ns
Semaphore Give with context switch           : 476 cycles ,  7378 ns
Semaphore Take without context switch        :  48 cycles ,   744 ns
Semaphore Give without context switch        : 160 cycles ,  2480 ns
Mutex lock                                   : 156 cycles ,  2418 ns
Mutex unlock                                 : 204 cycles ,  3162 ns
Message Queue Put with context switch        : 600 cycles ,  9300 ns
Message Queue Put without context switch     : 200 cycles ,  3100 ns
Message Queue get with context switch        : 656 cycles , 10168 ns
Message Queue get without context switch     : 208 cycles ,  3224 ns
MailBox synchronous put                      : 740 cycles , 11470 ns
MailBox synchronous get                      : 732 cycles , 11346 ns
MailBox asynchronous put                     : 356 cycles ,  5518 ns
MailBox get without context switch           : 604 cycles ,  9362 ns
Timing Measurement  finished
PASS - main
===================================================================
===================================================================
PROJECT EXECUTION SUCCESSFUL

----------------------------------------------

CONFIG_CPU_HAS_MPU=y
CONFIG_ARM_CORE_MPU=n
CONFIG_ARM_MPU=n
CONFIG_ARM_MPU_NRF52X=n

***** Booting Zephyr OS v1.12.0-rc2-115-g59bf65f48 *****
starting test - Time Measurement
Timing Results: Clock Frequency: 64 MHz
Context switch                               : 144 cycles ,  2232 ns
Interrupt latency                            :  76 cycles ,  1178 ns
Tick overhead                                : 292 cycles ,  4526 ns
Thread Creation                              : 252 cycles ,  3906 ns
Thread cancel                                : 156 cycles ,  2418 ns
Thread abort                                 : 392 cycles ,  6076 ns
Thread Suspend                               : 388 cycles ,  6014 ns
Thread Resume                                : 388 cycles ,  6014 ns
Thread Yield                                 : 332 cycles ,  5146 ns
Thread Sleep                                 : 388 cycles ,  6014 ns
Heap Malloc                                  : 708 cycles , 10974 ns
Heap Free                                    : 660 cycles , 10230 ns
Semaphore Take with context switch           : 460 cycles ,  7130 ns
Semaphore Give with context switch           : 472 cycles ,  7316 ns
Semaphore Take without context switch        :  48 cycles ,   744 ns
Semaphore Give without context switch        : 152 cycles ,  2356 ns
Mutex lock                                   : 156 cycles ,  2418 ns
Mutex unlock                                 : 204 cycles ,  3162 ns
Message Queue Put with context switch        : 588 cycles ,  9114 ns
Message Queue Put without context switch     : 208 cycles ,  3224 ns
Message Queue get with context switch        : 664 cycles , 10292 ns
Message Queue get without context switch     : 220 cycles ,  3410 ns
MailBox synchronous put                      : 768 cycles , 11904 ns
MailBox synchronous get                      : 724 cycles , 11222 ns
MailBox asynchronous put                     : 360 cycles ,  5580 ns
MailBox get without context switch           : 604 cycles ,  9362 ns
Timing Measurement  finished
PASS - main
===================================================================
===================================================================
PROJECT EXECUTION SUCCESSFUL

andyross commented 6 years ago

Think I got it. The caching suggestion from @aurel32 above was the clue, then a little binary search through MPU led me to see that it was the _enable_region() step that hit the slow path. If I let the MPU initialize otherwise but just didn't set up the default regions, then it would stay fast.

So, what's in those region definitions? Oh, heh. NON_CACHABLE.

No idea if this breaks other hardware, but a proper fix should be constructable from this.

agross-oss commented 6 years ago

If SoCs support cache, then the privileged and unprivileged region definitions should be setting cacheable. The macros and region init code needs to take this into account. Using just macros might not be enough, we might need a #ifdef block based on some Kconfig option to denote cache enabled chips.

aurel32 commented 6 years ago

Thanks a lot @andyross. I confirm your patch also fixes the issue on the STM32F7.

tgorochowik commented 6 years ago

Thank you all! I can confirm that when both the fixes from @aurel32 and @andyross are applied the benchmark runs fast.

Just to note: there is at least one thing that breaks and it is the ethernet driver.

There is a FIXME in the header that says it is to be enabled when caches are enabled: https://github.com/zephyrproject-rtos/zephyr/blob/265f502b8e48ba3299bfbbef7814cbab196af0b3/drivers/ethernet/eth_sam_gmac_priv.h#L29

But it seems not to be enough, apparently something is still improperly cleaned/invalidated.

tgorochowik commented 6 years ago

For the sake of completeness:

The problem is that the cache needs to be cleaned/invalidated on the descriptors too (tx_desc, rx_desc). I've actually started working on that. I think I've got TX working well. There are still some issues with RX, hopefully I will be able to provide a proper patch soon, but I guess that's a topic for another issue.

galak commented 6 years ago

I'm wondering for 1.12 if its better to just revert enabling of MPU on the platform and looking at addressing the cache issues after the release. I'm concerned about how this impacts all other platforms as well as ensuring all cache operations are doing the right thing.

aurel32 commented 6 years ago

Let me summarize the issue. The benchmarks show that the SAM E-70 Xplained platform is slower than other similar platforms. It is slow because the Cortex-M7 caches are not enabled. They are not enabled for the following reasons:

The caches are not enabled as the platform initialization code in soc.c does not call SCB_EnableDCache() and SCB_EnableICache().
Enabling the MPU support disable the caches.
The caches can't be enabled as they break the DMA support in the Ethernet driver (likely some missing cache invalidates).

As this is not a regression from 1.11, I believe the best is to defer the fix of this issue to 1.13.

aurel32 commented 6 years ago

Thank you all! I can confirm that when both the fixes from @aurel32 and @andyross are applied the benchmark runs fast.

Just to note: there is at least one thing that breaks and it is the ethernet driver.

One way to get better performances and a working Ethernet driver might be to enable the I-Cache, but not the D-Cache, so only calling SCB_EnableICache().

andyross commented 6 years ago

Project management suggestion: if we have to pick between making the ethernet driver work[1] and fixing benchmarks on this particular board, we have other devices with working network hardware...

[1] Which should be doable with a separate MPU region, right? Just specify one with no caching attributes on the MMIO area if that's the problem, or adjust the existing RAM region size to not include the spot where it needs to DMA? Or maybe I'm not understanding the architecture.

andyross commented 6 years ago

@tgorochowik, not that anyone really wants me kibitzing on this stuff: but looking at the cache management code and finding the M7 reference via Google: the addresses passed to SCB_Clean/InvalidateDCache_by_addr() should be 32-byte-aligned (presumably the cache lines are 32 bytes), and I don't see the code in eth_sam_gmac.c doing that. So assuming the low bits get ignored, you're potentially leaving the last cache line unflushed. With INVALIDATE that's probably benign, but I'm betting the CLEAN operation on tx is failing to flush the ends of your packets.

nashif commented 6 years ago

lowering to medium, will be fixed after 1.12 is released due to complexity of the caching implementation. This issue was always there, so it is not a regression that justifies a high prio bug.

tgorochowik commented 6 years ago

@andyross I am more than happy to read your input!

As I mentioned earlier I already started implementing it and I did notice the 32B alignment requirement, here is my WIP commit from earlier today: https://github.com/antmicro/zephyr/commit/74f25fab6d9c9f90ec4bd2fb232e2850c03906cc

TX actually does work already. There is still something wrong with RX, I guess something is still not invalidated properly (I know this commit is very dirty for now, and all the cleaning and invalidating in the rx part is an overkill, but I will clean it up once it works).

tgorochowik commented 6 years ago

I think It's done, see #8282.

The funny thing is that it did work earlier, but the network at my office prevented me from flooding it with random packets and convinced me that there is still something wrong with the caches.

Note to self: always use a separate network for flood testing!

zephyrproject-rtos / zephyr

Unsatisfactory kernel benchmark results on SAM E-70 Xplained #8138