[RFC] DMA and Data cache coherency on arm M7 devices

FRASTM commented 3 years ago

Introduction

On STM32 MCUs based on the Arm Cortex-M7 core, when the L1 cache is enabled, the system needs to "ensure data coherency between the core and the main memory when dealing with shared data", especially with DMA transfers.

This is detailed in [AN4839: Level 1 cache on STM32F7 Series and STM32H7 Series](https://www.st.com/content/ccc/resource/technical/document/application_note/group0/08/dd/25/9c/4d/83/43/12/DM00272913/files/DM00272913.pdf/jcr:content/translations/en.DM00272913.pdf) and in similar documents from other manufacturers.

Problem description

When using the DMA for peripheral transfers such as SPI or UART, or for other clients, on M7 devices (typically stm32f7xx or stm32h7xx), it is recommended:

If the software uses cacheable memory regions for the DMA source or destination buffers, it must trigger a cache clean before starting a DMA operation to ensure that all the data is committed to the memory subsystem. After the DMA transfer completes, when reading data from the peripheral, the software must perform a cache invalidate before reading the DMA-updated memory region.
Always better to use non-cacheable regions for DMA buffers. The software can use the MPU to set up a non-cacheable memory block to use as a shared memory between the CPU and DMA.
Do not enable cache for the memory that is being used extensively for a DMA operation.
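
To make the first recommendation concrete, here is a minimal host-side sketch. It is a toy model, not driver code: a single write-back cache line sits in front of one word of main memory, and every name in it (`toy_cache`, `cpu_write`, `dma_read`, ...) is hypothetical. On real Cortex-M7 hardware the clean and invalidate steps would typically go through the CMSIS `SCB_CleanDCache_by_Addr()` and `SCB_InvalidateDCache_by_Addr()` helpers instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: one write-back cache line in front of one memory word.
 * All names are hypothetical and exist purely for illustration. */
struct toy_cache {
    uint32_t mem;   /* main memory: the only copy the DMA can see */
    uint32_t line;  /* cached copy: what the CPU sees while valid */
    bool valid;
    bool dirty;
};

/* CPU store with write-back policy: only the cache line is updated. */
static void cpu_write(struct toy_cache *c, uint32_t v)
{
    c->line = v;
    c->valid = true;
    c->dirty = true;
}

/* CPU load: a hit returns the cached copy, a miss refills from memory. */
static uint32_t cpu_read(struct toy_cache *c)
{
    if (!c->valid) {
        c->line = c->mem;
        c->valid = true;
        c->dirty = false;
    }
    return c->line;
}

/* Cache clean: write a dirty line back to main memory (needed before Tx). */
static void cache_clean(struct toy_cache *c)
{
    if (c->valid && c->dirty) {
        c->mem = c->line;
        c->dirty = false;
    }
}

/* Cache invalidate: drop the cached copy (needed after Rx). */
static void cache_invalidate(struct toy_cache *c)
{
    c->valid = false;
    c->dirty = false;
}

/* DMA accesses bypass the cache entirely. */
static uint32_t dma_read(const struct toy_cache *c) { return c->mem; }
static void dma_write(struct toy_cache *c, uint32_t v) { c->mem = v; }
```

In this model, a `cpu_write()` followed directly by `dma_read()` returns the old memory content until `cache_clean()` runs, and a `dma_write()` followed by `cpu_read()` returns the old cached value until `cache_invalidate()` runs. Those are the two failure modes the recommendation guards against.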

Proposed change

item 1: control the cache coherency on dma buffers

DMA buffers are aligned on the cache line size. User-application buffers are copied (memcpy) to/from those DMA buffers, with padding and alignment.

On each DMA transfer, cache maintenance is required: a flush (clean) before Tx and an invalidate after Rx.
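
A possible shape for such an aligned, padded DMA buffer is sketched below. This is an illustration only, assuming a 32-byte D-cache line; `CACHE_LINE_SIZE`, `DMA_BOUNCE_MAX`, `dma_bounce` and `dma_bounce_fill()` are hypothetical names, not existing Zephyr symbols.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 32-byte D-cache line, as on Cortex-M7. */
#define CACHE_LINE_SIZE 32u

/* Round n up to the next multiple of the cache line size. */
#define CACHE_LINE_ROUND_UP(n) \
    (((n) + (CACHE_LINE_SIZE - 1u)) & ~(size_t)(CACHE_LINE_SIZE - 1u))

/* Hypothetical largest payload this bounce buffer accepts. */
#define DMA_BOUNCE_MAX 100u

/* The buffer starts on a line boundary and spans whole lines only, so a
 * clean or invalidate on it never touches unrelated neighbouring data. */
static _Alignas(CACHE_LINE_SIZE) uint8_t dma_bounce[CACHE_LINE_ROUND_UP(DMA_BOUNCE_MAX)];

/* Copy a user Tx buffer into the bounce buffer and zero the padding.
 * Returns the padded length to clean before the transfer, or 0 if the
 * request is empty or does not fit. */
static size_t dma_bounce_fill(const void *user_buf, size_t len)
{
    if (len == 0u || len > DMA_BOUNCE_MAX) {
        return 0u;
    }

    size_t padded = CACHE_LINE_ROUND_UP(len);

    memcpy(dma_bounce, user_buf, len);
    memset(dma_bounce + len, 0, padded - len);
    return padded;
}
```

The extra memcpy on every transfer is the cost of this approach, which matters most for small, frequent transfers.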

1.1 each client of the DMA has to control the cache coherency on its own DMA operations Tx, Rx --> this has to be handled by each driver that uses DMA transfers (typically spi, i2s, uart, etc.), with similar code adapted to each. Example for the spi client in the PR "[RFC] Initial support for cache handling when doing SPI/DMA on STM32F7" #27911

1.2 the DMA driver controls the cache coherency on its own DMA operations Tx, Rx --> not validated yet, but the PR above could help

item 2: Always better to use non-cacheable regions for DMA buffers

These buffers must have RW access. They are allocated in the NoCache RAM area. They might be for DMA use only, or for the DMA client as well.

2.1 We map a special NONCACHE memory area for the DMA buffers (Tx and Rx). Some memcpy operations are required to exchange data between the buffers initially allocated by the client and that special non-cached DMA buffer. --> We could see a significant overhead due to memcpy, especially when the buffers are small and frequently transferred.

2.2 We map a special NONCACHE memory area for the DMA-client buffers (Tx and Rx). These buffers are allocated in the user area and used for the exchange between the client and the DMA. The user application must allocate its buffers in the user NoCache RAM area. See PR "run test on stm32F7 with CONFIG_UART_ASYNC_API and DMA" #32833. --> Mapping user buffers in this NoCache area is a significant constraint; otherwise memcpy operations are also needed.
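
For option 2.2, the application-side constraint could look roughly like the sketch below. It assumes a Zephyr build with CONFIG_NOCACHE_MEMORY enabled, where the `__nocache` attribute places a variable in the non-cacheable region. The buffer names and size are hypothetical, and the empty fallback definition is only there so the fragment also compiles off-target.

```c
#include <stdint.h>

/* Assumption: on target, __nocache is provided by the Zephyr headers when
 * CONFIG_NOCACHE_MEMORY is enabled. This empty fallback is NOT part of
 * Zephyr; it only lets the sketch build on a host. */
#ifndef __nocache
#define __nocache
#endif

/* Hypothetical buffer size for a UART client using DMA. */
#define UART_DMA_BUF_SIZE 64

/* Buffers handed directly to the DMA. Because they live in non-cacheable
 * memory, no clean/invalidate and no intermediate memcpy is needed. */
static __nocache uint8_t uart_rx_buf[UART_DMA_BUF_SIZE];
static __nocache uint8_t uart_tx_buf[UART_DMA_BUF_SIZE];
```

Every buffer the application passes to a DMA-backed API would need this placement, which is the constraint noted above.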

item 3: "Do not enable cache"

We simply disable the data cache when DMA is enabled on STM32 M7-based MCUs.

3.1 statically, based on the DT dma. Example in the PR "soc: stm32f7 remove cache memory with dma transfer" #35165

This is the simplest solution to avoid data cache coherency problems, and it is functional. In soc/arm/st_stm32/stm32f7/soc.c or soc/arm/st_stm32/stm32h7/soc.c, do not enable the D-cache when DMA is used or when explicitly requested through CONFIG_NOCACHE_MEMORY:

+#ifndef CONFIG_NOCACHE_MEMORY
    if (!(SCB->CCR & SCB_CCR_DC_Msk)) {
        SCB_EnableDCache();
    }
+#endif

3.2 dynamically: disable the cache when starting/initialising the DMA device

Concerns and Unresolved Questions

impact of the memcopy operations in terms of performance
impact on the system of disabling the cache of stm32f7 or h7 MCUs as soon as the DMA is enabled
constraint on the user application to allocate their buffers in noncached ram area

FRASTM commented 3 years ago

As also stated by Arm for the M7, under "Multiple Bus Masters": where we have multiple bus masters (e.g. a DMA controller), runtime cache maintenance operations may be required. For example, sharing a cacheable memory location between the processor and a DMA Controller for buffering:

If the memory location updated by the Cortex-M7 processor, is being used as a write-buffer and has to be accessed by another bus master, a cache clean is needed to ensure the other bus master can see the new data. If the memory location is acting as a read-buffer and updated by a different bus master, the Cortex-M7 processor must do a cache invalidate so that next time it reads the memory location, it will fetch the information from the main memory system. An alternative (simpler) approach is to mark a region of the RAM used by the multiple-bus masters (i.e. the buffer) as sharable, ensuring the memory is never cached (but obviously with a performance hit).

erwango commented 3 years ago

@nandojve, @MaureenHelm, @ioannisg, @dcpleung, @tbursztyka, @galak, @aurel32

I think this point deserves some alignment so we can offer a workable solution for M7 users that'd like to use DMA & USERSPACE.

lmajewski commented 2 years ago

@FRASTM - I'm wondering if the

+#ifndef CONFIG_NOCACHE_MEMORY
    if (!(SCB->CCR & SCB_CCR_DC_Msk)) {
        SCB_EnableDCache();
    }
+#endif

#ifndef CONFIG_NOCACHE_MEMORY guard around SCB_EnableDCache() could be removed, relying solely on allocating buffers in the no-cache MPU regions (defined via the __nocache attribute)?

Disabling the D-cache has severe performance consequences and causes problems in other IP blocks (recently the SDMMC IP block required HW flow control to be enabled to keep working after enabling CONFIG_NOCACHE_MEMORY).

lmajewski commented 2 years ago

@FRASTM - Do you plan to do some more work on providing more coarse management of DCache on STM32H7 ?

dcpleung commented 2 years ago

@teburd is working on DMA so he might have some thoughts.

lmajewski commented 2 years ago

@teburd - I would be very grateful if you could share the current status of the DMA code development. I'm particularly interested in the aspect of D-Cache management. In short I would like to use DMA's CONFIG_NOCACHE_MEMORY with D-Cache enabled as well.

FRASTM commented 2 years ago

@FRASTM - Do you plan to do some more work on providing more coarse management of DCache on STM32H7 ?

not in the near future

teburd commented 2 years ago

@lmajewski To be frank the use of CONFIG_NOCACHE_MEMORY is a vendor specific issue, as each DMA driver and hardware peripheral is unique. So does the driver do the work of invalidating/flushing caches when needed? Each vendor's driver is likely to have a different answer at the moment.

There's nothing at the API level (dma.h) that really enforces the usage of a particular cache scheme.

There was some NXP work recently that exposed a few of those issues as well, and it might be an interesting read: https://github.com/zephyrproject-rtos/zephyr/pull/42750

NickolasLapp commented 2 years ago

For a little context, at least in my experience on NXP devices, the Zephyr DMA driver API as currently written is not easily compatible with having the DMA driver manage cache coherency issues, due to platform-specific cache requirements (like having DMA buffers aligned on cache line size). I spent some time trying to follow 1.2 above:

1.2 the DMA driver controls the cache coherency on its own DMA operations Tx, Rx

which would have had the NXP DMA driver clean the relevant cache lines before starting a DMA transfer, and invalidate the relevant cache lines after a DMA receive.

The issue I ran into with doing this generically is that there's no requirement or even mention of alignment for the buffers passed to DMA transfers. Because cache clean and invalidate operations operate on a full cache line (32 bytes for CM7), problems can easily arise if an unsuspecting user calls into the dma tx/rx functions with a buffer that is not aligned to the cache line size, or whose size is not a multiple of the cache line size.
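
The hazard can be made concrete with a small sketch. This is only an illustration under the 32-byte line size mentioned above; `dma_buf_is_line_aligned()`, `cache_lines_touched()` and `struct line_span` are hypothetical helpers, not part of the Zephyr DMA API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte D-cache line (Cortex-M7). */
#define CACHE_LINE_SIZE 32u

/* Hypothetical check a driver could apply to a caller-supplied buffer:
 * true only if the buffer starts on a line boundary and covers whole lines. */
static bool dma_buf_is_line_aligned(uintptr_t addr, size_t size)
{
    return size != 0u &&
           (addr % CACHE_LINE_SIZE) == 0u &&
           (size % CACHE_LINE_SIZE) == 0u;
}

/* Address range actually affected by a by-address clean or invalidate. */
struct line_span {
    uintptr_t start;
    size_t size;
};

/* Widen [addr, addr + size) to the cache lines it overlaps. Anything else
 * living in those lines is cleaned or invalidated along with the buffer. */
static struct line_span cache_lines_touched(uintptr_t addr, size_t size)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_SIZE - 1u);
    uintptr_t start = addr & mask;
    uintptr_t end = (addr + size + CACHE_LINE_SIZE - 1u) & mask;
    struct line_span span = { start, (size_t)(end - start) };

    return span;
}
```

For example, an 8-byte buffer at 0x2000001C straddles two lines, so invalidating it would affect the 64 bytes starting at 0x20000000 and could discard pending CPU writes to unrelated variables in that range.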

For the NXP driver, I still am not positive what the proper solution here is. For our application, we've ended up just doing:

item 2: Always better to use non-cacheable regions for DMA buffers

and allocating our DMA buffers into non-cacheable memory, but that doesn't really resolve the problem for the generic DMA API which probably should handle both cacheable and non-cacheable memory.

teburd commented 2 years ago

That's actually pretty insightful @NickolasLapp and something I need to think on

lmajewski commented 2 years ago

For the NXP driver, I still am not positive what the proper solution here is. For our application, we've ended up just doing:

item 2: Always better to use non-cacheable regions for DMA buffers

and allocating our DMA buffers into non-cacheable memory, but that doesn't really resolve the problem for the generic DMA API which probably should handle both cacheable and non-cacheable memory.

But the above approach is far better than disabling the D-Cache, as is done now for STM32H7.
