xcompact3d / x3d2

https://xcompact3d.github.io/x3d2
BSD 3-Clause "New" or "Revised" License

Use GPU-GPU transfers directly into ghost buffer space in fields #28

Open · JamieJQuinn opened this issue 7 months ago

JamieJQuinn commented 7 months ago

Currently, ghost cells are stored in a separate buffer from the main field. This leads to complexity in subroutines like der_univ_dist in kernels_dist.f90. We should test:

  1. storing the ghost cell buffer directly inside the field array, eliminating the separate boundary calculations (a layout sketch follows this list)
  2. using CUDA-aware MPI to transfer this buffer region directly between field arrays (currently the buffer is copied into a separate array still on the GPU, then transferred)
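
A minimal sketch of the layout from item 1, assuming a (SZ, n + 2*n_halo, n_groups) field shape; the names SZ, n_groups and n_halo are illustrative here, not the allocator's actual interface. The ghost planes sit at either end of the distributed dimension, inside the field itself:

```fortran
program ghost_in_field
   implicit none
   ! Illustrative extents; in practice these come from the allocator/decomposition
   integer, parameter :: SZ = 32, n = 64, n_groups = 8, n_halo = 4
   real :: u(SZ, n + 2*n_halo, n_groups)

   u = 0.0
   ! Ghost planes received from neighbouring ranks:
   !   u(:, 1:n_halo, :)                <- from the previous rank
   !   u(:, n+n_halo+1:n+2*n_halo, :)   <- from the next rank
   ! Interior planes owned by this rank:
   !   u(:, n_halo+1:n+n_halo, :)       i.e. u(:, 5:n+4, :) when n_halo = 4
   print *, 'interior spans planes', n_halo + 1, 'to', n + n_halo
end program ghost_in_field
```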

At first glance, we definitely need to:

  1. change allocator to output arrays that include the ghost cells
  2. update transpose functions to understand new memory layout
  3. update the send/recv ghost cell functions to send/recv directly into the field arrays (see the exchange sketch after this list)
  4. update kernels which use ghost cells
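
To illustrate item 3, here is a hedged sketch of what sending/receiving directly into the field array could look like; the routine name, arguments and neighbour handling are assumptions, not the existing send/recv interface. With a CUDA-aware MPI library the same call accepts device-resident arrays, so the intermediate GPU staging buffer disappears:

```fortran
subroutine exchange_halos(u, n, n_halo, prev_rank, next_rank, comm)
   use mpi_f08
   implicit none
   real, intent(inout) :: u(:, :, :)   ! (SZ, n + 2*n_halo, n_groups)
   integer, intent(in) :: n, n_halo, prev_rank, next_rank
   type(MPI_Comm), intent(in) :: comm
   integer :: n_vals

   n_vals = size(u, 1)*n_halo*size(u, 3)

   ! Send the last n_halo interior planes to the next rank and receive the
   ! previous rank's planes straight into the leading ghost region; the
   ! opposite direction is symmetric. These sections are not contiguous, so
   ! a blocking call may go through a compiler temporary; a committed
   ! strided MPI datatype would avoid that entirely.
   call MPI_Sendrecv(u(:, n+1:n+n_halo, :), n_vals, MPI_REAL, next_rank, 0, &
                     u(:, 1:n_halo, :), n_vals, MPI_REAL, prev_rank, 0, &
                     comm, MPI_STATUS_IGNORE)
end subroutine exchange_halos
```
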
semi-h commented 7 months ago
  1. This will simplify the distributed tridiagonal solver implementations in all backends. The big chunk that deals with the first 4 entries in the domain will shrink to roughly a quarter of its current size, with a simple loop around it (a rough sketch follows this list). We also won't need any buffer arrays, or to pass them into the distributed kernels, which simplifies things further. On the performance side, though, this probably won't have an impact. Beyond the distributed kernels (and the future Thomas kernels), it will also affect other parts of the codebase: for example, the non-ghost region, i.e. the part of the array that actually belongs to the current rank, will look like u(:, 5:n+4, :). We'll need to investigate this further to get a better idea.

  2. I think we'll use MPI's strided vector datatypes, as explained in [1]. This is certainly supported on CPUs, so I had a quick look at whether CUDA-aware MPI supports it too; luckily this is something people have looked into [2,3]. My understanding from [4], however, is that the MPI library handles this by copying the region described by the strided vector type into a buffer array and then initiating the communication, and [4] states that a bad implementation may require as much space as the large array the strided vector lives in. A good MPI library can still make performance a bit better, since we'd be calling one function instead of two separate ones (a minimal datatype sketch also follows this list).
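
On point 1, here is the rough sketch promised above of how the boundary special-casing could collapse into one loop once the halo lives inside the field array. The stencil, coefficients and names are purely illustrative and stand in for the real der_univ_dist logic:

```fortran
pure subroutine der_sketch(du, u, n, n_halo, coeffs)
   ! Illustrative 5-point stencil only, not the actual distributed kernel.
   ! Because u carries n_halo planes on each side, i = 1 and i = n need no
   ! separate buffer-based code path.
   implicit none
   real, intent(out) :: du(:, :, :)
   real, intent(in) :: u(:, :, :)       ! (SZ, n + 2*n_halo, n_groups)
   integer, intent(in) :: n, n_halo
   real, intent(in) :: coeffs(-2:2)
   integer :: i, j, k

   do k = 1, size(u, 3)
      do i = 1, n
         do j = 1, size(u, 1)
            ! i + n_halo maps the interior index onto the halo-padded array
            du(j, i, k) = coeffs(-2)*u(j, i+n_halo-2, k) &
                        + coeffs(-1)*u(j, i+n_halo-1, k) &
                        + coeffs(0)*u(j, i+n_halo, k) &
                        + coeffs(1)*u(j, i+n_halo+1, k) &
                        + coeffs(2)*u(j, i+n_halo+2, k)
         end do
      end do
   end do
end subroutine der_sketch
```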
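
And on point 2, a minimal sketch of describing the leading ghost slab with MPI_Type_vector, assuming the same (SZ, n + 2*n_halo, n_groups) layout, so a single send/recv can operate directly on the field array:

```fortran
subroutine make_halo_type(SZ, n, n_halo, n_groups, halo_type)
   ! The leading ghost slab u(:, 1:n_halo, :) is n_groups blocks of
   ! SZ*n_halo contiguous reals, separated by a stride of SZ*(n + 2*n_halo)
   ! reals (one full padded plane set).
   use mpi_f08
   implicit none
   integer, intent(in) :: SZ, n, n_halo, n_groups
   type(MPI_Datatype), intent(out) :: halo_type

   call MPI_Type_vector(n_groups, SZ*n_halo, SZ*(n + 2*n_halo), &
                        MPI_REAL, halo_type)
   call MPI_Type_commit(halo_type)

   ! Usage: count = 1, since the datatype covers the whole slab, e.g.
   !   call MPI_Recv(u(1, 1, 1), 1, halo_type, prev_rank, tag, comm, status)
   ! The same type describes the trailing ghost region when the call starts
   ! at u(1, n + n_halo + 1, 1).
end subroutine make_halo_type
```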

It's worth looking into this in detail. Please share your thoughts, especially if you have experience with strided data in MPI @Nanoseb @pbartholomew08 @mathrack @rfj82982 @slaizet. @rfj82982, strided MPI send/recv support could be a good idea for 2DECOMP&FFT as well, what do you think?

[1] https://www.dcs.ed.ac.uk/home/trollius/www.osc.edu/Lam/mpi/mpi_datatypes.html
[2] https://icl.utk.edu/files/publications/2016/icl-utk-877-2016.pdf
[3] https://web.cels.anl.gov/~thakur/papers/jenkins_cluster12.pdf
[4] https://carlpearson.net/pdf/20210420_pearson_phd.pdf