xcompact3d / x3d2

https://xcompact3d.github.io/x3d2
BSD 3-Clause "New" or "Revised" License

Shall we have a `host_allocator`? #79

Closed: semi-h closed this issue 3 months ago

semi-h commented 5 months ago

We are capable of running the entire algorithm on CPUs or GPUs, but there are certain operations we need to carry out in host memory. One example is initialisation (as of now it only initialises the velocity fields for a TGV simulation, but it won't be too different for any other initial field).

https://github.com/xcompact3d/x3d2/blob/eaeae17c7a8b034afd294c20f97569e59dbe6241/src/solver.f90#L116-L128

Because this is a one-time operation, it would be wasteful to have dedicated functions for each backend. Therefore, we allocate arrays in host memory, fill them with the initial condition, and simply set the u, v, and w field_ts from these temporary arrays, which get deallocated soon after.

https://github.com/xcompact3d/x3d2/blob/eaeae17c7a8b034afd294c20f97569e59dbe6241/src/solver.f90#L130-L132

This doesn't cause any ongoing memory issue, because the initial field arrays are allocated and deallocated before the simulation starts. The current primitive support for output is a different story, however. The output Fortran arrays are allocated at the beginning and persist until the end, increasing the memory footprint of a simulation running with the OpenMP backend. (With the CUDA backend there are 3 fields living in host memory, but we don't really care, as the limitation is normally on GPU memory, and a few host arrays don't cause any harm.) Currently we just copy the solution fields back into these output arrays at the end of a simulation.

https://github.com/xcompact3d/x3d2/blob/eaeae17c7a8b034afd294c20f97569e59dbe6241/src/solver.f90#L668-L670

This is very primitive and likely to change with proper IO, but the idea will be the same: if we allocate Fortran arrays to safely copy data back to host memory, they'll increase our memory footprint on the OpenMP backend.

A potential solution

A solution I have in mind involves a `host_allocator`. With the OpenMP backend it simply points to the default allocator of the simulation; with the CUDA backend it points to an instance of `allocator_t`, which is distinct from the `cuda_allocator_t` instance that the default allocator on the CUDA backend points to. We can set this up in the main code somewhere around here:

https://github.com/xcompact3d/x3d2/blob/eaeae17c7a8b034afd294c20f97569e59dbe6241/src/xcompact.f90#L117-L134

We can then pass this `host_allocator` to the solver, where we can do:

u_io => solver%host_allocator%get_block(DIR_C)

! we're certain %data exists because u_io is obtained from host_allocator
do k = 1, nz
   do j = 1, ny
      do i = 1, nx
         u_io%data(i, j, k) = sin(?)*cos(?) ! the actual IC expression goes here
      end do
   end do
end do
call solver%backend%set_field_data(solver%u, u_io%data, u_io%dir)

call solver%host_allocator%release_block(u_io)

When we're running the OpenMP backend, `host_allocator` and the default allocator are the same instance, so they share the same memory pool. Unless we request blocks near the peak memory use (which occurs in transeq), the pool will simply hand out an available memory block for us to operate on, and once released it becomes available again for later use.
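For concreteness, the backend-dependent wiring described above might look roughly like the sketch below. This is only an illustration: the variable names (`allocator`, `host_allocator`, `omp_allocator`, `cuda_allocator`, `backend_name`) and the exact setup in xcompact.f90 are assumptions, not the actual code.

```fortran
! sketch only: names and structure are assumptions, not the real xcompact.f90
if (backend_name == 'cuda') then
   allocator => cuda_allocator       ! default allocator manages device memory
   host_allocator => omp_allocator   ! separate allocator_t managing host memory
else
   allocator => omp_allocator        ! OpenMP backend: a single pool in host memory
   host_allocator => allocator       ! host_allocator aliases the default allocator
end if
```

On the OpenMP backend the two pointers alias the same pool, which is what lets IC setting and IO reuse blocks from the main simulation pool rather than growing the footprint.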

This will definitely be useful for setting ICs of the domain, and maybe for some test programs (https://github.com/xcompact3d/x3d2/pull/68#discussion_r1530533418_), but it's probably not worth it unless we can also make use of it in the proper IO we're planning to have soon. So please share your thoughts, @pbartholomew08 @JamieJQuinn, in particular considering ADIOS and its requirements.

JamieJQuinn commented 5 months ago

Just to clarify, the problem is that there are three static arrays intended for output that exist for the lifetime of the simulation, but don't need to? So your solution is to allocate and deallocate these arrays (using the host-side allocator) whenever they're required, e.g. here for IC setting, or for IO.

At the moment, my initial drafts of the IO system do this, but they use their own local copy of the host allocator. I think there is value in having the IO system access a simulation-wide instance of the allocator to share memory with the rest of the simulation. Unless we do something very clever, the IO system will require some spare memory to convert DIR_* blocks to Cartesian blocks before writing to file.
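For illustration, the staging step I mean could be sketched as follows, using the `get_block`/`release_block` calls from the proposal above. Note that `reorder_to_cartesian` and `write_field` are hypothetical helper names standing in for whatever the conversion and IO routines end up being:

```fortran
! sketch: staging a field through a shared host block before writing
! reorder_to_cartesian and write_field are assumed names, not existing routines
u_cart => solver%host_allocator%get_block(DIR_C)     ! spare Cartesian host block
call reorder_to_cartesian(u_cart%data, solver%u)     ! DIR_* layout -> Cartesian
call write_field(io_handle, 'u', u_cart%data)        ! hand Cartesian data to the IO layer
call solver%host_allocator%release_block(u_cart)     ! return the block to the shared pool
```

With a simulation-wide `host_allocator`, this spare block would come from the same pool the rest of the simulation uses, rather than from a private allocation held by the IO system.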