xcompact3d / x3d2

https://xcompact3d.github.io/x3d2
BSD 3-Clause "New" or "Revised" License

Shall we have a `host_allocator`? #79

Closed: semi-h closed this issue 3 months ago

semi-h commented 5 months ago

We are capable of running the entire algorithm on CPUs or GPUs, but there are certain operations we need to carry out in host memory. One example is initialisation (as of now it only initialises the velocity fields for a TGV simulation, but it won't be too different for any other initial field).

https://github.com/xcompact3d/x3d2/blob/eaeae17c7a8b034afd294c20f97569e59dbe6241/src/solver.f90#L116-L128

Because this is a one-time operation, it would be wasteful to have dedicated functions for each backend. Therefore, we allocate arrays in host memory, fill them with the initial condition, and simply set the u, v, and w field_ts from these temporary arrays, which get deallocated soon after.

https://github.com/xcompact3d/x3d2/blob/eaeae17c7a8b034afd294c20f97569e59dbe6241/src/solver.f90#L130-L132

This doesn't cause any ongoing memory issue, because the initial field arrays are allocated and deallocated before the simulation starts. The current primitive support for output is a different story, however. The output Fortran arrays are allocated at the beginning and persist until the end, increasing the memory footprint of a simulation running with the OpenMP backend. (With the CUDA backend there are 3 fields living in host memory, but we don't really care, as the limitation is normally on GPU memory, and a few host arrays don't cause any harm.) Currently we just copy the solution fields back into these output arrays at the end of a simulation.

https://github.com/xcompact3d/x3d2/blob/eaeae17c7a8b034afd294c20f97569e59dbe6241/src/solver.f90#L668-L670

This is very primitive and likely to change with proper IO, but the idea will be the same: if we allocate Fortran arrays to safely copy data back to host memory, they'll increase our memory footprint on the OpenMP backend.

A potential solution

A solution I have in mind involves a `host_allocator`. With the OpenMP backend it simply points to the default allocator of the simulation; with the CUDA backend it points to an instance of `allocator_t`, which is distinct from the `cuda_allocator_t` instance that the default allocator on the CUDA backend points to. We can set this up in the main code somewhere around here:

https://github.com/xcompact3d/x3d2/blob/eaeae17c7a8b034afd294c20f97569e59dbe6241/src/xcompact.f90#L117-L134

We can then pass this `host_allocator` to the solver, where we can do:

u_io => solver%host_allocator%get_block(DIR_C)

! we're certain %data exists because u_io is obtained from host_allocator
do k = 1, nz
   do j = 1, ny
      do i = 1, nx
         u_io%data(i, j, k) = sin(?)*cos(?) ! the actual IC expression goes here
      end do
   end do
end do
call solver%backend%set_field_data(solver%u, u_io%data, u_io%dir)

call solver%host_allocator%release_block(u_io)

When we're running the OpenMP backend, `host_allocator` and the default allocator are the same instance, so they share the same memory pool. Unless we request blocks near the peak memory use (which occurs in transeq), the pool will simply hand out an available memory block for us to operate on, and once released it becomes available again for later use.
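For concreteness, the backend-dependent wiring described above might look roughly like the sketch below. This is only an illustration: the variable names (`allocator`, `host_allocator`, `omp_allocator`, `cuda_allocator`, `backend_name`) and the exact setup in xcompact.f90 are assumptions, not the actual code.

```fortran
! sketch only: names and structure are assumptions, not the real xcompact.f90
if (backend_name == 'cuda') then
   allocator => cuda_allocator       ! default allocator manages device memory
   host_allocator => omp_allocator   ! separate allocator_t managing host memory
else
   allocator => omp_allocator        ! OpenMP backend: a single pool in host memory
   host_allocator => allocator       ! host_allocator aliases the default allocator
end if
```

On the OpenMP backend the two pointers alias the same pool, which is what lets IC setting and IO reuse blocks from the main simulation pool rather than growing the footprint.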

This will definitely be useful for setting ICs of the domain, and maybe for some test programs (https://github.com/xcompact3d/x3d2/pull/68#discussion_r1530533418_), but it's probably not worth it unless we can also make use of it in the proper IO we're planning to have soon. So please share your thoughts, @pbartholomew08 @JamieJQuinn, in particular considering ADIOS and its requirements.

JamieJQuinn commented 5 months ago

Just to clarify, the problem is that there are three static arrays intended for output that exist for the lifetime of the simulation, but don't need to? So your solution is to allocate and deallocate these arrays (using the host-side allocator) whenever they're required, e.g. here for IC setting, or for IO.

At the moment, my initial drafts of the IO system do this, but they use their own local copy of the host allocator. I think there is value in having the IO system access a simulation-wide instance of the allocator to share memory with the rest of the simulation. Unless we do something very clever, the IO system will require some spare memory to convert DIR_* blocks to Cartesian blocks before writing to file.
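For illustration, the staging step I mean could be sketched as follows, using the `get_block`/`release_block` calls from the proposal above. Note that `reorder_to_cartesian` and `write_field` are hypothetical helper names standing in for whatever the conversion and IO routines end up being:

```fortran
! sketch: staging a field through a shared host block before writing
! reorder_to_cartesian and write_field are assumed names, not existing routines
u_cart => solver%host_allocator%get_block(DIR_C)     ! spare Cartesian host block
call reorder_to_cartesian(u_cart%data, solver%u)     ! DIR_* layout -> Cartesian
call write_field(io_handle, 'u', u_cart%data)        ! hand Cartesian data to the IO layer
call solver%host_allocator%release_block(u_cart)     ! return the block to the shared pool
```

With a simulation-wide `host_allocator`, this spare block would come from the same pool the rest of the simulation uses, rather than from a private allocation held by the IO system.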