wrf-model / WRF

The official repository for the Weather Research and Forecasting (WRF) model

Breakout deallocation calls into simpler smaller files #2070

Open islas opened 3 months ago

islas commented 3 months ago

TYPE: enhancement

KEYWORDS: intel, compilation, llvm, memory

SOURCE: internal

DESCRIPTION OF CHANGES: Problem: The Intel oneAPI compilers (and others, such as nvhpc) struggle with some of the larger (15k+ lines of code) files within WRF. Compiling these files demands far more memory than the average user has outside a resource-rich environment. This often limits compilation to a single thread, if it succeeds at all, or forces it onto a dedicated machine with enough memory. A user with neither option cannot build these configurations at all.
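For context, a minimal sketch of the kind of repetitive, registry-generated pattern that makes deallocs.inc so large (this is an illustration, not a verbatim excerpt; the stand-in type and field name are hypothetical):

```fortran
! Illustration only, not a verbatim WRF excerpt: deallocs.inc repeats a
! guarded-deallocate pattern like this for thousands of registry fields,
! all inside the single module_domain compilation unit.
program dealloc_pattern
  implicit none
  ! Stand-in for WRF's domain derived type; the real type carries
  ! thousands of registry-generated pointer fields.
  type :: domain
    real, pointer :: u_2(:,:,:) => null()
  end type domain
  type(domain) :: grid
  integer :: ierr

  allocate( grid%u_2(10,10,10) )

  ! One block like this per field; stacked 15k+ lines deep in one
  ! compilation unit, this is what drives the compiler memory spike.
  if ( associated( grid%u_2 ) ) then
    deallocate( grid%u_2, stat=ierr )
    if ( ierr /= 0 ) stop 'Failed to deallocate u_2.'
    nullify( grid%u_2 )
  end if
  ! ... and so on for every other state field ...
end program dealloc_pattern
```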

Solution: This PR focuses on the deallocs.inc sections of code used in module_domain, reducing the include size to manageable levels. The include is instead broken out into many smaller files containing external subroutines. The files are fully generated source code from the registry, and the calls to the subroutines are generated as well. This also makes it relatively easy, from a source-code perspective, to change the number of files generated; build rules would need to be modified accordingly, as seen in these changes.
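A minimal sketch of the breakout described above, using hypothetical names (deallocs_1, deallocs_2, demo_domain_type); the real file count, names, and contents come from the registry generator:

```fortran
! Illustration only: the generator emits N small files, each holding one
! external subroutine that deallocates a slice of the state fields, and
! it also emits the matching call sequence for module_domain.

module demo_domain_type            ! stand-in for WRF's domain type module
  implicit none
  type :: domain
    real, pointer :: u_2(:,:,:) => null()
    real, pointer :: v_2(:,:,:) => null()
  end type domain
end module demo_domain_type

subroutine deallocs_1( grid )      ! would live in generated file 1
  use demo_domain_type, only: domain
  implicit none
  type(domain), intent(inout) :: grid
  integer :: ierr
  if ( associated( grid%u_2 ) ) then
    deallocate( grid%u_2, stat=ierr )
    if ( ierr /= 0 ) stop 'Failed to deallocate u_2.'
    nullify( grid%u_2 )
  end if
end subroutine deallocs_1

subroutine deallocs_2( grid )      ! would live in generated file 2
  use demo_domain_type, only: domain
  implicit none
  type(domain), intent(inout) :: grid
  integer :: ierr
  if ( associated( grid%v_2 ) ) then
    deallocate( grid%v_2, stat=ierr )
    if ( ierr /= 0 ) stop 'Failed to deallocate v_2.'
    nullify( grid%v_2 )
  end if
end subroutine deallocs_2

program demo                       ! generated calls replace the old include
  use demo_domain_type, only: domain
  implicit none
  type(domain) :: grid
  allocate( grid%u_2(10,10,10), grid%v_2(10,10,10) )
  call deallocs_1( grid )          ! each subroutine compiles independently
  call deallocs_2( grid )
end program demo
```

Because each generated file is an independent compilation unit, the compiler's peak memory is bounded by the largest slice rather than by the whole include, and the slices can be compiled in parallel.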

TESTS CONDUCTED: Attached to this PR are plots of the respective effects of these changes. Changes were tested with the Intel and GCC compilers, but only Intel memory usage is shown, since that compiler exhibits the memory problem most severely.

islas commented 3 months ago

Highlighted is the region of compilation where memory usage spikes, which this PR addresses (module_domain), before these changes take place: [plot: pre_module_domain]

This usage then drops when using this PR's edits: [plot: post_module_domain_dealloc_split]

Zooming in, we can now see that the deallocs_* routines comprise a number of smaller compilation units: [plot: post_module_domain_dealloc_split_zoom]

weiwangncar commented 3 months ago

@islas This is so cool! Thanks for working on this! Does this affect compile time in any way?

islas commented 3 months ago

It only affects compile times as the number of threads goes up. For a typical compilation with -j 4, things stay more or less the same. For -j 12, as an example, there is a decent improvement, and I suspect we would see a similar trend across most compilers if you're able to use that many threads.

Using gfortran/gcc (configure option 34, MPI-enabled) with ALL PR changes (#2070, #2069, #2068):

Command                        Category   Before        After
time ./compile em_real -j 4    real       7m41.110s     7m55.655s
                               user       15m22.460s    16m45.250s
                               sys        0m28.922s     0m31.553s
time ./compile_new -j 4        real       5m19.380s     5m21.952s
                               user       14m44.306s    16m16.393s
                               sys        0m26.598s     0m31.263s
time ./compile em_real -j 12   real       7m22.949s     6m36.738s
                               user       20m49.918s    19m31.420s
                               sys        0m42.890s     0m39.320s
time ./compile_new -j 12       real       4m25.744s     3m41.141s
                               user       20m33.427s    22m43.411s
                               sys        0m36.085s     0m41.981s
weiwangncar commented 1 month ago

I tested the code before and after this PR, and the model produces identical results in my test. Also, with 4 processors, the compile time is about 12 minutes!

weiwangncar commented 1 month ago

The regression test results:

Test Type             | Expected | Received | Failed
======================|==========|==========|=======
Number of Tests       |    23    |    24    |
Number of Builds      |    60    |    57    |
Number of Simulations |   158    |   150    |   0
Number of Comparisons |    95    |    86    |   0

Failed Simulations are: 
None
Which comparisons are not bit-for-bit: 
None