wrf-model / WRF

The official repository for the Weather Research and Forecasting (WRF) model

Breakout deallocation calls into simpler smaller files #2070

Open islas opened 3 months ago

islas commented 3 months ago

TYPE: enhancement

KEYWORDS: intel, compilation, llvm, memory

SOURCE: internal

DESCRIPTION OF CHANGES: Problem: The Intel oneAPI compilers (and others, such as nvhpc) struggle with some of the larger (15k+ lines of code) files within WRF. Compiling these files demands far more memory than the average user has outside a resource-rich environment. This often limits compilation to a single thread, if it succeeds at all, or forces it onto a dedicated machine with enough memory. A user with neither option cannot build these configurations at all.
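For context, a minimal sketch of the kind of repetitive, registry-generated pattern that makes deallocs.inc so large (this is an illustration, not a verbatim excerpt; the stand-in type and field name are hypothetical):

```fortran
! Illustration only, not a verbatim WRF excerpt: deallocs.inc repeats a
! guarded-deallocate pattern like this for thousands of registry fields,
! all inside the single module_domain compilation unit.
program dealloc_pattern
  implicit none
  ! Stand-in for WRF's domain derived type; the real type carries
  ! thousands of registry-generated pointer fields.
  type :: domain
    real, pointer :: u_2(:,:,:) => null()
  end type domain
  type(domain) :: grid
  integer :: ierr

  allocate( grid%u_2(10,10,10) )

  ! One block like this per field; stacked 15k+ lines deep in one
  ! compilation unit, this is what drives the compiler memory spike.
  if ( associated( grid%u_2 ) ) then
    deallocate( grid%u_2, stat=ierr )
    if ( ierr /= 0 ) stop 'Failed to deallocate u_2.'
    nullify( grid%u_2 )
  end if
  ! ... and so on for every other state field ...
end program dealloc_pattern
```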

Solution: This PR focuses on the deallocs.inc sections of code used in module_domain, reducing the include size to manageable levels. The include is instead broken out into many smaller files containing external subroutines. The files are fully generated source code from the registry, and the calls to the subroutines are generated as well. This also makes it relatively easy, from a source-code perspective, to change the number of files generated; build rules would need to be modified accordingly, as seen in these changes.
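A minimal sketch of the breakout described above, using hypothetical names (deallocs_1, deallocs_2, demo_domain_type); the real file count, names, and contents come from the registry generator:

```fortran
! Illustration only: the generator emits N small files, each holding one
! external subroutine that deallocates a slice of the state fields, and
! it also emits the matching call sequence for module_domain.

module demo_domain_type            ! stand-in for WRF's domain type module
  implicit none
  type :: domain
    real, pointer :: u_2(:,:,:) => null()
    real, pointer :: v_2(:,:,:) => null()
  end type domain
end module demo_domain_type

subroutine deallocs_1( grid )      ! would live in generated file 1
  use demo_domain_type, only: domain
  implicit none
  type(domain), intent(inout) :: grid
  integer :: ierr
  if ( associated( grid%u_2 ) ) then
    deallocate( grid%u_2, stat=ierr )
    if ( ierr /= 0 ) stop 'Failed to deallocate u_2.'
    nullify( grid%u_2 )
  end if
end subroutine deallocs_1

subroutine deallocs_2( grid )      ! would live in generated file 2
  use demo_domain_type, only: domain
  implicit none
  type(domain), intent(inout) :: grid
  integer :: ierr
  if ( associated( grid%v_2 ) ) then
    deallocate( grid%v_2, stat=ierr )
    if ( ierr /= 0 ) stop 'Failed to deallocate v_2.'
    nullify( grid%v_2 )
  end if
end subroutine deallocs_2

program demo                       ! generated calls replace the old include
  use demo_domain_type, only: domain
  implicit none
  type(domain) :: grid
  allocate( grid%u_2(10,10,10), grid%v_2(10,10,10) )
  call deallocs_1( grid )          ! each subroutine compiles independently
  call deallocs_2( grid )
end program demo
```

Because each generated file is an independent compilation unit, the compiler's peak memory is bounded by the largest slice rather than by the whole include, and the slices can be compiled in parallel.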

TESTS CONDUCTED: Attached to this PR are plots of the respective effects of these changes. Changes were tested with the Intel and GCC compilers, but only Intel memory usage is shown, since that compiler exhibits the memory problem most severely.

islas commented 3 months ago

Highlighted is the region of compilation where memory usage spikes, which this PR addresses (module_domain), before these changes take place: [plot: pre_module_domain]

This usage then drops when using this PR's edits: [plot: post_module_domain_dealloc_split]

Zooming in, we can now see that the deallocs_* routines comprise a number of smaller compilation units: [plot: post_module_domain_dealloc_split_zoom]

weiwangncar commented 3 months ago

@islas This is so cool! Thanks for working on this! Does this affect compile time in any way?

islas commented 3 months ago

It only affects compile times as the number of threads goes up. For a typical compilation with -j 4, things stay more or less the same. For -j 12, as an example, there is a decent improvement, and I suspect we would see a similar trend across most compilers if you're able to use that many threads.

Using gfortran/gcc (configure option 34, MPI-enabled) with ALL PR changes (#2070, #2069, #2068):

Command                        Category   Before        After
time ./compile em_real -j 4    real       7m41.110s     7m55.655s
                               user       15m22.460s    16m45.250s
                               sys        0m28.922s     0m31.553s
time ./compile_new -j 4        real       5m19.380s     5m21.952s
                               user       14m44.306s    16m16.393s
                               sys        0m26.598s     0m31.263s
time ./compile em_real -j 12   real       7m22.949s     6m36.738s
                               user       20m49.918s    19m31.420s
                               sys        0m42.890s     0m39.320s
time ./compile_new -j 12       real       4m25.744s     3m41.141s
                               user       20m33.427s    22m43.411s
                               sys        0m36.085s     0m41.981s
weiwangncar commented 1 month ago

I tested the code before and after this PR, and the model produces identical results in my test. Also, with 4 processors, the compile time is about 12 minutes!

weiwangncar commented 1 month ago

The regression test results:

Test Type             | Expected | Received | Failed
======================|==========|==========|=======
Number of Tests       |    23    |    24    |
Number of Builds      |    60    |    57    |
Number of Simulations |   158    |   150    |   0
Number of Comparisons |    95    |    86    |   0

Failed Simulations are: 
None
Which comparisons are not bit-for-bit: 
None