Open islas opened 3 months ago
Highlighted is the region of compilation where memory usage spikes (`module_domain`), which this PR addresses, before these changes take place:
With this PR's edits, that usage drops:
Zooming in, we can now see that `deallocs_` comprises a number of smaller compilation units:
@islas This is so cool! Thanks for working on this! Does this affect compile time in any way?
It only affects compile times as the number of threads goes up. For typical compilation with `-j 4`, things stay more or less the same. For `-j 12`, as an example, there is a decent improvement, and I suspect we would see a similar trend across most compilers if you're able to use that many threads.
| Command | Category | Before | After |
|---|---|---|---|
| `time ./compile em_real -j 4` | | | |
| | real | 7m41.110s | 7m55.655s |
| | user | 15m22.460s | 16m45.250s |
| | sys | 0m28.922s | 0m31.553s |
| `time ./compile_new -j 4` | | | |
| | real | 5m19.380s | 5m21.952s |
| | user | 14m44.306s | 16m16.393s |
| | sys | 0m26.598s | 0m31.263s |
| `time ./compile em_real -j 12` | | | |
| | real | 7m22.949s | 6m36.738s |
| | user | 20m49.918s | 19m31.420s |
| | sys | 0m42.890s | 0m39.320s |
| `time ./compile_new -j 12` | | | |
| | real | 4m25.744s | 3m41.141s |
| | user | 20m33.427s | 22m43.411s |
| | sys | 0m36.085s | 0m41.981s |
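To make the wall-clock comparison easier to read, here is a small Python sketch that converts the `real` timings from the table into seconds and prints the relative change. The helper names `to_seconds` and `percent_change` are made up for this illustration; only the timing strings come from the table.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '7m22.949s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def percent_change(before, after):
    """Relative change from before to after, in percent (negative = faster)."""
    b, a = to_seconds(before), to_seconds(after)
    return (a - b) / b * 100.0


# 'real' rows copied from the table: (before, after)
real_times = {
    "./compile em_real -j 4": ("7m41.110s", "7m55.655s"),
    "./compile_new -j 4": ("5m19.380s", "5m21.952s"),
    "./compile em_real -j 12": ("7m22.949s", "6m36.738s"),
    "./compile_new -j 12": ("4m25.744s", "3m41.141s"),
}

if __name__ == "__main__":
    for command, (before, after) in real_times.items():
        print(f"{command}: {percent_change(before, after):+.1f}%")
```

By this arithmetic the `-j 4` builds are within a few percent of each other, while the `-j 12` builds finish roughly 10% and 17% sooner in wall-clock time. These are single measurements, so treat the percentages as indicative only.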
I tested the code before and after this PR, and the model produces identical results in my test. Also, with 4 processors, the compile time is about 12 minutes!
The regression test results:
| Test Type | Expected | Received | Failed |
|---|---|---|---|
| Number of Tests | 23 | 24 | |
| Number of Builds | 60 | 57 | |
| Number of Simulations | 158 | 150 | 0 |
| Number of Comparisons | 95 | 86 | 0 |
Failed Simulations are:
None
Which comparisons are not bit-for-bit:
None
TYPE: enhancement
KEYWORDS: intel, compilation, llvm, memory
SOURCE: internal
DESCRIPTION OF CHANGES:
Problem: The Intel oneAPI compilers (and others, such as nvhpc) struggle with some of the larger files (15k+ lines of code) within WRF. Compiling them requires large amounts of memory, often more than is available to users outside a resource-rich environment. This frequently limits compilation to a single thread, if it is possible at all, or to a dedicated environment with enough memory. Users with access to neither cannot use these configurations at all.
Solution: This PR focuses on the `deallocs.inc` sections of code used in `module_domain` to reduce the include size to manageable levels. The include is instead broken out into many smaller files as external subroutines. The files are fully generated source code from the registry, and the calls to the subroutines are generated as well. This also makes it relatively easy to change the number of files generated from a source code perspective. Build rules would need to be modified accordingly, as seen in these changes.
TESTS CONDUCTED: Attached to this PR are plots of the respective effects of these changes. Changes were tested with the Intel and gcc compilers, but only Intel memory usage is shown, as the memory usage issue is most pronounced there.
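For readers who want a feel for the approach described under Solution, here is a rough Python sketch of splitting one large list of generated statements into several external subroutines plus the matching calls. It is not the registry code from this PR: the function name `split_into_units`, the `deallocs_<n>.F` file naming, the `grid` argument, and the placeholder Fortran statements are all assumptions made for illustration.

```python
def split_into_units(statements, num_files, prefix="deallocs"):
    """Spread statements over num_files subroutines.

    Returns (units, calls): units maps a hypothetical file name to its
    generated source text, and calls lists the CALL lines that would
    replace the original single include.
    """
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    # Ceiling division so every statement lands in exactly one unit.
    chunk = -(-len(statements) // num_files)
    units = {}
    calls = []
    for i in range(num_files):
        body = statements[i * chunk:(i + 1) * chunk]
        name = f"{prefix}_{i}"
        source = "\n".join(
            [f"SUBROUTINE {name}( grid )"]
            + [f"  {stmt}" for stmt in body]
            + [f"END SUBROUTINE {name}"]
        )
        units[f"{name}.F"] = source
        calls.append(f"CALL {name}( grid )")
    return units, calls


if __name__ == "__main__":
    # Placeholder statements; the real generated code is more involved.
    stmts = [
        f"IF ( ASSOCIATED( grid%f{i} ) ) DEALLOCATE( grid%f{i} )"
        for i in range(5)
    ]
    units, calls = split_into_units(stmts, num_files=2)
    for filename, source in units.items():
        print(f"! {filename}\n{source}\n")
    print("\n".join(calls))
```

Because the number of units is a single parameter here, changing it only changes how many files come out, which mirrors the point above that the file count is easy to adjust in the generator while the build rules still have to list the resulting files.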