Closed goncalo-andrade closed 3 years ago
Hi Gonçalo!
Great question. In fact, in our old (now deprecated) SENR code (https://bitbucket.org/zach_etienne/nrpy/src/master/; which didn't have reference metric precomputation [rfm_precompute]), I did find that moving the OMP pragma from the i2 loop to the i1 or i0 loop improved performance. As a warning, though, it wasn't an earth-shattering performance boost! In other words, it may be that the compiler/OpenMP is cleverer than we think about parallelizing the outermost loop, or we're simply not scaling as well as we hope.
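To make that concrete, here is a rough sketch of what moving the pragma inward would look like in the generated C. The loop bounds, `NGHOSTS`, `Nxx_plus_2NGHOSTS*`, and `IDX3S` just follow typical NRPy+ naming conventions; this is a hand-written illustration, not verbatim generated code:

```c
// Sketch only: names follow typical NRPy+ conventions, not actual generated output.
for (int i2 = NGHOSTS; i2 < Nxx_plus_2NGHOSTS2 - NGHOSTS; i2++) {
  // Pragma moved from the (short, azimuthal) i2 loop down to the i1 loop,
  // so the usable thread count is no longer capped by the number of i2 points.
#pragma omp parallel for
  for (int i1 = NGHOSTS; i1 < Nxx_plus_2NGHOSTS1 - NGHOSTS; i1++) {
    for (int i0 = NGHOSTS; i0 < Nxx_plus_2NGHOSTS0 - NGHOSTS; i0++) {
      const int idx = IDX3S(i0, i1, i2);
      /* ... evaluate the BSSN RHSs at grid point idx ... */
    }
  }
}
```

Note the tradeoff: the parallel region is now entered once per i2 iteration, which adds overhead and is presumably part of why the gain we saw in SENR was modest.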
I've had a few ideas that could be tried with the current version of NRPy+; I think the top one should be tried first in fact (though I'm split on whether 2 or 3 should be next):
Good evening all,
I encountered a possible issue with the OpenMP parallelization in NRPy+/SENR. I am outputting C code for the BSSN RHSs the way it's done in the "Colliding black holes" tutorial. Currently, the OMP parallelization only happens on the outermost loop, over i2. The problem is that, in spherical- or cylindrical-like coordinates, this is the azimuthal direction, which, in my experience, does not require many points to accurately sample the physical space (I typically use 16 for what I'm doing). Therefore, if one tries to use more threads than the number of points in direction 2, there is no performance gain.
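For context, the generated loop nest is structured roughly like the following (variable names are only indicative of typical NRPy+ output, not the exact generated code). With the pragma on the outermost loop, at most Nxx2-points'-worth of threads can do useful work:

```c
// Rough sketch of the current structure: only the outermost (i2, azimuthal)
// loop is parallelized, so using more threads than there are i2 points
// gives no additional speedup.
#pragma omp parallel for
for (int i2 = NGHOSTS; i2 < Nxx_plus_2NGHOSTS2 - NGHOSTS; i2++) {
  for (int i1 = NGHOSTS; i1 < Nxx_plus_2NGHOSTS1 - NGHOSTS; i1++) {
    for (int i0 = NGHOSTS; i0 < Nxx_plus_2NGHOSTS0 - NGHOSTS; i0++) {
      /* ... BSSN RHS evaluation at grid point (i0, i1, i2) ... */
    }
  }
}
```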
I tried circumventing this issue with a custom OMP pragma of the form "#pragma omp parallel for collapse(3)", but if rfm_precompute is enabled, lines of code are inserted between the nested loops and compilation fails, because the collapse clause only works if the loops are perfectly nested.
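To illustrate the failure mode (again a hand-written sketch; the rfm_* array names below are hypothetical stand-ins for the precomputed reference-metric quantities): with rfm_precompute enabled, values are read between the loop headers, so the nest is no longer perfectly nested and the compiler rejects collapse(3):

```c
#pragma omp parallel for collapse(3)  // fails to compile: the loops below are not perfectly nested
for (int i2 = NGHOSTS; i2 < Nxx_plus_2NGHOSTS2 - NGHOSTS; i2++) {
  const double f2 = rfm_f2_of_xx2[i2];  // intervening statement between the i2 and i1 headers
  for (int i1 = NGHOSTS; i1 < Nxx_plus_2NGHOSTS1 - NGHOSTS; i1++) {
    const double f1 = rfm_f1_of_xx1[i1];  // another intervening statement
    for (int i0 = NGHOSTS; i0 < Nxx_plus_2NGHOSTS0 - NGHOSTS; i0++) {
      /* ... BSSN RHS evaluation using f1 and f2 ... */
    }
  }
}
```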
Is there a good way of doing this currently implemented in NRPy+?
Thank you :)