zachetienne / nrpytutorial

NRPy+, BlackHoles@Home, SENRv2, and the NRPy+ Jupyter Tutorial: Python-Based Code Generation for Numerical Relativity... and Beyond!
https://blackholesathome.net
BSD 2-Clause "Simplified" License

OMP pragmas for higher core counts #21

Closed · goncalo-andrade closed this 3 years ago

goncalo-andrade commented 4 years ago

Good evening all,

I encountered a possible issue with the OpenMP parallelization of NRPy+/SENR. I am outputting C code for the BSSN RHSs in the way it's done in the "Colliding black holes" tutorial. Currently, the OMP parallelization is applied only to the outermost loop, over i2. The problem is that, in spherical- or cylindrical-like coordinates, i2 is the azimuthal direction, which in my experience does not require many points to accurately sample the physical space (I typically use 16). As a result, using more threads than there are points in direction 2 yields no performance gain.
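Schematically, the generated kernel looks like this (a sketch, not the exact NRPy+ output; the NGHOSTS/Nxx* names follow the tutorial's conventions):

```c
// Schematic of the generated loop nest: only the outermost (i2 = azimuthal)
// loop carries the pragma, so with Nxx2 = 16 points, any threads beyond 16
// have nothing to do.
#pragma omp parallel for
for (int i2 = NGHOSTS; i2 < NGHOSTS + Nxx2; i2++) {
  for (int i1 = NGHOSTS; i1 < NGHOSTS + Nxx1; i1++) {
    for (int i0 = NGHOSTS; i0 < NGHOSTS + Nxx0; i0++) {
      // ... evaluate the BSSN right-hand sides at gridpoint (i0,i1,i2) ...
    }
  }
}
```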

I tried circumventing this by using a custom OMP pragma, "#pragma omp parallel for collapse(3)", but if rfm_precompute is enabled, lines of code are inserted between the nested loops, which results in errors, because the collapse clause only works if the loops are perfectly nested.
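For example (a sketch; the rfm_xx* variable names are placeholders for the actual precomputed reads), the inserted lines look like this, and they break the perfect nesting that collapse requires:

```c
// Sketch of the failure mode with rfm_precompute enabled (the rfm_xx*
// names are placeholders, not the actual generated reads):
#pragma omp parallel for collapse(3)
for (int i2 = NGHOSTS; i2 < NGHOSTS + Nxx2; i2++) {
  const double f2 = rfm_xx2[i2]; // statement between the i2 and i1 loops
  for (int i1 = NGHOSTS; i1 < NGHOSTS + Nxx1; i1++) {
    const double f1 = rfm_xx1[i1]; // statement between the i1 and i0 loops
    for (int i0 = NGHOSTS; i0 < NGHOSTS + Nxx0; i0++) {
      // ...
    }
  }
}
// The intervening statements mean the loops are not perfectly nested,
// so the compiler rejects collapse(3).
```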

Is there a good way of doing this in NRPy+ implemented right now?

Thank you :)

zachetienne commented 4 years ago

Hi Gonçalo!

Great question. In fact, in our old (now deprecated) SENR code (https://bitbucket.org/zach_etienne/nrpy/src/master/; which didn't have reference metric precomputation [rfm_precompute]), I indeed found that moving the OMP pragma from the i2 loop to the i1 or i0 loop improved performance. A word of warning, though: it wasn't an earth-shattering boost! In other words, either the compiler/OpenMP is being cleverer than we think about parallelizing the outermost loop, or we're simply not scaling as well as we hope.

I've had a few ideas that could be tried with the current version of NRPy+. I think the first one should be tried first (though I'm split on whether 2 or 3 should be next):

  1. Disable rfm_precompute and benchmark the performance with the OMP pragma hand-placed first on the outer loop and then on the inner loops. What is the speedup when we move to the inner loops? If it's <10%, for example, then we might want to rethink proceeding further.
  2. Create an option to permute the spherical or cylindrical coordinates. Other than in some diagnostics, there should be no hard-coded assumption that, e.g., spherical is (xx0,xx1,xx2)=(r,theta,phi) instead of (xx0,xx1,xx2)=(theta,phi,r); notice the latter is also a right-handed coordinate system. A spherical coordinate system with such permuted coordinates may scale far better in OpenMP loops.
  3. Fix the problem you are seeing with rfm_precompute. I think it may be possible to simply move the pragma from just before the i2 for() statement to just before the i1 for() statement, by hand if necessary (see the sketch below this list). If that works, we could add an option to loop.py to specify the placement of the #pragma omp parallel for.
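Schematically, the hand-moved pragma for idea 3 would look like this (a sketch; the rfm_xx* names are placeholders for the actual rfm_precompute reads):

```c
// Sketch of idea 3: hand-move the pragma from the i2 loop to the i1 loop.
for (int i2 = NGHOSTS; i2 < NGHOSTS + Nxx2; i2++) {
  const double f2 = rfm_xx2[i2]; // i2-level precompute read stays serial
  #pragma omp parallel for
  for (int i1 = NGHOSTS; i1 < NGHOSTS + Nxx1; i1++) {
    const double f1 = rfm_xx1[i1]; // declared inside the loop => thread-private
    for (int i0 = NGHOSTS; i0 < NGHOSTS + Nxx0; i0++) {
      // ... BSSN RHS evaluation at (i0,i1,i2) ...
    }
  }
}
```

This exposes Nxx1 iterations of parallel work per outer iteration (the polar direction typically has many more points than the azimuthal one), at the cost of entering an OpenMP parallel region Nxx2 times per RHS evaluation.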