Closed by Xarthisius 4 years ago
I poked at it a little and was able to get a minor speedup (in the neighborhood of 20%, maybe?) at the cost of some additional complexity, by applying this diff:
diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index 147064c28..cfc1d79e2 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -887,7 +887,9 @@ cdef class SPHKernelInterpolationTable:
cdef public object kernel_name
cdef kernel_func kernel
cdef np.float64_t[::1] table
+ cdef np.float64_t[::1] dtable
cdef np.float64_t[::1] q2_vals
+ cdef np.float64_t[::1] q2_over_range
cdef np.float64_t q2_range, iq2_range
def __init__(self, kernel_name):
@@ -926,15 +928,22 @@ cdef class SPHKernelInterpolationTable:
cdef int i
self.table = cvarray(format="d", shape=(TABLE_NVALS,),
itemsize=sizeof(np.float64_t))
+ self.dtable = cvarray(format="d", shape=(TABLE_NVALS,),
+ itemsize=sizeof(np.float64_t))
self.q2_vals = cvarray(format="d", shape=(TABLE_NVALS,),
itemsize=sizeof(np.float64_t))
+ self.q2_over_range = cvarray(format="d", shape=(TABLE_NVALS,),
+ itemsize=sizeof(np.float64_t))
# We run from 0 to 1 here over R
for i in range(TABLE_NVALS):
self.q2_vals[i] = i * 1.0/(TABLE_NVALS-1)
self.table[i] = self.integrate_q2(self.q2_vals[i])
-
self.q2_range = self.q2_vals[TABLE_NVALS-1] - self.q2_vals[0]
self.iq2_range = (TABLE_NVALS-1)/self.q2_range
+ for i in range(TABLE_NVALS - 1):
+ # we prohibit accessing the final value
+ self.dtable[i] = self.table[i+1] - self.table[i]
+ self.q2_over_range[i] = self.q2_vals[i] * self.iq2_range
@cython.initializedcheck(False)
@cython.boundscheck(False)
@@ -942,13 +951,15 @@ cdef class SPHKernelInterpolationTable:
@cython.cdivision(True)
cdef inline np.float64_t interpolate(self, np.float64_t q2) nogil:
cdef int index
+ cdef np.float64_t qf
cdef np.float64_t F_interpolate
- index = <int>((q2 - self.q2_vals[0])*(self.iq2_range))
+ qf = q2 * self.iq2_range
+ index = <int>qf
+ # q2_vals[0] is by construction 0
if index >= TABLE_NVALS:
return 0.0
F_interpolate = self.table[index] + (
- (self.table[index+1] - self.table[index])
- *(q2 - self.q2_vals[index])*self.iq2_range)
+ self.dtable[index] * (qf - self.q2_over_range[index]))
return F_interpolate
def interpolate_array(self, np.float64_t[:] q2_vals):
but I'm not sure it's worth the additional complexity, and the speedup from parallelization is better anyway. This is one of those things that would be really well-suited to GPU computation.
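To make the idea in the diff concrete, here is a minimal standalone sketch in plain Python/NumPy (hypothetical names, not the actual yt implementation): the per-bin slope (dtable) and the scaled bin origin (q2_over_range) are precomputed once, so the hot interpolate() path reduces to an index computation plus a single multiply-add. Note the sketch guards with TABLE_NVALS - 1 so the slope lookup stays in bounds.

import numpy as np

TABLE_NVALS = 512  # hypothetical table size

def build_table(integrate_q2):
    # uniform q2 grid on [0, 1], as in the table above
    q2_vals = np.linspace(0.0, 1.0, TABLE_NVALS)
    table = np.array([integrate_q2(q2) for q2 in q2_vals])
    iq2_range = (TABLE_NVALS - 1) / (q2_vals[-1] - q2_vals[0])
    # these two arrays are what the diff caches, dropping one subtraction
    # and one multiplication from every interpolate() call
    dtable = table[1:] - table[:-1]
    q2_over_range = q2_vals[:-1] * iq2_range
    return table, dtable, q2_over_range, iq2_range

def interpolate(q2, table, dtable, q2_over_range, iq2_range):
    # q2_vals[0] is 0 by construction, so its subtraction can be dropped
    qf = q2 * iq2_range
    index = int(qf)
    if index >= TABLE_NVALS - 1:  # dtable has TABLE_NVALS - 1 entries
        return 0.0
    return table[index] + dtable[index] * (qf - q2_over_range[index])

The trade-off is two extra table-sized arrays in exchange for fewer arithmetic operations per lookup, which is why the gain is modest.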
Just noting that this method went from parallel to serial with https://github.com/yt-project/yt/commit/afc2b0c11633aa873f567d6fc5999db7f906509b, which also shows there are other places where parallelization should be restored.
@AshKelly do you know if the cython issue in question has been fixed?
Looks like the issue is still open - https://github.com/cython/cython/issues/2316
But I will check out this repo https://github.com/ngoldbaum/cython-issue and confirm whether it has been patched.
Oh, looks like I didn't know OpenMP 4.5's reductions were a thing (not surprising, since the last time I used OpenMP it was at 2.0 :P), and I think my implementation works around the issue.
Yeah, local buffers avoid it completely. We toyed with the idea at the time, but Nathan didn't like the extra memory overhead, and I think we hoped for a relatively fast fix on the cython side. I think it's complicated to fix from their side, since they didn't specify a minimum version of OpenMP / gcc.
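For context, here is a minimal Cython sketch of the local-buffer workaround (hypothetical names; not the actual yt kernel): each OpenMP thread accumulates into its own row of a scratch array, so cython never has to infer a reduction over a memoryview inside prange, and the rows are summed serially at the end.

# cython: boundscheck=False, wraparound=False
import numpy as np
cimport numpy as np
from cython.parallel import prange, threadid

def scatter_add_parallel(np.float64_t[::1] contributions,
                         np.int64_t[::1] pixel_index,
                         int npix, int num_threads):
    # one scratch row per thread: this is the extra memory overhead
    # mentioned above (num_threads * npix extra doubles)
    cdef np.float64_t[:, ::1] local = np.zeros((num_threads, npix))
    cdef np.float64_t[::1] buff = np.zeros(npix)
    cdef Py_ssize_t i, t, j
    cdef Py_ssize_t n = contributions.shape[0]
    with nogil:
        for i in prange(n, num_threads=num_threads):
            # each thread writes only to its own row, so there is no race
            local[threadid(), pixel_index[i]] += contributions[i]
    # serial reduction over the per-thread rows
    for t in range(num_threads):
        for j in range(npix):
            buff[j] += local[t, j]
    return np.asarray(buff)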
As per Matt's suggestion, it's also viable to parallelize the inner loops over pixels. That keeps writes to buff race-free (no additional buffers needed) and yields a similar speedup in the case presented here.
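A rough Cython sketch of that strategy (again with hypothetical names; the real entry point is pixelize_sph_kernel_projection): the outer loop over particles stays serial, and the loop over the pixels each particle covers runs in prange, so every buff element is written by at most one thread at a time.

# cython: boundscheck=False, wraparound=False
cimport numpy as np
from cython.parallel import prange

def deposit_inner_parallel(np.float64_t[:, ::1] buff,
                           np.int64_t[:, ::1] bounds,
                           np.float64_t[::1] weight,
                           int num_threads):
    # bounds[p] holds (x0, x1, y0, y1), the pixel box particle p covers
    cdef Py_ssize_t p, xi, yi
    cdef Py_ssize_t nparticles = weight.shape[0]
    with nogil:
        for p in range(nparticles):  # serial loop over particles
            # parallel loop over this particle's pixels: each (xi, yi) is
            # visited by exactly one thread, so writes to buff cannot race
            for xi in prange(bounds[p, 0], bounds[p, 1],
                             num_threads=num_threads):
                for yi in range(bounds[p, 2], bounds[p, 3]):
                    buff[xi, yi] += weight[p]  # kernel evaluation elided

One caveat: this opens a parallel region per particle, so it pays off mainly when each particle's kernel covers many pixels.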
Bug report
Bug summary
One of the cookbook recipes takes a lot of time on the testing box. I think that pixelize_sph_kernel_projection could easily be parallelized to circumvent this.