yt-project / yt

Main yt repository
http://yt-project.org

SPH pixelization routine is slow #2682

Closed: Xarthisius closed this issue 4 years ago

Xarthisius commented 4 years ago

Bug report

Bug summary

One of the cookbook recipes takes a long time on the testing box:

$ time python doc/source/cookbook/image_resolution.py 

real    16m6.577s
user    16m5.294s
sys 0m4.467s

I think that pixelize_sph_kernel_projection could easily be parallelized to address this.
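
To make the later discussion concrete, below is a heavily simplified, purely illustrative stand-in for the loop structure involved (made-up names and a flat deposit in place of the SPH kernel; this is not yt's actual signature): an outer loop over particles, each depositing onto a small patch of the image buffer.

# cython: boundscheck=False, wraparound=False
# Illustrative stand-in only -- not yt's real signature or kernel.
cimport numpy as np

def sph_projection_sketch(np.float64_t[:, ::1] buff,
                          np.int64_t[::1] xi, np.int64_t[::1] yi,
                          np.int64_t[::1] half_width,
                          np.float64_t[::1] quantity):
    cdef Py_ssize_t j, x, y, x0, x1, y0, y1
    cdef Py_ssize_t nx = buff.shape[0], ny = buff.shape[1]
    for j in range(xi.shape[0]):  # the loop one would like to hand to prange
        x0 = xi[j] - half_width[j]
        if x0 < 0: x0 = 0
        x1 = xi[j] + half_width[j] + 1
        if x1 > nx: x1 = nx
        y0 = yi[j] - half_width[j]
        if y0 < 0: y0 = 0
        y1 = yi[j] + half_width[j] + 1
        if y1 > ny: y1 = ny
        for x in range(x0, x1):
            for y in range(y0, y1):
                # footprints of different particles overlap, so this is
                # the update that must not race once the particle loop
                # is parallelized
                buff[x, y] += quantity[j]

The natural prange target is the particle loop, and handling the concurrent updates to buff is what the parallelization discussion further down is about.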

matthewturk commented 4 years ago

I poked at it a little bit and was able to get a minor speedup (in the neighborhood of 20%, maybe?) at the cost of some additional complexity, by applying this diff:

diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index 147064c28..cfc1d79e2 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -887,7 +887,9 @@ cdef class SPHKernelInterpolationTable:
     cdef public object kernel_name
     cdef kernel_func kernel
     cdef np.float64_t[::1] table
+    cdef np.float64_t[::1] dtable
     cdef np.float64_t[::1] q2_vals
+    cdef np.float64_t[::1] q2_over_range
     cdef np.float64_t q2_range, iq2_range

     def __init__(self, kernel_name):
@@ -926,15 +928,22 @@ cdef class SPHKernelInterpolationTable:
         cdef int i
         self.table = cvarray(format="d", shape=(TABLE_NVALS,),
                              itemsize=sizeof(np.float64_t))
+        self.dtable = cvarray(format="d", shape=(TABLE_NVALS,),
+                              itemsize=sizeof(np.float64_t))
         self.q2_vals = cvarray(format="d", shape=(TABLE_NVALS,),
                              itemsize=sizeof(np.float64_t))
+        self.q2_over_range = cvarray(format="d", shape=(TABLE_NVALS,),
+                             itemsize=sizeof(np.float64_t))
         # We run from 0 to 1 here over R
         for i in range(TABLE_NVALS):
             self.q2_vals[i] = i * 1.0/(TABLE_NVALS-1)
             self.table[i] = self.integrate_q2(self.q2_vals[i])
-
         self.q2_range = self.q2_vals[TABLE_NVALS-1] - self.q2_vals[0]
         self.iq2_range = (TABLE_NVALS-1)/self.q2_range
+        for i in range(TABLE_NVALS - 1):
+            # we prohibit accessing the final value
+            self.dtable[i] = self.table[i+1] - self.table[i]
+            self.q2_over_range[i] = self.q2_vals[i] * self.iq2_range

     @cython.initializedcheck(False)
     @cython.boundscheck(False)
@@ -942,13 +951,15 @@ cdef class SPHKernelInterpolationTable:
     @cython.cdivision(True)
     cdef inline np.float64_t interpolate(self, np.float64_t q2) nogil:
         cdef int index
+        cdef np.float64_t qf
         cdef np.float64_t F_interpolate
-        index = <int>((q2 - self.q2_vals[0])*(self.iq2_range))
+        qf = q2 * self.iq2_range
+        index = <int>qf
+        # q2_vals[0] is by construction 0
         if index >= TABLE_NVALS:
             return 0.0
         F_interpolate = self.table[index] + (
-                (self.table[index+1] - self.table[index])
-               *(q2 - self.q2_vals[index])*self.iq2_range)
+            self.dtable[index] * (qf - self.q2_over_range[index]))
         return F_interpolate

     def interpolate_array(self, np.float64_t[:] q2_vals):

but I'm not sure it's worth the additional complexity, and the speedup from parallelization is better anyway. This is one of those things that would be really well-suited to GPU computation.

Xarthisius commented 4 years ago

Just noting that this method went from parallel to serial in https://github.com/yt-project/yt/commit/afc2b0c11633aa873f567d6fc5999db7f906509b; that commit also shows there are other places where parallelization should be restored.

matthewturk commented 4 years ago

@AshKelly do you know if the cython issue in question has been fixed?

AshKelly commented 4 years ago

Looks like the issue is still open: https://github.com/cython/cython/issues/2316

But I will check out this repo, https://github.com/ngoldbaum/cython-issue, and confirm whether it has been patched.

Xarthisius commented 4 years ago

Oh, looks like I didn't know that OpenMP 4.5's array reductions are a thing (not surprising, since the last time I used OpenMP it was at 2.0 :P), and I think my implementation works around it.

AshKelly commented 4 years ago

Yeah, local buffers completely avoid it. We toyed with the idea at the time, but Nathan didn't like the extra memory overhead, and I think we hoped for a relatively fast fix on the Cython side. I think it's complicated to fix from their side, since they don't specify a minimum version of OpenMP / gcc.
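
For reference, the thread-local-buffer pattern described here looks roughly like the sketch below (same illustrative names as the sketch in the first comment; not the code that was actually in yt): each thread deposits into its own private copy of the image, so no reduction or synchronization on the shared buff is needed, at the cost of nthreads extra image-sized buffers that are summed serially at the end.

# cython: boundscheck=False, wraparound=False
# Sketch only: per-thread local buffers instead of a reduction on buff.
import numpy as np
cimport numpy as np
cimport openmp
from cython.parallel cimport parallel, prange, threadid

def deposit_local_buffers(np.float64_t[:, ::1] buff,
                          np.int64_t[::1] xi, np.int64_t[::1] yi,
                          np.int64_t[::1] half_width,
                          np.float64_t[::1] quantity):
    cdef Py_ssize_t j, x, y, x0, x1, y0, y1
    cdef Py_ssize_t nx = buff.shape[0], ny = buff.shape[1]
    cdef int tid
    cdef int nthreads = openmp.omp_get_max_threads()
    # one private image per thread: this is the memory overhead in question
    local = np.zeros((nthreads, nx, ny), dtype="float64")
    cdef np.float64_t[:, :, ::1] lbuff = local
    with nogil, parallel():
        tid = threadid()
        for j in prange(xi.shape[0], schedule="dynamic"):
            x0 = xi[j] - half_width[j]
            if x0 < 0: x0 = 0
            x1 = xi[j] + half_width[j] + 1
            if x1 > nx: x1 = nx
            y0 = yi[j] - half_width[j]
            if y0 < 0: y0 = 0
            y1 = yi[j] + half_width[j] + 1
            if y1 > ny: y1 = ny
            for x in range(x0, x1):
                for y in range(y0, y1):
                    # each thread only ever touches its own slice
                    lbuff[tid, x, y] += quantity[j]
    # serial reduction of the thread-local images into the output buffer
    for tid in range(nthreads):
        for x in range(nx):
            for y in range(ny):
                buff[x, y] += lbuff[tid, x, y]

(Compiling this requires OpenMP flags, e.g. -fopenmp with gcc.)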

Xarthisius commented 4 years ago

As per Matt's suggestion, it is also viable to parallelize the inner loops over pixels. That makes writes to buff atomic (no additional buffers are needed) and yields a similar speedup in the case presented here.
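
A minimal sketch of that variant, again with the illustrative names from the earlier sketches (one possible realization, not necessarily what ended up in yt): the particle loop stays serial and prange splits the pixel rows of each particle's footprint across threads.

# cython: boundscheck=False, wraparound=False
# Sketch only: serial particle loop, parallel loop over footprint rows.
cimport numpy as np
from cython.parallel cimport prange

def deposit_parallel_pixels(np.float64_t[:, ::1] buff,
                            np.int64_t[::1] xi, np.int64_t[::1] yi,
                            np.int64_t[::1] half_width,
                            np.float64_t[::1] quantity):
    cdef Py_ssize_t j, x, y, x0, x1, y0, y1
    cdef Py_ssize_t nx = buff.shape[0], ny = buff.shape[1]
    for j in range(xi.shape[0]):  # particles stay serial
        x0 = xi[j] - half_width[j]
        if x0 < 0: x0 = 0
        x1 = xi[j] + half_width[j] + 1
        if x1 > nx: x1 = nx
        y0 = yi[j] - half_width[j]
        if y0 < 0: y0 = 0
        y1 = yi[j] + half_width[j] + 1
        if y1 > ny: y1 = ny
        # threads split the rows of this particle's footprint, so within
        # one particle each pixel is updated by exactly one thread and no
        # extra image buffers are required
        for x in prange(x0, x1, nogil=True, schedule="static"):
            for y in range(y0, y1):
                buff[x, y] += quantity[j]

The trade-off is that an OpenMP parallel region is entered once per particle, so this variant helps most when footprints cover many pixels.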