This is due to us exceeding the maximum thread count per block when we set SZ=64. Some of the reorder kernels launch a 2D thread block of SZ*SZ threads, which is 64*64 = 4096 threads at SZ=64. We need to fix these kernels so that, when SZ is 64 or above, they work on tiles of 32 by 32 (1024 threads per block).
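A minimal sketch of the tiling idea, assuming these are CUDA kernels and that the per-block limit is 1024 threads. The names `TILE`, `reorder_transpose`, `launch_reorder`, and `check_reorder` are hypothetical, and a plain transpose stands in for the actual reorder, which is not shown here. The thread block is capped at 32 by 32, and the grid grows to cover the rest of the SZ by SZ block:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical tile edge: 32 * 32 = 1024 threads per block (assumed limit).
constexpr int TILE = 32;

// Stand-in reorder: transposes one SZ x SZ block per grid z-slice.
// Each thread block handles one tile of at most TILE x TILE elements.
template <int SZ>
__global__ void reorder_transpose(const float* in, float* out)
{
    const float* src = in + (size_t)blockIdx.z * SZ * SZ;
    float* dst = out + (size_t)blockIdx.z * SZ * SZ;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < SZ && col < SZ)
        dst[(size_t)col * SZ + row] = src[(size_t)row * SZ + col];
}

// Clamp the thread block edge to TILE and cover the block with a grid of tiles.
template <int SZ>
cudaError_t launch_reorder(const float* in, float* out, int nblocks)
{
    constexpr int edge = SZ < TILE ? SZ : TILE;
    dim3 threads(edge, edge);
    dim3 grid((SZ + edge - 1) / edge, (SZ + edge - 1) / edge, nblocks);
    reorder_transpose<SZ><<<grid, threads>>>(in, out);
    return cudaGetLastError();
}

// Runs the kernel on nblocks blocks and compares against a host transpose.
template <int SZ>
bool check_reorder(int nblocks)
{
    const size_t n = (size_t)nblocks * SZ * SZ;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    bool ok = launch_reorder<SZ>(d_in, d_out, nblocks) == cudaSuccess
           && cudaDeviceSynchronize() == cudaSuccess
           && cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);
    if (!ok) return false;

    for (int b = 0; b < nblocks; ++b)
        for (int r = 0; r < SZ; ++r)
            for (int c = 0; c < SZ; ++c) {
                size_t base = (size_t)b * SZ * SZ;
                if (h_out[base + (size_t)c * SZ + r] != h_in[base + (size_t)r * SZ + c])
                    return false;
            }
    return true;
}

int main()
{
    // SZ=64 would need 4096 threads as a single block; tiled, it uses 2 x 2 tiles.
    bool ok = check_reorder<64>(2) && check_reorder<16>(2);
    std::printf("%s\n", ok ? "reorder ok" : "reorder FAILED");
    return ok ? 0 : 1;
}
```

For SZ below 32 the launch is unchanged (one SZ by SZ thread block per data block), so only the SZ=64 and larger cases take the tiled path. Kernels that rely on `__syncthreads()` or shared memory across the whole SZ by SZ block would need more than this index change.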