The latest commit improved the performance of the transeq kernel by about 20% with respect to its initial version. The kernel now uses around 62% of the available bandwidth on a single A100. For comparison, the peak bandwidth utilisation of the universal kernel, where we solve a single tridiagonal system and output one result, is around 72% on an A100. A lower utilisation is expected for the fused kernel because it is more complicated, but I think the performance of the transeq kernel can still be improved a bit further. For now I am leaving it as is, because the current performance is not bad at all.
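(Utilisation here is meant in the usual sense: the effective bandwidth of the kernel, i.e. the bytes it reads and writes divided by its runtime, as a fraction of the device's peak bandwidth,

$$\text{utilisation} = \frac{(B_{\text{read}} + B_{\text{write}})\,/\,t_{\text{kernel}}}{BW_{\text{peak}}}.$$)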
To save memory bandwidth on GPUs, we fuse some of the operations in the transport equation.
This fused kernel can evaluate $${RHS}_x^u \leftarrow -\frac{1}{2} \bigg(u\frac{\partial u}{\partial x} + \frac{\partial (uu)}{\partial x}\bigg) + \nu \frac{\partial^2 u}{\partial x^2}$$ or $${RHS}_z^v \leftarrow -\frac{1}{2} \bigg(w\frac{\partial v}{\partial z} + \frac{\partial (vw)}{\partial z}\bigg) + \nu \frac{\partial^2 v}{\partial z^2}$$ and similar groups of terms in the transport equation, depending on its inputs. The kernel is executed 3 times per direction, so 9 times in total per timestep, to evaluate all the terms in the transport equation; a simplified sketch of such a fused kernel is given after the equations below.
$${RHS}^u = {RHS}_x^u + {RHS}_y^u + {RHS}_z^u$$
$${RHS}^v = {RHS}_x^v + {RHS}_y^v + {RHS}_z^v$$
$${RHS}^w = {RHS}_x^w + {RHS}_y^w + {RHS}_z^w$$
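To make the idea of the fusion concrete, below is a minimal CUDA sketch of a kernel of this shape. It is not the actual transeq implementation: the real kernel evaluates the derivatives with compact (tridiagonal) schemes, whereas the sketch uses plain second-order central differences, and the kernel name, arguments, and grid parameters are all hypothetical. The point it illustrates is that the convective and diffusive contributions are accumulated into the RHS in a single pass, so each velocity field is read from global memory once rather than once per term.

```cuda
// Minimal sketch of a fused RHS kernel (hypothetical names/parameters).
// The real transeq kernel uses compact (tridiagonal) finite-difference
// schemes; plain 2nd-order central differences stand in here so that the
// fusion of the convective and diffusive terms stays visible.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Evaluates  rhs = -0.5*(conv*du/dx + d(u*conv)/dx) + nu*d2u/dx2  in one pass.
//   u    : transported velocity component (u for RHS_x^u, v for RHS_z^v)
//   conv : convecting velocity component  (u for RHS_x^u, w for RHS_z^v)
__global__ void transeq_fused_sketch(double *rhs, const double *u,
                                     const double *conv,
                                     int n, double dx, double nu)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;   // interior points only in this sketch

    double um = u[i - 1], uc = u[i], up = u[i + 1];
    double cm = conv[i - 1], cc = conv[i], cp = conv[i + 1];

    double dudx   = (up - um) / (2.0 * dx);             // du/dx
    double duconv = (up * cp - um * cm) / (2.0 * dx);   // d(u*conv)/dx
    double d2udx2 = (up - 2.0 * uc + um) / (dx * dx);   // d2u/dx2

    // Skew-symmetric convective terms plus diffusion, accumulated together
    rhs[i] = -0.5 * (cc * dudx + duconv) + nu * d2udx2;
}

int main()
{
    const int n = 1 << 20;
    const double dx = 1.0 / (n - 1), nu = 1e-3;

    double *u, *rhs;
    cudaMallocManaged(&u, n * sizeof(double));
    cudaMallocManaged(&rhs, n * sizeof(double));
    for (int i = 0; i < n; ++i) u[i] = sin(2.0 * M_PI * i * dx);

    // RHS_x^u case: the transported and convecting fields are both u
    int block = 256, grid = (n + block - 1) / block;
    transeq_fused_sketch<<<grid, block>>>(rhs, u, u, n, dx, nu);
    cudaDeviceSynchronize();

    printf("rhs[n/2] = %g\n", rhs[n / 2]);
    cudaFree(u);
    cudaFree(rhs);
    return 0;
}
```

The ${RHS}_z^v$ case reuses the same kernel shape with $v$ as the transported field and $w$ as the convecting one (the `conv` argument above); only the direction of the derivative stencil changes.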