xcompact3d / x3d2

https://xcompact3d.github.io/x3d2
BSD 3-Clause "New" or "Revised" License

Add a fused distributed kernel for the transport equation #15

Closed semi-h closed 12 months ago

semi-h commented 12 months ago

In order to save some bandwidth on GPUs we fuse some of the operations in the transport equation.

This fused kernel can evaluate $${RHS}_x^u \leftarrow -\frac{1}{2} \bigg(u\frac{\partial u}{\partial x} + \frac{\partial u u}{\partial x}\bigg) + \nu \frac{\partial^2 u}{\partial x^2}$$ or $${RHS}_z^v \leftarrow -\frac{1}{2} \bigg(w\frac{\partial v}{\partial z} + \frac{\partial v w}{\partial z}\bigg) + \nu \frac{\partial^2 v}{\partial z^2}$$ and similar groups of terms in the transport equation, depending on the inputs. The kernel is executed 3 times per direction, so 9 times in total per timestep, to evaluate all the terms in the transport equation.

$${RHS}^u = {RHS}_x^u + {RHS}_y^u + {RHS}_z^u$$

$${RHS}^v = {RHS}_x^v + {RHS}_y^v + {RHS}_z^v$$

$${RHS}^w = {RHS}_x^w + {RHS}_y^w + {RHS}_z^w$$
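As a rough illustration of how the nine evaluations combine, here is a minimal NumPy sketch. It is not the x3d2 implementation: it assumes a periodic domain and uses second-order central differences as a stand-in for the derivative operators behind the real kernel, and it runs on the CPU. The names `ddx`, `d2dx2`, `transeq_term` and `transeq_rhs` are hypothetical, as is the convention that array axes 0, 1, 2 correspond to x, y, z.

```python
import numpy as np


def ddx(f, h, axis):
    """First derivative along `axis` (2nd-order central, periodic).

    Assumption: a simple stand-in for the derivative operator of the real kernel.
    """
    return (np.roll(f, -1, axis) - np.roll(f, 1, axis)) / (2.0 * h)


def d2dx2(f, h, axis):
    """Second derivative along `axis` (2nd-order central, periodic)."""
    return (np.roll(f, -1, axis) - 2.0 * f + np.roll(f, 1, axis)) / h**2


def transeq_term(q, c, h, nu, axis):
    """One group of terms for transported field q and convecting velocity c:

    -1/2 * (c * dq/ds + d(q*c)/ds) + nu * d2q/ds2, with s the direction `axis`.
    """
    convective = c * ddx(q, h, axis) + ddx(q * c, h, axis)
    return -0.5 * convective + nu * d2dx2(q, h, axis)


def transeq_rhs(u, v, w, h, nu):
    """Sum the three directional terms for each of u, v, w (nine calls in total).

    `h` is a sequence of three grid spacings (hx, hy, hz).
    """
    vel = (u, v, w)  # vel[axis] is the velocity component along that axis
    return tuple(
        sum(transeq_term(q, vel[axis], h[axis], nu, axis) for axis in range(3))
        for q in vel
    )
```

Calling `transeq_rhs(u, v, w, (hx, hy, hz), nu)` returns the three right-hand sides in the order u, v, w. In this sketch each call to `transeq_term` builds several temporary arrays; fusing the operations into one kernel avoids that kind of intermediate memory traffic.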

semi-h commented 12 months ago

The latest commit improved the performance of the transeq kernel by about 20% with respect to its initial version. Bandwidth use relative to the available bandwidth is now around 62% on a single A100. For comparison, the peak bandwidth use of the universal kernel, where we solve a single tridiagonal system and output one result, is around 72% on an A100. Lower utilisation is expected with the fused kernel because it is more complicated, but I think the performance of the transeq kernel can still be improved a bit more. For now I am leaving it as is, because the current performance is not bad at all.