wmmae / wmma_extension

An extension library for the WMMA API (Tensor Core API)
https://arxiv.org/abs/2308.15152
MIT License

In matmul, could you explain the mapping to shared memory in more detail? #2

Closed dleunji closed 1 year ago

dleunji commented 1 year ago

I agree with your approach, so I'd like to understand the matmul example in more detail.

I got stuck on how a_ptr and b_ptr are mapped to F32_smem.

Thank you in advance.

enp1s0 commented 1 year ago

Thank you for reaching out. I'm sorry, but I found mistakes in the test code and fixed them in the latest commit.

This test computes the matrix multiplication of two n-by-n matrices, A and B, stored in device memory. The matmul function splits the resulting matrix C into 32-by-32 submatrices and assigns one thread block (CTA) to each submatrix, so each CTA computes the product of a 32-by-n slice of A and an n-by-32 slice of B. The kernel performs this product by splitting those slices again, into 32-by-256 and 256-by-32 tiles respectively, and accumulating the partial products.

The variables block_c_row and block_c_col hold the upper-left position of the submatrix of C that the CTA computes. In the latest commit, I also added a function that copies a matrix from device memory to shared memory, to make the code easier to follow.
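A minimal sketch of this tiling scheme (not the actual test code): A_smem and B_smem stand in for the shared-memory staging buffers (the F32_smem asked about), the accumulation is a plain scalar dot product rather than WMMA fragments, the layout is assumed column-major, and the K chunk is reduced from 256 to 128 so two float tiles fit in the default 48 KB static shared-memory limit.

```cuda
#include <cuda_runtime.h>

// Tile sizes following the description above: each CTA owns a 32x32 block
// of C and walks the shared dimension in fixed-size chunks. SMEM_K is 128
// here (not 256) so two float tiles fit in 48 KB of static shared memory.
constexpr unsigned SMEM_M = 32;
constexpr unsigned SMEM_N = 32;
constexpr unsigned SMEM_K = 128;

// Scalar stand-in for the WMMA accumulation. All matrices are assumed
// n-by-n and column-major; n is assumed to be a multiple of the tile sizes.
__global__ void matmul_kernel(float* c_ptr, const float* a_ptr,
                              const float* b_ptr, unsigned n) {
  __shared__ float A_smem[SMEM_M * SMEM_K]; // 32x128 slice of A
  __shared__ float B_smem[SMEM_K * SMEM_N]; // 128x32 slice of B

  // Upper-left corner of this CTA's 32x32 submatrix of C.
  const unsigned block_c_row = blockIdx.x * SMEM_M;
  const unsigned block_c_col = blockIdx.y * SMEM_N;

  const unsigned row = threadIdx.x;         // 0..31 within the C tile
  const unsigned col = threadIdx.y;         // 0..31 within the C tile
  const unsigned tid = row + col * SMEM_M;  // 0..1023
  const unsigned num_threads = SMEM_M * SMEM_N;

  float acc = 0.f;
  for (unsigned bk = 0; bk < n; bk += SMEM_K) {
    // Cooperatively stage the next 32x128 piece of A ...
    for (unsigned i = tid; i < SMEM_M * SMEM_K; i += num_threads) {
      const unsigned r = i % SMEM_M, k = i / SMEM_M;
      A_smem[i] = a_ptr[(block_c_row + r) + (bk + k) * n];
    }
    // ... and the matching 128x32 piece of B.
    for (unsigned i = tid; i < SMEM_K * SMEM_N; i += num_threads) {
      const unsigned k = i % SMEM_K, c = i / SMEM_K;
      B_smem[i] = b_ptr[(bk + k) + (block_c_col + c) * n];
    }
    __syncthreads();

    // Accumulate this chunk's contribution (WMMA fragments in the real
    // test; a plain dot product here).
    for (unsigned k = 0; k < SMEM_K; k++) {
      acc += A_smem[row + k * SMEM_M] * B_smem[k + col * SMEM_K];
    }
    __syncthreads();
  }
  c_ptr[(block_c_row + row) + (block_c_col + col) * n] = acc;
}
// Launch example: matmul_kernel<<<dim3(n / 32, n / 32), dim3(32, 32)>>>(C, A, B, n);
```

Each thread in this sketch owns one element of the CTA's 32-by-32 C tile; the actual test distributes that work across warps and WMMA fragments instead.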

Let me know if you have any questions. Thanks.

dleunji commented 1 year ago

Thanks a lot. It helped me understand the code and the paper!