Closed dleunji closed 1 year ago
Thank you for reaching out. Sorry for the confusion: I found mistakes in the test code and fixed them in the latest commit.
This test multiplies two n-by-n matrices, A and B, stored in device memory. The matmul function splits the result matrix C into 32-by-32 submatrices and assigns one thread block (CTA) to each submatrix. Each CTA therefore multiplies a 32-by-n slice of A by an n-by-32 slice of B. Inside the matmul kernel, these slices are split again along the shared dimension into 32-by-256 and 256-by-32 tiles, and the products of the tile pairs are accumulated. The variables `block_c_row` and `block_c_col` give the upper-left position of the submatrix of C that the CTA computes. In the latest commit, I also added a function that copies a matrix from device memory to shared memory, to make the code easier to follow.
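To make the tiling concrete, here is a minimal host-side C++ sketch of the same decomposition. It is not the kernel from the repository: the loops over `block_c_row` and `block_c_col` stand in for the CTA grid, the `a_tile` and `b_tile` buffers stand in for shared memory, and the function name `tiled_matmul`, the row-major layout, and the requirement that n be a multiple of 256 are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 32;      // edge of the C submatrix handled by one CTA
constexpr std::size_t K_CHUNK = 256;  // slice length along the shared dimension

// Hypothetical CPU emulation of the tiling described above.
// Assumes row-major storage and n divisible by K_CHUNK (and therefore by TILE).
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t n) {
    assert(n % K_CHUNK == 0);
    std::vector<float> C(n * n, 0.0f);
    // Stand-ins for the shared-memory buffers of one CTA.
    std::vector<float> a_tile(TILE * K_CHUNK);
    std::vector<float> b_tile(K_CHUNK * TILE);

    // Each (block_c_row, block_c_col) pair plays the role of one CTA and
    // marks the upper-left corner of its 32-by-32 submatrix of C.
    for (std::size_t block_c_row = 0; block_c_row < n; block_c_row += TILE) {
        for (std::size_t block_c_col = 0; block_c_col < n; block_c_col += TILE) {
            for (std::size_t k0 = 0; k0 < n; k0 += K_CHUNK) {
                // Copy a 32-by-256 tile of A ("device to shared memory").
                for (std::size_t i = 0; i < TILE; ++i)
                    for (std::size_t k = 0; k < K_CHUNK; ++k)
                        a_tile[i * K_CHUNK + k] = A[(block_c_row + i) * n + (k0 + k)];
                // Copy a 256-by-32 tile of B.
                for (std::size_t k = 0; k < K_CHUNK; ++k)
                    for (std::size_t j = 0; j < TILE; ++j)
                        b_tile[k * TILE + j] = B[(k0 + k) * n + (block_c_col + j)];
                // Multiply the two tiles and accumulate into the C submatrix.
                for (std::size_t i = 0; i < TILE; ++i) {
                    for (std::size_t j = 0; j < TILE; ++j) {
                        float acc = 0.0f;
                        for (std::size_t k = 0; k < K_CHUNK; ++k)
                            acc += a_tile[i * K_CHUNK + k] * b_tile[k * TILE + j];
                        C[(block_c_row + i) * n + (block_c_col + j)] += acc;
                    }
                }
            }
        }
    }
    return C;
}
```

In the real kernel the two outer loops disappear because every CTA runs concurrently, and the threads of a CTA share the copy and multiply work; the sketch only shows which elements each CTA reads and writes.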
Let me know if you have any questions. Thanks.
Thanks a lot. It helped me understand the code and the paper!
I agree with your approach, so I would like to understand the matmul example in more detail.
I got stuck on how a_ptr and b_ptr are mapped to F32_smem.
Thank you in advance.