Current Performance
Without tuning the local work size, GEMM performance is as follows:
float4
[256,256,1] [1,1,1] p+=8 100times 1024x1024x1024 0.220724 s 9.729284 GFLOPS
[256,256,1] [4,4,1] p+=8 100times 1024x1024x1024 0.081261 s 26.427127 GFLOPS
[256,256,1] [4,4,1] p+=12 100times 1024x1024x1020 0.075207 s 28.442905 GFLOPS
half4
[256,256,1] [1,1,1] p+=8 100times 1024x1024x1024 0.103472 s 20.754221 GFLOPS
[256,256,1] [4,4,1] p+=8 100times 1024x1024x1024 0.061210 s 35.084041 GFLOPS
[256,256,1] [4,4,1] p+=12 100times 1024x1024x1020 0.058183 s 36.765208 GFLOPS
Since the current focus is FP32, the half type performance is not a concern for now (FP16 optimization will start once FP32 optimization is done).
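The GFLOPS figures above are consistent with counting 2*M*N*K floating-point operations per run and dividing by the per-run time. The source does not state the formula, so the following is only a minimal sketch under that assumption; the function name `gemm_gflops` is hypothetical and does not come from the repository.

```c
#include <assert.h>
#include <stddef.h>

/* GFLOPS of one C = A * B run, with A of size m x k and B of size k x n.
 * Assumes one multiply and one add per inner-product term: 2 * m * n * k. */
double gemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    assert(seconds > 0.0); /* a zero or negative time is a measurement error */
    double flops = 2.0 * (double)m * (double)n * (double)k;
    return flops / seconds / 1e9;
}
```

For example, `gemm_gflops(1024, 1024, 1024, 0.220724)` gives roughly 9.73, which matches the first float4 row, and `gemm_gflops(1024, 1024, 1020, 0.075207)` gives roughly 28.44, which matches the p+=12 row.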
Search Local Work Size
Here, I search for the optimal local work size(s) for FP32-float4 using ./mat-mult/hyper-opt/. The best-performing local work size settings (above 28 GFLOPS) for 1024x1024x1020 and 2048x2048x2040 are as follows (each result is the average of 100 executions with the corresponding local work size):
A rough pattern emerges: the fastest local work sizes generally have a first dimension that is a multiple of 4 and a second dimension that is a multiple of 16. This does not always hold, but the settings listed above are reliably fast.
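The rule of thumb above can be used to narrow the search space before benchmarking. The following is a minimal sketch, not the actual code in ./mat-mult/hyper-opt/. It assumes a work-group limit of 256 work-items (inferred from the "max_256" part of the log file names) and OpenCL 1.x-style launches, where each local dimension must evenly divide the global dimension. All function names are hypothetical.

```c
#include <stddef.h>

/* Rule of thumb from the text: x is a multiple of 4 and y is a multiple of 16. */
int lws_matches_rule(size_t x, size_t y)
{
    return x % 4 == 0 && y % 16 == 0;
}

/* A 2-D local work size is usable when it divides the global work size in
 * each dimension and its total work-item count fits the work-group limit. */
int lws_is_valid(size_t gx, size_t gy, size_t x, size_t y, size_t max_wg)
{
    return x > 0 && y > 0 && gx % x == 0 && gy % y == 0 && x * y <= max_wg;
}

/* Count the valid candidates for a global size gx x gy. When rule_only is
 * non-zero, keep only the candidates that match the rule of thumb. */
size_t lws_count_candidates(size_t gx, size_t gy, size_t max_wg, int rule_only)
{
    size_t count = 0;
    for (size_t x = 1; x <= max_wg; ++x) {
        for (size_t y = 1; x * y <= max_wg; ++y) {
            if (lws_is_valid(gx, gy, x, y, max_wg) &&
                (!rule_only || lws_matches_rule(x, y))) {
                ++count;
            }
        }
    }
    return count;
}
```

With a global work size of [256,256,1] and a limit of 256, this yields 45 valid candidates, of which 6 match the rule of thumb. Since the rule does not always hold, a filter like this is best treated as a way to order the search rather than to skip candidates outright.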
The results above are a summary. More detailed results are below:
1024x1024x1020
lws_calc_fp32_float4_max_256_gls_1024x1024.log
lws_calc_fp32_float4_max_256_gls_1024x1024
2048x2048x2040
lws_calc_fp32_float4_max_256_gls_2048_2048.log
Optimization for float type. Next step is to try to change these points in gemm_interleave_trans.c:
aI and bT
c
loadN or storeN or floatN