Current Performance

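Before searching for a local work size, GEMM performance is as follows (each line lists global work size, local work size, loop increment, run count, problem size, time, and throughput):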

float4

[256,256,1] [1,1,1] p+=8 100times 1024x1024x1024 0.220724 s 9.729284 GFLOPS
[256,256,1] [4,4,1] p+=8 100times 1024x1024x1024 0.081261 s 26.427127 GFLOPS
[256,256,1] [4,4,1] p+=12 100times 1024x1024x1020 0.075207 s 28.442905 GFLOPS

half4

[256,256,1] [1,1,1] p+=8 100times 1024x1024x1024 0.103472 s 20.754221 GFLOPS
[256,256,1] [4,4,1] p+=8 100times 1024x1024x1024 0.061210 s 35.084041 GFLOPS
[256,256,1] [4,4,1] p+=12 100times 1024x1024x1020 0.058183 s 36.765208 GFLOPS

Since the current focus is FP32, half-type performance is not a concern for now; FP16 optimization will start after FP32 is optimized.
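The GFLOPS figures above are consistent with counting 2*M*N*K floating-point operations per GEMM and dividing by the listed time, which suggests the time column is the average for a single run. The helper below is a minimal sketch of that arithmetic; the function name `gemm_gflops` is made up for illustration and is not taken from the project.

```python
def gemm_gflops(m, n, k, seconds):
    """Throughput of one MxNxK GEMM, counting one multiply and one add per inner step."""
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9


# Inputs copied from the float4 rows above (problem size and time per run).
print(round(gemm_gflops(1024, 1024, 1024, 0.220724), 2))
print(round(gemm_gflops(1024, 1024, 1020, 0.075207), 2))
```

Because the logged times are rounded to six decimals, recomputed values can differ from the logged GFLOPS in the last few digits.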

Search Local Work Size

Here I search for the optimal local work size(s) for FP32 float4 using ./mat-mult/hyper-opt/. The best-performing local work sizes (above 28 GFLOPS) for 1024x1024x1020 and 2048x2048x2040 are listed below; each result is the average of 100 executions:

# For 1024x1024x1020
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 32, 1} 0.074372 s 28.762086 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 32, 1} 0.074420 s 28.743674 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 32, 1} 0.073848 s 28.966047 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 16, 1} 0.073795 s 28.986889 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 16, 1} 0.074167 s 28.841656 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 16, 1} 0.074154 s 28.846475 GFLOPS

# For 2048x2048x2040
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 32, 1} 0.590064 s 29.001546 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 8, 1} 0.591137 s 28.948886 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 8, 1} 0.595140 s 28.754175 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 16, 1} 0.594070 s 28.805964 GFLOPS

A rough rule emerges: fast settings tend to have a first dimension that is a multiple of 4 and a second dimension that is a multiple of 16. This does not always hold (for example, {1, 16, 1} and {1, 8, 1} are also among the best results), but the local work sizes listed above are reliably fast.
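The search space behind these results can be sketched as follows. This is an illustration, not the script in ./mat-mult/hyper-opt/: the candidate values and the 256 work-item limit are assumptions inferred from the logged settings and from the "max_256" part of the log file names. The divisibility check reflects the OpenCL 1.x requirement that each global dimension be evenly divisible by the corresponding local dimension.

```python
from itertools import product


def candidate_local_sizes(global_size, max_work_group_size=256,
                          dims=(1, 4, 8, 16, 32, 64, 128)):
    """Return (x, y, 1) local sizes that divide global_size and fit in one work-group.

    max_work_group_size and dims are assumptions inferred from the logs.
    """
    gx, gy = global_size[0], global_size[1]
    return [
        (x, y, 1)
        for x, y in product(dims, repeat=2)
        if x * y <= max_work_group_size and gx % x == 0 and gy % y == 0
    ]


candidates = candidate_local_sizes((256, 256, 1))
print(len(candidates))
```

Under these assumptions the function yields 28 candidates for both {256, 256, 1} and {512, 512, 1}, which matches the number of lines in each detailed log.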

The results above are a summary. More detailed results are below:

1024x1024x1020

lws_calc_fp32_float4_max_256_gls_1024x1024.log

$ cat lws_calc_fp32_float4_max_256_gls_1024x1024.log | grep "1.00 CL_GPU"
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 64, 1} 0.152138 s 14.060269 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 64, 1} 0.075522 s 28.323962 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {64, 1, 1} 0.216090 s 9.899101 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 1, 1} 0.325693 s 6.567833 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 1, 1} 0.216341 s 9.887623 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {32, 1, 1} 0.216321 s 9.888505 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 1, 1} 0.216060 s 9.900455 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {128, 1, 1} 0.212686 s 10.057528 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {16, 1, 1} 0.215956 s 9.905246 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {64, 4, 1} 0.190076 s 11.253866 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 4, 1} 0.077135 s 27.731752 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 4, 1} 0.076620 s 27.918146 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {32, 4, 1} 0.076803 s 27.851552 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 4, 1} 0.076623 s 27.917206 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {16, 4, 1} 0.076525 s 27.952771 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 32, 1} 0.074372 s 28.762086 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 32, 1} 0.074420 s 28.743674 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 32, 1} 0.073848 s 28.966047 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 8, 1} 0.074966 s 28.534328 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 8, 1} 0.075136 s 28.469790 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {32, 8, 1} 0.078450 s 27.266819 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 8, 1} 0.075476 s 28.341420 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {16, 8, 1} 0.075259 s 28.423166 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 128, 1} 0.203697 s 10.501347 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 16, 1} 0.073795 s 28.986889 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 16, 1} 0.074167 s 28.841656 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 16, 1} 0.074154 s 28.846475 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {16, 16, 1} 0.076796 s 27.854243 GFLOPS

2048x2048x2040

lws_calc_fp32_float4_max_256_gls_2048_2048.log

$ cat lws_calc_fp32_float4_max_256_gls_2048_2048.log | grep "1.00 CL_GPU"
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 64, 1} 1.618282 s 10.574650 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 64, 1} 0.606946 s 28.194880 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {64, 1, 1} 5.155221 s 3.319501 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 1, 1} 2.594596 s 6.595539 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 1, 1} 5.162974 s 3.314516 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {32, 1, 1} 5.175718 s 3.306355 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 1, 1} 5.152315 s 3.321373 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {128, 1, 1} 5.209523 s 3.284900 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {16, 1, 1} 5.151586 s 3.321843 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {64, 4, 1} 1.828126 s 9.360821 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 4, 1} 0.637480 s 26.844399 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 4, 1} 0.623718 s 27.436699 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {32, 4, 1} 0.616397 s 27.762548 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 4, 1} 0.611600 s 27.980329 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {16, 4, 1} 0.612205 s 27.952679 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 32, 1} 0.599180 s 28.560282 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 32, 1} 0.606689 s 28.206826 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 32, 1} 0.590064 s 29.001546 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 8, 1} 0.591137 s 28.948886 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 8, 1} 0.595140 s 28.754175 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {32, 8, 1} 0.619850 s 27.607898 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 8, 1} 0.597600 s 28.635798 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {16, 8, 1} 0.598308 s 28.601906 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 128, 1} 1.676186 s 10.209347 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 16, 1} 0.600695 s 28.488251 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 16, 1} 0.594070 s 28.805964 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 16, 1} 0.598946 s 28.571441 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {16, 16, 1} 0.631024 s 27.119014 GFLOPS
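Picking the best settings out of logs like these can be automated instead of read by eye. The sketch below parses lines in the format shown above and ranks them by GFLOPS; the regular expression and the names `parse_line` and `top_settings` are illustrative assumptions rather than part of the project, and the sample lines are copied from the 1024x1024x1020 log.

```python
import re

# Matches e.g. "... CL_GPU 1024x1024x1020 {256, 256, 1} {1, 16, 1} 0.073795 s 28.986889 GFLOPS"
LINE_RE = re.compile(
    r"CL_GPU\s+(\d+)x(\d+)x(\d+)\s+"
    r"\{(\d+), (\d+), (\d+)\}\s+\{(\d+), (\d+), (\d+)\}\s+"
    r"([\d.]+) s\s+([\d.]+) GFLOPS"
)


def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    nums = [int(g) for g in m.groups()[:9]]
    return {
        "shape": tuple(nums[0:3]),
        "global": tuple(nums[3:6]),
        "local": tuple(nums[6:9]),
        "seconds": float(m.group(10)),
        "gflops": float(m.group(11)),
    }


def top_settings(lines, threshold=28.0):
    """Keep rows at or above threshold GFLOPS, fastest first."""
    rows = [r for r in map(parse_line, lines) if r is not None]
    rows = [r for r in rows if r["gflops"] >= threshold]
    return sorted(rows, key=lambda r: r["gflops"], reverse=True)


sample = [
    ">>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 64, 1} 0.152138 s 14.060269 GFLOPS",
    ">>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 32, 1} 0.073848 s 28.966047 GFLOPS",
    ">>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 16, 1} 0.073795 s 28.986889 GFLOPS",
    ">>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {16, 16, 1} 0.076796 s 27.854243 GFLOPS",
]
for row in top_settings(sample):
    print(row["local"], row["gflops"])
```

The same functions can be fed the lines of a full .log file, since non-matching lines are skipped.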

Optimization for the float type: the next step is to try changing the following points in gemm_interleave_trans.c:

global work size
the shape or load format of aI and bT
the shape or store format of c
try mixed use of loadN, storeN, and floatN
refer to more kernel implementations: "Add other OCL mini-projects/demos from github · Issue #1" and other GEMM implementations in ACL.
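On the first item, the global work size is tied to how many output elements each work-item computes. The logged sizes ({256, 256, 1} for 1024x1024 and {512, 512, 1} for 2048x2048) are consistent with each work-item producing a 4x4 tile of C. That tile shape, the mapping of columns to dimension 0, and the function name below are all assumptions made for illustration; they are not read from gemm_interleave_trans.c.

```python
def global_work_size(m, n, tile_rows=4, tile_cols=4):
    """Global work size when each work-item computes a tile_rows x tile_cols block of C.

    Assumes dimension 0 walks columns of C and dimension 1 walks rows.
    """
    if m % tile_rows or n % tile_cols:
        raise ValueError("matrix dimensions must be multiples of the tile shape")
    return (n // tile_cols, m // tile_rows, 1)


print(global_work_size(1024, 1024))
print(global_work_size(2048, 2048))
```

Changing the tile shape changes the global work size, which in turn changes which local work sizes divide it evenly, so the local work size search would likely need to be rerun after such a change.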

ysh329 / OpenCL-101

gemm optimization for FP32 #23