ysh329 closed 6 years ago
Before benchmarking, set performance mode first:
$ sudo ../tools/ck-print-gpu-freq
*** Current GPU frequency:
800000000
*** Min frequency:
200000000
*** Max frequency:
800000000
*** Available GPU frequencies:
200000000 297000000 400000000 500000000 594000000 800000000
*** Current GPU governor:
performance
*** Available GPU governor:
userspace powersave performance simple_ondemand
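A quick way to confirm the GPU is pinned before benchmarking is to parse the report above. The sketch below is only an illustration: the names `parse_gpu_freq_report` and `is_pinned_to_max` are hypothetical, and the parser assumes the `*** label:` / value layout shown in this output.

```python
# Sample text copied from the ck-print-gpu-freq output above.
SAMPLE_REPORT = """\
*** Current GPU frequency:
800000000
*** Min frequency:
200000000
*** Max frequency:
800000000
*** Available GPU frequencies:
200000000 297000000 400000000 500000000 594000000 800000000
*** Current GPU governor:
performance
*** Available GPU governor:
userspace powersave performance simple_ondemand
"""


def parse_gpu_freq_report(text):
    """Map each '*** label:' line to the whitespace-separated values after it."""
    report = {}
    label = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("***"):
            # Strip the asterisks, spaces and trailing colon around the label.
            label = line.strip("* :").lower()
            report[label] = []
        elif label is not None:
            report[label].extend(line.split())
    return report


def is_pinned_to_max(report):
    """True when the governor is 'performance' and the clock sits at its maximum."""
    return (
        report["current gpu governor"] == ["performance"]
        and report["current gpu frequency"] == report["max frequency"]
    )


report = parse_gpu_freq_report(SAMPLE_REPORT)
print(is_pinned_to_max(report))  # prints True
```

If the check fails, the timings that follow may reflect frequency scaling rather than the kernel itself.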
1024 x 1024 x 1024
type | execution time (s) | GFLOPS |
---|---|---|
int (32-bit) | 0.5871 | 3.657402 |
float (32-bit) | 0.5907 | 3.635007 |
double (64-bit) | 0.7542 | 2.847346 |
half (16-bit) | 0.5055 | 4.247821 |
Stable: costs 0.590778 seconds, corresponding to 3.635007 GFLOPS.
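The GFLOPS figures appear to follow the usual convention of counting 2·M·N·K operations for an M x N x K multiplication (one multiply and one add per inner-loop term). This is inferred from the numbers, not confirmed by the benchmark source; the helper names below are hypothetical.

```python
def flop_count(m, n, k):
    """Operations for C[m x n] = A[m x k] * B[k x n]: one multiply and one add per term."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Throughput in GFLOPS for one multiplication that took `seconds`."""
    return flop_count(m, n, k) / seconds / 1e9


print(flop_count(1024, 1024, 1024))                   # 2147483648
print(round(gflops(1024, 1024, 1024, 0.590778), 3))   # 3.635
```

The operation count 2147483648 is the same value the logs in this thread print on their `gflops:` line, which suggests that line reports the raw operation count rather than a rate.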
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0 32.135118
>>> [INFO] skip first 1 time(s)
1 31.683712
2 31.796608
3 32.048721
4 31.663059
5 32.179186
6 31.730488
7 32.137520
8 31.663593
9 31.636981
10 31.600799
>>> [INFO] CPU 1024x1024x1024 31.814067 s 0.067501 GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0 0.578418
>>> [INFO] skip first 1 time(s)
1 0.574513
2 0.624397
3 0.575214
4 0.578982
5 0.575077
6 0.608436
7 0.575884
8 0.624576
9 0.594593
10 0.576112
gflops: 2147483648.000000
ave_duration: 0.590778
>>> [INFO] CL_GPU 1024x1024x1024 0.590778 s 3.635007 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
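The reported `ave_duration` is consistent with dropping the first (warm-up) iteration and averaging the rest. A minimal sketch of that reduction, assuming the `index time` line layout seen in the log; `parse_timings` and `mean_after_warmup` are hypothetical names, not functions from the benchmark.

```python
# GPU iteration lines copied from the float run above.
GPU_LOG = """\
0 0.578418
>>> [INFO] skip first 1 time(s)
1 0.574513
2 0.624397
3 0.575214
4 0.578982
5 0.575077
6 0.608436
7 0.575884
8 0.624576
9 0.594593
10 0.576112
"""


def parse_timings(log_text):
    """Collect the per-iteration times from lines shaped like '<index> <seconds>'."""
    timings = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            timings.append(float(parts[1]))
    return timings


def mean_after_warmup(timings, skip=1):
    """Average the timings after discarding the first `skip` warm-up iterations."""
    kept = timings[skip:]
    if not kept:
        raise ValueError("no timings left after skipping warm-up runs")
    return sum(kept) / len(kept)


timings = parse_timings(GPU_LOG)
print(round(mean_after_warmup(timings), 6))  # 0.590778
```

Raising on an empty sample is a deliberate choice here: averaging zero measurements would otherwise yield a meaningless value, which may be what produces the `nan` lines in runs where the CPU repeat count is 0.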
The benchmark result is not stable: naive matrix multiplication takes 0.58~0.77 seconds, corresponding to 2.9~3.5 GFLOPS.
$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_naive 5 1024 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 5 times CPU starting...
0 49.456371
>>> [INFO] skip first 1 time(s)
1 49.445682
2 51.025633
3 49.424161
4 50.706564
5 50.874110
>>> [INFO] CPU 1024x1024x1024 50.295230 s 0.042698 GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 5 times ./kernel.cl.mat_mult_naive starting ...
0 0.666985
>>> [INFO] skip first 1 time(s)
1 0.628585
2 0.577195
3 0.577648
4 0.576896
5 0.575481
gflops: 2147483648.000000
ave_duration: 0.587161
>>> [INFO] CL_GPU 1024x1024x1024 0.587161 s 3.657402 GFLOPS
>>> [TEST] correct rate: 1.0000
Stable: costs 0.550 seconds, corresponding to 4.21~4.24 GFLOPS.
$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_naive 10 1024 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152
============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0 22.410335
>>> [INFO] skip first 1 time(s)
1 22.575216
2 22.340271
3 22.540361
4 22.486432
5 22.455014
6 22.280294
7 22.382801
8 22.492505
9 22.183580
10 22.428460
>>> [INFO] CPU 1024x1024x1024 22.416493 s 0.095799 GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0 0.511446
>>> [INFO] skip first 1 time(s)
1 0.506768
2 0.504923
3 0.505440
4 0.505335
5 0.505887
6 0.504761
7 0.505133
8 0.505742
9 0.506590
10 0.504916
gflops: 2147483648.000000
ave_duration: 0.505549
>>> [INFO] CL_GPU 1024x1024x1024 0.505549 s 4.247821 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
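The "stable" and "not stable" labels in this thread can be made concrete with a relative spread, (max - min) / mean, over the measured iterations. The sketch below applies it to the half run above and to the float run at the start of the thread; `relative_spread` is a hypothetical helper, and the choice of metric is an assumption rather than the criterion actually used for those labels.

```python
# Iterations 1..10 (warm-up iteration 0 excluded), copied from the logs.
HALF_NAIVE = [0.506768, 0.504923, 0.505440, 0.505335, 0.505887,
              0.504761, 0.505133, 0.505742, 0.506590, 0.504916]
FLOAT_NAIVE = [0.574513, 0.624397, 0.575214, 0.578982, 0.575077,
               0.608436, 0.575884, 0.624576, 0.594593, 0.576112]


def relative_spread(timings):
    """(max - min) / mean: 0.0 means every iteration took the same time."""
    mean = sum(timings) / len(timings)
    return (max(timings) - min(timings)) / mean


for name, timings in (("half", HALF_NAIVE), ("float", FLOAT_NAIVE)):
    print(f"{name}: {relative_spread(timings):.1%}")
```

On these samples the half run varies by about 0.4% and the float run by about 8.5%.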
============== INIT =============
>>> [INFO] ELEM_TYPE_STR: double, sizeof(ELEM_TYPE): 8
>>> [INFO] CL_ELEM_TYPE_STR: double, sizeof(CL_ELEM_TYPE): 8
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 8388608, data_size_b: 8388608, data_size_c: 8388608
============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0 62.132808
>>> [INFO] skip first 1 time(s)
1 62.124912
2 62.114249
3 62.107852
4 62.124306
5 62.125422
6 62.110446
7 62.129060
8 62.106102
9 63.081955
10 65.068961
>>> [INFO] CPU 1024x1024x1024 62.509327 s 0.034355 GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0 0.758725
>>> [INFO] skip first 1 time(s)
1 0.759127
2 0.752441
3 0.737415
4 0.738173
5 0.736296
6 0.790745
7 0.796274
8 0.735750
9 0.741224
10 0.754608
gflops: 2147483648.000000
ave_duration: 0.754205
>>> [INFO] CL_GPU 1024x1024x1024 0.754205 s 2.847346 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float2, sizeof(CL_ELEM_TYPE): 8
>>> [WARN] ELEM_TYPE(float) size differs from CL_ELEM_TYPE(float2)
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 32.072621
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0 0.745514
>>> [INFO] skip first 1 time(s)
1 0.664003
2 0.682694
3 0.705091
4 0.664341
5 0.665731
6 0.665375
7 0.663946
8 0.665404
9 0.704201
10 0.678438
11 0.699856
12 0.740078
13 0.664164
14 0.681739
15 0.662163
16 0.664076
17 0.664025
18 0.669523
19 0.690585
20 0.665842
>>> [INFO] CL_GPU 1024x1024x1024 0.678064 s 3.167082 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
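The `[WARN]` line in this run reflects that an OpenCL `float2` packs two 4-byte floats into 8 bytes, so the same buffer holds half as many vector elements as scalars. A small sketch of that bookkeeping follows; `vector_layout` is a hypothetical helper, and how `mat_mult_vec2` actually indexes its buffers is not shown in this thread.

```python
SCALAR_BYTES = {"half": 2, "int": 4, "float": 4, "double": 8}
VECTOR_WIDTHS = (1, 2, 4, 8, 16)  # scalar plus the vector widths used in these runs


def vector_layout(scalar_type, width, buffer_bytes):
    """Return (bytes per vector element, number of vector elements in the buffer)."""
    if width not in VECTOR_WIDTHS:
        raise ValueError(f"unsupported vector width: {width}")
    elem_bytes = SCALAR_BYTES[scalar_type] * width
    if buffer_bytes % elem_bytes:
        raise ValueError("buffer size is not a multiple of the vector element size")
    return elem_bytes, buffer_bytes // elem_bytes


# data_size_a from the log: 1048576 floats = 4194304 bytes
print(vector_layout("float", 2, 4194304))  # (8, 524288)
```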
$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int2, sizeof(CL_ELEM_TYPE): 8
>>> [WARN] ELEM_TYPE(int) size differs from CL_ELEM_TYPE(int2)
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 42.547083
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0 0.729240
>>> [INFO] skip first 1 time(s)
1 0.663346
2 0.663381
3 0.665677
4 0.688709
5 0.667141
6 0.675088
7 0.668971
8 0.677944
9 0.677716
10 0.662854
11 0.683053
12 0.663949
13 0.707164
14 0.664728
15 0.667829
16 0.669991
17 0.661924
18 0.691719
19 0.660779
20 0.726561
>>> [INFO] CL_GPU 1024x1024x1024 0.675426 s 3.179450 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half2, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 21.096589
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0 0.697708
>>> [INFO] skip first 1 time(s)
1 0.598590
2 0.598986
3 0.599458
4 0.598561
5 0.598886
6 0.600000
7 0.600150
8 0.603970
9 0.600956
10 0.601270
11 0.597891
12 0.598698
13 0.598759
14 0.599476
15 0.598256
16 0.599068
17 0.598711
18 0.599195
19 0.599339
20 0.599145
>>> [INFO] CL_GPU 1024x1024x1024 0.599468 s 3.582314 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half4, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 24.388975
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0 0.785168
>>> [INFO] skip first 1 time(s)
1 0.785488
2 0.785155
3 0.784971
4 0.785315
5 0.784797
6 0.784791
7 0.785390
8 0.785067
9 0.784999
10 0.784745
11 0.786073
12 0.784571
13 0.786385
14 0.784829
15 0.785454
16 0.784674
17 0.784963
18 0.786262
19 0.784866
20 0.785740
>>> [INFO] CL_GPU 1024x1024x1024 0.785227 s 2.734858 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int4, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 50.925559
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0 1.227366
>>> [INFO] skip first 1 time(s)
1 1.232017
2 1.230932
3 1.227194
4 1.231709
5 1.232883
6 1.229543
7 1.228786
8 1.229696
9 1.227157
10 1.227646
11 1.227292
12 1.227880
13 1.228065
14 1.231296
15 1.226779
16 1.227354
17 1.227589
18 1.227081
19 1.229803
20 1.231007
>>> [INFO] CL_GPU 1024x1024x1024 1.229085 s 1.747221 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float4, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 33.002391
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0 1.252484
>>> [INFO] skip first 1 time(s)
1 1.250876
2 1.249836
3 1.249639
4 1.224961
5 1.238079
6 1.249810
7 1.250827
8 1.253317
9 1.250303
10 1.248669
11 1.250466
12 1.249963
13 1.249999
14 1.249263
15 1.248959
16 1.250244
17 1.249647
18 1.249784
19 1.249077
20 1.248922
>>> [INFO] CL_GPU 1024x1024x1024 1.248132 s 1.720558 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int8, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 48.183118
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0 1.245643
>>> [INFO] skip first 1 time(s)
1 1.247249
2 1.246787
3 1.247569
4 1.247150
5 1.246996
6 1.248807
7 1.246896
8 1.247675
9 1.248380
10 1.245348
11 1.249068
12 1.248908
13 1.248970
14 1.247565
15 1.251570
16 1.248810
17 1.250007
18 1.247615
19 1.248424
20 1.248822
>>> [INFO] CL_GPU 1024x1024x1024 1.248131 s 1.720560 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half8, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 21.785163
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0 1.064921
>>> [INFO] skip first 1 time(s)
1 1.060214
2 1.063404
3 1.066602
4 1.067873
5 1.063292
6 1.058918
7 1.064117
8 1.061240
9 1.066587
10 1.062076
11 1.065420
12 1.063350
13 1.059808
14 1.066744
15 1.065193
16 1.062735
17 1.067862
18 1.066636
19 1.063445
20 1.067638
>>> [INFO] CL_GPU 1024x1024x1024 1.064158 s 2.018013 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float8, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 32.317702
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0 1.227704
>>> [INFO] skip first 1 time(s)
1 1.225342
2 1.226772
3 1.223779
4 1.225490
5 1.226510
6 1.226443
7 1.227413
8 1.225230
9 1.222004
10 1.226666
11 1.225248
12 1.224709
13 1.223577
14 1.224604
15 1.225972
16 1.225790
17 1.225214
18 1.224669
19 1.225131
20 1.225476
>>> [INFO] CL_GPU 1024x1024x1024 1.225302 s 1.752616 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float16, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 32.313083
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0 1.584943
>>> [INFO] skip first 1 time(s)
1 1.570398
2 1.582975
3 1.572269
4 1.585434
5 1.571658
6 1.585436
7 1.572811
8 1.583057
9 1.572311
10 1.585094
11 1.573029
12 1.588597
13 1.572048
14 1.585594
15 1.571509
16 1.588572
17 1.571195
18 1.588734
19 1.578679
20 1.586458
>>> [INFO] CL_GPU 1024x1024x1024 1.579293 s 1.359775 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half16, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 21.933312
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0 1.113329
>>> [INFO] skip first 1 time(s)
1 1.109171
2 1.112040
3 1.118125
4 1.108358
5 1.112967
6 1.115943
7 1.109845
8 1.109008
9 1.110358
10 1.110518
11 1.105643
12 1.113373
13 1.110546
14 1.109875
15 1.106689
16 1.116894
17 1.114391
18 1.115844
19 1.108951
20 1.108160
>>> [INFO] CL_GPU 1024x1024x1024 1.111335 s 1.932346 GFLOPS
$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int16, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 44.288010
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0 1.560015
>>> [INFO] skip first 1 time(s)
1 1.549140
2 1.556859
3 1.546036
4 1.557132
5 1.550042
6 1.560190
7 1.551438
8 1.563459
9 1.548408
10 1.557484
11 1.551499
12 1.558829
13 1.551887
14 1.558873
15 1.549608
16 1.557619
17 1.553162
18 1.558806
19 1.553744
20 1.558716
>>> [INFO] CL_GPU 1024x1024x1024 1.554647 s 1.381332 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
Summary for the 1024 x 1024 x 1024 matrix multiplication, in the case where the global work size equals the task size (average GPU time in seconds):
type | scalar | vec2 | vec4 | vec8 | vec16 |
---|---|---|---|---|---|
int | 0.587161 | 0.675426 | 1.229085 | 1.248131 | 1.554647 |
float | 0.590778 | 0.678064 | 1.248132 | 1.225302 | 1.579293 |
half | 0.505549 | 0.599468 | 0.785227 | 1.064158 | 1.111335 |
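To compare the variants on throughput rather than time, the timings above can be converted to GFLOPS and to a slowdown relative to the scalar kernel. The sketch below assumes the 2·M·N·K operation count, which matches the GFLOPS values printed in the logs; the dictionary layout and function names are illustrative only.

```python
FLOPS = 2 * 1024 * 1024 * 1024  # one multiply and one add per inner-loop term

# Average GPU seconds per run, keyed by scalar type, then by vector width.
TIMINGS = {
    "int":   {1: 0.587161, 2: 0.675426, 4: 1.229085, 8: 1.248131, 16: 1.554647},
    "float": {1: 0.590778, 2: 0.678064, 4: 1.248132, 8: 1.225302, 16: 1.579293},
    "half":  {1: 0.505549, 2: 0.599468, 4: 0.785227, 8: 1.064158, 16: 1.111335},
}


def to_gflops(seconds):
    """Convert one run's duration into GFLOPS for the fixed 1024^3 problem."""
    return FLOPS / seconds / 1e9


def slowdown_vs_scalar(by_width):
    """Time of each vector width divided by the scalar (width 1) time."""
    scalar = by_width[1]
    return {width: seconds / scalar for width, seconds in by_width.items()}


for elem_type, by_width in TIMINGS.items():
    ratios = slowdown_vs_scalar(by_width)
    row = ", ".join(
        f"x{width}: {to_gflops(t):.2f} GFLOPS ({ratios[width]:.2f}x)"
        for width, t in by_width.items()
    )
    print(f"{elem_type:5s} {row}")
```

In every row the scalar kernel is the fastest, and the 16-wide variants take roughly 2.2x to 2.7x as long.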
NV Card
Temporary benchmark on an NVIDIA card
M: 1024; N: 1024; K: 1024
Naive Kernel