ysh329 / OpenCL-101

Learn OpenCL step by step.

Benchmark for various types (floatN, intN, halfN, doubleN, shortN) using a naive implementation #11

Closed ysh329 closed 6 years ago

ysh329 commented 6 years ago

NV Card

Temporarily benchmarking on an NV card, using this naive kernel:

__kernel void mat_mult_naive(const int M, const int N, const int K,
                             __global const CL_INPUT_TYPE *a,
                             __global const CL_INPUT_TYPE *b,
                             __global CL_INPUT_TYPE *c) {
    // One work-item computes one element of c (row-major; a: M x K, b: K x N, c: M x N).
    const int col = get_global_id(0);
    const int row = get_global_id(1);
    CL_ELEM_TYPE res = 0;

    for (int p = 0; p < K; p++) {
        // The row stride of a is K (its column count), not M.
        res += a[row * K + p] * b[p * N + col];
    }
    c[row * N + col] = res;
}
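
For checking the kernel's indexing on the host, here is a minimal CPU sketch of the same loop in plain C. It assumes row-major storage with a as M x K, b as K x N and c as M x N (the thread does not state the layout), and the name mat_mult_naive_cpu is made up for illustration. For the square 1024 case the row stride of a is the same either way, so the small non-square test in the note below is where a wrong stride would show up.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for a naive matrix multiplication.
 * Assumed layout (not stated in the thread): row-major,
 * a is M x K, b is K x N, c is M x N. */
void mat_mult_naive_cpu(int M, int N, int K,
                        const float *a, const float *b, float *c) {
    assert(M > 0 && N > 0 && K > 0);
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < N; col++) {
            float res = 0.0f;
            for (int p = 0; p < K; p++) {
                /* a has K columns, so its row stride is K. */
                res += a[(size_t)row * K + p] * b[(size_t)p * N + col];
            }
            c[(size_t)row * N + col] = res;
        }
    }
}
```

Calling it with M=2, N=2, K=3 on a = {1,2,3,4,5,6} and b = {7,8,9,10,11,12} should give c = {58, 64, 139, 154}.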
ysh329 commented 6 years ago

Mali-T860

Before benchmarking, set the GPU governor to performance mode first:

$ sudo ../tools/ck-print-gpu-freq
*** Current GPU frequency:
800000000
*** Min frequency:
200000000
*** Max frequency:
800000000
*** Available GPU frequencies:
200000000 297000000 400000000 500000000 594000000 800000000
*** Current GPU governor:
performance
*** Available GPU governor:
userspace powersave performance simple_ondemand

Naive kernel

1024 x 1024 x 1024

| type | execution time (seconds) | GFLOPS |
|---|---|---|
| int (32-bit) | 0.5871 | 3.657402 |
| float (32-bit) | 0.5907 | 3.635007 |
| double (64-bit) | 0.7542 | 2.847346 |
| half (16-bit) | 0.5055 | 4.247821 |
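
The GFLOPS column appears to follow from the execution time by counting 2 operations (one multiply, one add) per inner-loop step, i.e. 2 x M x N x K = 2147483648 for this size, which is the raw number the logs print as "gflops:". A minimal sketch of that arithmetic, with matmul_gflops as a made-up helper name (for the int runs the figure is really integer operations per second, not floating-point):

```c
#include <assert.h>

/* Throughput of an M x N x K matrix multiplication that took `seconds`.
 * Assumes 2 operations (multiply + add) per inner-loop step. */
double matmul_gflops(int M, int N, int K, double seconds) {
    assert(seconds > 0.0);
    double ops = 2.0 * (double)M * (double)N * (double)K;
    return ops / seconds / 1e9;
}
```

With the float average of 0.590778 s this gives about 3.635 GFLOPS, in line with the table.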

float

Stable: 0.590778 seconds on average, corresponding to 3.635007 GFLOPS.

============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0        32.135118
>>> [INFO] skip first 1 time(s)
1        31.683712
2        31.796608
3        32.048721
4        31.663059
5        32.179186
6        31.730488
7        32.137520
8        31.663593
9        31.636981
10       31.600799
>>> [INFO] CPU 1024x1024x1024 31.814067 s 0.067501 GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0        0.578418
>>> [INFO] skip first 1 time(s)
1        0.574513
2        0.624397
3        0.575214
4        0.578982
5        0.575077
6        0.608436
7        0.575884
8        0.624576
9        0.594593
10       0.576112
gflops: 2147483648.000000 
ave_duration: 0.590778
>>> [INFO] CL_GPU 1024x1024x1024 0.590778 s 3.635007 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
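
The reported ave_duration matches the mean of runs 1 to 10, with run 0 dropped as a warm-up. A minimal sketch of that averaging, assuming the durations are collected in an array; mean_after_warmup is a hypothetical name, and returning -1.0 when nothing is left to average is my own choice (the runs further down that request 0 timed CPU iterations print nan instead):

```c
#include <assert.h>
#include <stddef.h>

/* Mean of durations[skip .. count-1].
 * Returns -1.0 when no timed run remains after skipping,
 * instead of dividing by zero. */
double mean_after_warmup(const double *durations, size_t count, size_t skip) {
    assert(durations != NULL || count == 0);
    if (count <= skip) {
        return -1.0;
    }
    double sum = 0.0;
    for (size_t i = skip; i < count; i++) {
        sum += durations[i];
    }
    return sum / (double)(count - skip);
}
```

Feeding it the 11 GPU durations above with skip = 1 gives about 0.590778 s.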

int

Not stable: across runs the naive matrix multiplication takes 0.58~0.77 seconds, corresponding to 2.9~3.5 GFLOPS. The run below averaged 0.587161 seconds (3.657402 GFLOPS).

$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_naive 5 1024 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 5 times CPU starting...
0        49.456371
>>> [INFO] skip first 1 time(s)
1        49.445682
2        51.025633
3        49.424161
4        50.706564
5        50.874110
>>> [INFO] CPU 1024x1024x1024 50.295230 s 0.042698 GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 5 times ./kernel.cl.mat_mult_naive starting ...
0        0.666985
>>> [INFO] skip first 1 time(s)
1        0.628585
2        0.577195
3        0.577648
4        0.576896
5        0.575481
gflops: 2147483648.000000 
ave_duration: 0.587161
>>> [INFO] CL_GPU 1024x1024x1024 0.587161 s 3.657402 GFLOPS

>>> [TEST] correct rate: 1.0000
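
To put a number on "stable" versus "not stable", one option is to report min, max and the relative spread of the timed runs next to the mean. This is only a sketch of that idea, not something the benchmark program is known to do; run_stats, summarize_runs and relative_spread are made-up names:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double min;
    double max;
    double mean;
} run_stats;

/* Min, max and mean over n timed runs (warm-up run already excluded). */
run_stats summarize_runs(const double *d, size_t n) {
    assert(d != NULL && n > 0);
    run_stats s = { d[0], d[0], 0.0 };
    for (size_t i = 0; i < n; i++) {
        if (d[i] < s.min) s.min = d[i];
        if (d[i] > s.max) s.max = d[i];
        s.mean += d[i];
    }
    s.mean /= (double)n;
    return s;
}

/* (max - min) / mean: 0 means every run took the same time. */
double relative_spread(run_stats s) {
    return (s.max - s.min) / s.mean;
}
```

For the five timed int runs above (0.628585 down to 0.575481) the spread comes out at roughly 9% of the mean.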

half

Stable: about 0.505 seconds per run; the run below averaged 0.505549 seconds (4.247821 GFLOPS).

$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_naive 10 1024 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152

============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0        22.410335
>>> [INFO] skip first 1 time(s)
1        22.575216
2        22.340271
3        22.540361
4        22.486432
5        22.455014
6        22.280294
7        22.382801
8        22.492505
9        22.183580
10       22.428460
>>> [INFO] CPU 1024x1024x1024 22.416493 s 0.095799 GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0        0.511446
>>> [INFO] skip first 1 time(s)
1        0.506768
2        0.504923
3        0.505440
4        0.505335
5        0.505887
6        0.504761
7        0.505133
8        0.505742
9        0.506590
10       0.504916
gflops: 2147483648.000000 
ave_duration: 0.505549
>>> [INFO] CL_GPU 1024x1024x1024 0.505549 s 4.247821 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
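
The "correct rate" line compares the CPU and GPU results. How the program computes it is not shown in this thread; one plausible reading is the fraction of elements that agree within a tolerance, which matters most for half, where rounding differences between the two implementations are likely. A sketch under that assumption, with correct_rate and tol as hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of elements where |ref[i] - out[i]| <= tol.
 * The tolerance-based comparison is an assumption, not the
 * benchmark's confirmed behaviour. */
double correct_rate(const float *ref, const float *out, size_t len, float tol) {
    assert(ref != NULL && out != NULL && len > 0 && tol >= 0.0f);
    size_t ok = 0;
    for (size_t i = 0; i < len; i++) {
        float diff = ref[i] - out[i];
        if (diff < 0.0f) {
            diff = -diff;
        }
        if (diff <= tol) {
            ok++;
        }
    }
    return (double)ok / (double)len;
}
```

A rate of 1.0000 then means every element of the GPU result is within tol of the CPU result.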

double

============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: double, sizeof(ELEM_TYPE): 8
>>> [INFO] CL_ELEM_TYPE_STR: double, sizeof(CL_ELEM_TYPE): 8
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 8388608, data_size_b: 8388608, data_size_c: 8388608

============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0        62.132808
>>> [INFO] skip first 1 time(s)
1        62.124912
2        62.114249
3        62.107852
4        62.124306
5        62.125422
6        62.110446
7        62.129060
8        62.106102
9        63.081955
10       65.068961
>>> [INFO] CPU 1024x1024x1024 62.509327 s 0.034355 GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0        0.758725
>>> [INFO] skip first 1 time(s)
1        0.759127
2        0.752441
3        0.737415
4        0.738173
5        0.736296
6        0.790745
7        0.796274
8        0.735750
9        0.741224
10       0.754608
gflops: 2147483648.000000 
ave_duration: 0.754205
>>> [INFO] CL_GPU 1024x1024x1024 0.754205 s 2.847346 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
ysh329 commented 6 years ago

float2

$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float2, sizeof(CL_ELEM_TYPE): 8
>>> [WARN] ELEM_TYPE(float) size differs from CL_ELEM_TYPE(float2)
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        32.072621
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0        0.745514
>>> [INFO] skip first 1 time(s)
1        0.664003
2        0.682694
3        0.705091
4        0.664341
5        0.665731
6        0.665375
7        0.663946
8        0.665404
9        0.704201
10       0.678438
11       0.699856
12       0.740078
13       0.664164
14       0.681739
15       0.662163
16       0.664076
17       0.664025
18       0.669523
19       0.690585
20       0.665842
>>> [INFO] CL_GPU 1024x1024x1024 0.678064 s 3.167082 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

int2

$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int2, sizeof(CL_ELEM_TYPE): 8
>>> [WARN] ELEM_TYPE(int) size differs from CL_ELEM_TYPE(int2)
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        42.547083
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0        0.729240
>>> [INFO] skip first 1 time(s)
1        0.663346
2        0.663381
3        0.665677
4        0.688709
5        0.667141
6        0.675088
7        0.668971
8        0.677944
9        0.677716
10       0.662854
11       0.683053
12       0.663949
13       0.707164
14       0.664728
15       0.667829
16       0.669991
17       0.661924
18       0.691719
19       0.660779
20       0.726561
>>> [INFO] CL_GPU 1024x1024x1024 0.675426 s 3.179450 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

half2

$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half2, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        21.096589
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0        0.697708
>>> [INFO] skip first 1 time(s)
1        0.598590
2        0.598986
3        0.599458
4        0.598561
5        0.598886
6        0.600000
7        0.600150
8        0.603970
9        0.600956
10       0.601270
11       0.597891
12       0.598698
13       0.598759
14       0.599476
15       0.598256
16       0.599068
17       0.598711
18       0.599195
19       0.599339
20       0.599145
>>> [INFO] CL_GPU 1024x1024x1024 0.599468 s 3.582314 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
ysh329 commented 6 years ago

half4

$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half4, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        24.388975
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0        0.785168
>>> [INFO] skip first 1 time(s)
1        0.785488
2        0.785155
3        0.784971
4        0.785315
5        0.784797
6        0.784791
7        0.785390
8        0.785067
9        0.784999
10       0.784745
11       0.786073
12       0.784571
13       0.786385
14       0.784829
15       0.785454
16       0.784674
17       0.784963
18       0.786262
19       0.784866
20       0.785740
>>> [INFO] CL_GPU 1024x1024x1024 0.785227 s 2.734858 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

int4

$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int4, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        50.925559
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0        1.227366
>>> [INFO] skip first 1 time(s)
1        1.232017
2        1.230932
3        1.227194
4        1.231709
5        1.232883
6        1.229543
7        1.228786
8        1.229696
9        1.227157
10       1.227646
11       1.227292
12       1.227880
13       1.228065
14       1.231296
15       1.226779
16       1.227354
17       1.227589
18       1.227081
19       1.229803
20       1.231007
>>> [INFO] CL_GPU 1024x1024x1024 1.229085 s 1.747221 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

float4

$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float4, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        33.002391
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0        1.252484
>>> [INFO] skip first 1 time(s)
1        1.250876
2        1.249836
3        1.249639
4        1.224961
5        1.238079
6        1.249810
7        1.250827
8        1.253317
9        1.250303
10       1.248669
11       1.250466
12       1.249963
13       1.249999
14       1.249263
15       1.248959
16       1.250244
17       1.249647
18       1.249784
19       1.249077
20       1.248922
>>> [INFO] CL_GPU 1024x1024x1024 1.248132 s 1.720558 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
ysh329 commented 6 years ago

int8

$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int8, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        48.183118
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0        1.245643
>>> [INFO] skip first 1 time(s)
1        1.247249
2        1.246787
3        1.247569
4        1.247150
5        1.246996
6        1.248807
7        1.246896
8        1.247675
9        1.248380
10       1.245348
11       1.249068
12       1.248908
13       1.248970
14       1.247565
15       1.251570
16       1.248810
17       1.250007
18       1.247615
19       1.248424
20       1.248822
>>> [INFO] CL_GPU 1024x1024x1024 1.248131 s 1.720560 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

half8

$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half8, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        21.785163
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0        1.064921
>>> [INFO] skip first 1 time(s)
1        1.060214
2        1.063404
3        1.066602
4        1.067873
5        1.063292
6        1.058918
7        1.064117
8        1.061240
9        1.066587
10       1.062076
11       1.065420
12       1.063350
13       1.059808
14       1.066744
15       1.065193
16       1.062735
17       1.067862
18       1.066636
19       1.063445
20       1.067638
>>> [INFO] CL_GPU 1024x1024x1024 1.064158 s 2.018013 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

float8

$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float8, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        32.317702
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0        1.227704
>>> [INFO] skip first 1 time(s)
1        1.225342
2        1.226772
3        1.223779
4        1.225490
5        1.226510
6        1.226443
7        1.227413
8        1.225230
9        1.222004
10       1.226666
11       1.225248
12       1.224709
13       1.223577
14       1.224604
15       1.225972
16       1.225790
17       1.225214
18       1.224669
19       1.225131
20       1.225476
>>> [INFO] CL_GPU 1024x1024x1024 1.225302 s 1.752616 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
ysh329 commented 6 years ago

float16

$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float16, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        32.313083
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0        1.584943
>>> [INFO] skip first 1 time(s)
1        1.570398
2        1.582975
3        1.572269
4        1.585434
5        1.571658
6        1.585436
7        1.572811
8        1.583057
9        1.572311
10       1.585094
11       1.573029
12       1.588597
13       1.572048
14       1.585594
15       1.571509
16       1.588572
17       1.571195
18       1.588734
19       1.578679
20       1.586458
>>> [INFO] CL_GPU 1024x1024x1024 1.579293 s 1.359775 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

half16

$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half16, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        21.933312
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0        1.113329
>>> [INFO] skip first 1 time(s)
1        1.109171
2        1.112040
3        1.118125
4        1.108358
5        1.112967
6        1.115943
7        1.109845
8        1.109008
9        1.110358
10       1.110518
11       1.105643
12       1.113373
13       1.110546
14       1.109875
15       1.106689
16       1.116894
17       1.114391
18       1.115844
19       1.108951
20       1.108160
>>> [INFO] CL_GPU 1024x1024x1024 1.111335 s 1.932346 GFLOPS

int16

$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int16, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        44.288010
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0        1.560015
>>> [INFO] skip first 1 time(s)
1        1.549140
2        1.556859
3        1.546036
4        1.557132
5        1.550042
6        1.560190
7        1.551438
8        1.563459
9        1.548408
10       1.557484
11       1.551499
12       1.558829
13       1.551887
14       1.558873
15       1.549608
16       1.557619
17       1.553162
18       1.558806
19       1.553744
20       1.558716
>>> [INFO] CL_GPU 1024x1024x1024 1.554647 s 1.381332 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
ysh329 commented 6 years ago

Summary for the 1024 x 1024 x 1024 matrix multiplication, with the global work size equal to the task size (average execution time in seconds):

| type | scalar | vec2 | vec4 | vec8 | vec16 |
|---|---|---|---|---|---|
| int | 0.587161 | 0.675426 | 1.229085 | 1.248131 | 1.554647 |
| float | 0.590778 | 0.678064 | 1.248132 | 1.225302 | 1.579293 |
| half | 0.505549 | 0.599468 | 0.785227 | 1.064158 | 1.111335 |
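
To read these numbers as a trend, it helps to divide each time by the scalar time of the same element type. The sketch below does only that arithmetic on the values copied from the summary; the array names and slowdown_vs_scalar are made up for illustration, and nothing here explains why the wider vector types are slower.

```c
#include <assert.h>

/* Average seconds from the summary, in the order
 * scalar, vec2, vec4, vec8, vec16. */
static const double INT_SEC[5]   = {0.587161, 0.675426, 1.229085, 1.248131, 1.554647};
static const double FLOAT_SEC[5] = {0.590778, 0.678064, 1.248132, 1.225302, 1.579293};
static const double HALF_SEC[5]  = {0.505549, 0.599468, 0.785227, 1.064158, 1.111335};

/* How many times slower a vector width is than the scalar kernel.
 * width_index: 0 = scalar, 1 = vec2, 2 = vec4, 3 = vec8, 4 = vec16. */
double slowdown_vs_scalar(const double sec[5], int width_index) {
    assert(width_index >= 0 && width_index < 5);
    return sec[width_index] / sec[0];
}
```

For example, float16 comes out about 2.67 times slower than scalar float, and half16 about 2.2 times slower than scalar half.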