ysh329 commented 6 years ago

NV Card

Temporary benchmark on an NVIDIA card, for reference.

Machine: linux-W580-G20, x86_64-linux-gnu
Device: TITAN X (Pascal), 12189MiB
M: 1024; N: 1024; K: 1024

Naive Kernel
float; 1024 1024 1024; 10; 353.967207 GFLOPS
int; 1024 1024 1024; 10; 356.647841 GFLOPS

__kernel void mat_mult_naive(const int M, const int N, const int K,
                             __global const CL_INPUT_TYPE *a,
                             __global const CL_INPUT_TYPE *b,
                             __global CL_INPUT_TYPE *c) {
    // One work-item computes one output element c[row][col].
    const int col = get_global_id(0);
    const int row = get_global_id(1);
    CL_ELEM_TYPE res = 0;

    // Assumes row-major a (M x K) and b (K x N): the row stride of a is K.
    for (int p = 0; p < K; p++) {
        res += a[row * K + p] * b[p * N + col];
    }
    c[row * N + col] = res;
}
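As a host-side cross-check, the kernel's indexing can be mirrored in plain Python, with one (row, col) pair playing the role of one work-item. This is only a sketch: the name `mat_mult_naive_ref` is made up here, and it assumes `a` is M x K, `b` is K x N and `c` is M x N, all flat and row-major, which the thread does not state explicitly. For the square 1024 case benchmarked here all strides are equal anyway.

```python
def mat_mult_naive_ref(M, N, K, a, b):
    """Reference for the naive kernel on flat row-major lists.

    Assumed layout (not stated in the thread): a is M x K, b is K x N,
    and the returned c is M x N.
    """
    c = [0] * (M * N)
    for row in range(M):          # plays the role of get_global_id(1)
        for col in range(N):      # plays the role of get_global_id(0)
            res = 0
            for p in range(K):
                res += a[row * K + p] * b[p * N + col]
            c[row * N + col] = res
    return c


# Small non-square example: (2 x 3) times (3 x 2).
a = [1, 2, 3,
     4, 5, 6]
b = [7, 8,
     9, 10,
     11, 12]
c = mat_mult_naive_ref(2, 2, 3, a, b)
print(c)
```

A non-square case like this is worth keeping around, because an incorrect row stride goes unnoticed when M = N = K.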

ysh329 commented 6 years ago

Mali-T860

Before benchmarking, set the GPU governor to performance mode and check the result:

$ sudo ../tools/ck-print-gpu-freq
*** Current GPU frequency:
800000000
*** Min frequency:
200000000
*** Max frequency:
800000000
*** Available GPU frequencies:
200000000 297000000 400000000 500000000 594000000 800000000
*** Current GPU governor:
performance
*** Available GPU governor:
userspace powersave performance simple_ondemand
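To avoid eyeballing this before every run, the report can be checked mechanically. The sketch below only assumes the text layout pasted above (a `*** Label:` line followed by a value line); the helper names `parse_gpu_freq_report` and `is_pinned_to_max` are hypothetical and not part of the tool.

```python
def parse_gpu_freq_report(text):
    """Turn '*** Label:' / value line pairs into a dict keyed by label."""
    report = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("***"):
            # "*** Current GPU governor:" -> "Current GPU governor"
            label = line.strip("* ").rstrip(":")
        elif line and label is not None:
            report[label] = line
            label = None
    return report


def is_pinned_to_max(report):
    """True when the governor is 'performance' and the clock sits at max."""
    return (report.get("Current GPU governor") == "performance"
            and int(report["Current GPU frequency"]) == int(report["Max frequency"]))


# Report text copied from the output above.
sample = """*** Current GPU frequency:
800000000
*** Min frequency:
200000000
*** Max frequency:
800000000
*** Available GPU frequencies:
200000000 297000000 400000000 500000000 594000000 800000000
*** Current GPU governor:
performance
*** Available GPU governor:
userspace powersave performance simple_ondemand
"""
report = parse_gpu_freq_report(sample)
print(is_pinned_to_max(report))
```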

Naive kernel

1024 x 1024 x 1024

type	execution time (s)	GFLOPS
int (32-bit)	0.5871	3.657402
float (32-bit)	0.5907	3.635007
double (64-bit)	0.7542	2.847346
half (16-bit)	0.5055	4.247821

float

Stable: the naive kernel takes 0.590778 seconds on average, which corresponds to 3.635007 GFLOPS.

============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0        32.135118
>>> [INFO] skip first 1 time(s)
1        31.683712
2        31.796608
3        32.048721
4        31.663059
5        32.179186
6        31.730488
7        32.137520
8        31.663593
9        31.636981
10       31.600799
>>> [INFO] CPU 1024x1024x1024 31.814067 s 0.067501 GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0        0.578418
>>> [INFO] skip first 1 time(s)
1        0.574513
2        0.624397
3        0.575214
4        0.578982
5        0.575077
6        0.608436
7        0.575884
8        0.624576
9        0.594593
10       0.576112
gflops: 2147483648.000000 
ave_duration: 0.590778
>>> [INFO] CL_GPU 1024x1024x1024 0.590778 s 3.635007 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
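The figures in this log can be reproduced by hand. The harness source is not shown in this thread, so the following is inferred from the numbers rather than taken from the code: the first run is discarded as warm-up, the remaining runs are averaged, and the operation count is 2·M·N·K, which is the 2147483648 printed on the `gflops:` line.

```python
def gflops(M, N, K, seconds):
    """Throughput assuming 2*M*N*K operations (one multiply and one add per term)."""
    return 2.0 * M * N * K / seconds / 1e9


# GPU timings copied from the float log above; index 0 is the warm-up run.
runs = [0.578418, 0.574513, 0.624397, 0.575214, 0.578982, 0.575077,
        0.608436, 0.575884, 0.624576, 0.594593, 0.576112]

measured = runs[1:]                       # "skip first 1 time(s)"
ave_duration = sum(measured) / len(measured)
rate = gflops(1024, 1024, 1024, ave_duration)
print(f"{ave_duration:.6f} s {rate:.6f} GFLOPS")
```

Up to rounding this lands on the logged 0.590778 s and 3.635007 GFLOPS, so the averaging rule above appears to match what the harness does.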

int

Not stable: across runs the naive kernel takes 0.58~0.77 seconds, corresponding to roughly 2.9~3.5 GFLOPS. The run logged below averaged 0.587161 seconds (3.657402 GFLOPS).

$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_naive 5 1024 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 5 times CPU starting...
0        49.456371
>>> [INFO] skip first 1 time(s)
1        49.445682
2        51.025633
3        49.424161
4        50.706564
5        50.874110
>>> [INFO] CPU 1024x1024x1024 50.295230 s 0.042698 GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 5 times ./kernel.cl.mat_mult_naive starting ...
0        0.666985
>>> [INFO] skip first 1 time(s)
1        0.628585
2        0.577195
3        0.577648
4        0.576896
5        0.575481
gflops: 2147483648.000000 
ave_duration: 0.587161
>>> [INFO] CL_GPU 1024x1024x1024 0.587161 s 3.657402 GFLOPS

>>> [TEST] correct rate: 1.0000
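One way to put a number on "not stable" is the relative spread of the measured runs. This is a sketch, not something the harness reports; `spread` is a hypothetical helper, and (max - min) / min is just one reasonable choice of metric.

```python
def spread(times):
    """Return (min, max, relative spread), where spread = (max - min) / min."""
    lo, hi = min(times), max(times)
    return lo, hi, (hi - lo) / lo


# Measured GPU runs 1..5 from the int log above (warm-up run 0 excluded).
int_runs = [0.628585, 0.577195, 0.577648, 0.576896, 0.575481]

lo, hi, rel = spread(int_runs)
mean = sum(int_runs) / len(int_runs)
print(f"min {lo:.6f} s, max {hi:.6f} s, mean {mean:.6f} s, spread {rel:.1%}")
```

With only five measured runs, a single slow iteration (run 1 here) dominates the spread, so a larger repeat count gives a fairer picture.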

half

Stable: the naive kernel takes 0.505549 seconds on average, which corresponds to 4.247821 GFLOPS.

$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_naive 10 1024 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152

============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0        22.410335
>>> [INFO] skip first 1 time(s)
1        22.575216
2        22.340271
3        22.540361
4        22.486432
5        22.455014
6        22.280294
7        22.382801
8        22.492505
9        22.183580
10       22.428460
>>> [INFO] CPU 1024x1024x1024 22.416493 s 0.095799 GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0        0.511446
>>> [INFO] skip first 1 time(s)
1        0.506768
2        0.504923
3        0.505440
4        0.505335
5        0.505887
6        0.504761
7        0.505133
8        0.505742
9        0.506590
10       0.504916
gflops: 2147483648.000000 
ave_duration: 0.505549
>>> [INFO] CL_GPU 1024x1024x1024 0.505549 s 4.247821 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

double

============== INIT =============
>>> [INFO] ELEM_TYPE_STR: double, sizeof(ELEM_TYPE): 8
>>> [INFO] CL_ELEM_TYPE_STR: double, sizeof(CL_ELEM_TYPE): 8
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 8388608, data_size_b: 8388608, data_size_c: 8388608

============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0        62.132808
>>> [INFO] skip first 1 time(s)
1        62.124912
2        62.114249
3        62.107852
4        62.124306
5        62.125422
6        62.110446
7        62.129060
8        62.106102
9        63.081955
10       65.068961
>>> [INFO] CPU 1024x1024x1024 62.509327 s 0.034355 GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0        0.758725
>>> [INFO] skip first 1 time(s)
1        0.759127
2        0.752441
3        0.737415
4        0.738173
5        0.736296
6        0.790745
7        0.796274
8        0.735750
9        0.741224
10       0.754608
gflops: 2147483648.000000 
ave_duration: 0.754205
>>> [INFO] CL_GPU 1024x1024x1024 0.754205 s 2.847346 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

ysh329 commented 6 years ago

float2

$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float2, sizeof(CL_ELEM_TYPE): 8
>>> [WARN] ELEM_TYPE(float) size differs from CL_ELEM_TYPE(float2)
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        32.072621
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0        0.745514
>>> [INFO] skip first 1 time(s)
1        0.664003
2        0.682694
3        0.705091
4        0.664341
5        0.665731
6        0.665375
7        0.663946
8        0.665404
9        0.704201
10       0.678438
11       0.699856
12       0.740078
13       0.664164
14       0.681739
15       0.662163
16       0.664076
17       0.664025
18       0.669523
19       0.690585
20       0.665842
>>> [INFO] CL_GPU 1024x1024x1024 0.678064 s 3.167082 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
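The `nan s nan GFLOPS` CPU line is what one would expect if the harness averages only the measured runs: with a CPU repeat count of 0, the single logged run is discarded as warm-up and the mean becomes 0.0 / 0.0. That reading is an assumption, since the harness code is not shown here. A small guard makes the case explicit (`mean_after_warmup` is a hypothetical helper name):

```python
import math


def mean_after_warmup(times, skip=1):
    """Average the runs after the first `skip`; return NaN when none remain."""
    measured = times[skip:]
    if not measured:
        return math.nan
    return sum(measured) / len(measured)


cpu_runs = [32.072621]          # only the warm-up run was executed
avg = mean_after_warmup(cpu_runs)
print("CPU result skipped" if math.isnan(avg) else f"{avg:.6f} s")
```

The GPU numbers are unaffected, because the GPU repeat count is 20 in these runs.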

int2

$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int2, sizeof(CL_ELEM_TYPE): 8
>>> [WARN] ELEM_TYPE(int) size differs from CL_ELEM_TYPE(int2)
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        42.547083
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0        0.729240
>>> [INFO] skip first 1 time(s)
1        0.663346
2        0.663381
3        0.665677
4        0.688709
5        0.667141
6        0.675088
7        0.668971
8        0.677944
9        0.677716
10       0.662854
11       0.683053
12       0.663949
13       0.707164
14       0.664728
15       0.667829
16       0.669991
17       0.661924
18       0.691719
19       0.660779
20       0.726561
>>> [INFO] CL_GPU 1024x1024x1024 0.675426 s 3.179450 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

half2

$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half2, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        21.096589
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0        0.697708
>>> [INFO] skip first 1 time(s)
1        0.598590
2        0.598986
3        0.599458
4        0.598561
5        0.598886
6        0.600000
7        0.600150
8        0.603970
9        0.600956
10       0.601270
11       0.597891
12       0.598698
13       0.598759
14       0.599476
15       0.598256
16       0.599068
17       0.598711
18       0.599195
19       0.599339
20       0.599145
>>> [INFO] CL_GPU 1024x1024x1024 0.599468 s 3.582314 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

ysh329 commented 6 years ago

half4

$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half4, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        24.388975
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0        0.785168
>>> [INFO] skip first 1 time(s)
1        0.785488
2        0.785155
3        0.784971
4        0.785315
5        0.784797
6        0.784791
7        0.785390
8        0.785067
9        0.784999
10       0.784745
11       0.786073
12       0.784571
13       0.786385
14       0.784829
15       0.785454
16       0.784674
17       0.784963
18       0.786262
19       0.784866
20       0.785740
>>> [INFO] CL_GPU 1024x1024x1024 0.785227 s 2.734858 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

int4

$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int4, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        50.925559
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0        1.227366
>>> [INFO] skip first 1 time(s)
1        1.232017
2        1.230932
3        1.227194
4        1.231709
5        1.232883
6        1.229543
7        1.228786
8        1.229696
9        1.227157
10       1.227646
11       1.227292
12       1.227880
13       1.228065
14       1.231296
15       1.226779
16       1.227354
17       1.227589
18       1.227081
19       1.229803
20       1.231007
>>> [INFO] CL_GPU 1024x1024x1024 1.229085 s 1.747221 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

float4

$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float4, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        33.002391
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0        1.252484
>>> [INFO] skip first 1 time(s)
1        1.250876
2        1.249836
3        1.249639
4        1.224961
5        1.238079
6        1.249810
7        1.250827
8        1.253317
9        1.250303
10       1.248669
11       1.250466
12       1.249963
13       1.249999
14       1.249263
15       1.248959
16       1.250244
17       1.249647
18       1.249784
19       1.249077
20       1.248922
>>> [INFO] CL_GPU 1024x1024x1024 1.248132 s 1.720558 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

ysh329 commented 6 years ago

int8

$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int8, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        48.183118
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0        1.245643
>>> [INFO] skip first 1 time(s)
1        1.247249
2        1.246787
3        1.247569
4        1.247150
5        1.246996
6        1.248807
7        1.246896
8        1.247675
9        1.248380
10       1.245348
11       1.249068
12       1.248908
13       1.248970
14       1.247565
15       1.251570
16       1.248810
17       1.250007
18       1.247615
19       1.248424
20       1.248822
>>> [INFO] CL_GPU 1024x1024x1024 1.248131 s 1.720560 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

half8

$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half8, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        21.785163
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0        1.064921
>>> [INFO] skip first 1 time(s)
1        1.060214
2        1.063404
3        1.066602
4        1.067873
5        1.063292
6        1.058918
7        1.064117
8        1.061240
9        1.066587
10       1.062076
11       1.065420
12       1.063350
13       1.059808
14       1.066744
15       1.065193
16       1.062735
17       1.067862
18       1.066636
19       1.063445
20       1.067638
>>> [INFO] CL_GPU 1024x1024x1024 1.064158 s 2.018013 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

float8

$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float8, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        32.317702
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0        1.227704
>>> [INFO] skip first 1 time(s)
1        1.225342
2        1.226772
3        1.223779
4        1.225490
5        1.226510
6        1.226443
7        1.227413
8        1.225230
9        1.222004
10       1.226666
11       1.225248
12       1.224709
13       1.223577
14       1.224604
15       1.225972
16       1.225790
17       1.225214
18       1.224669
19       1.225131
20       1.225476
>>> [INFO] CL_GPU 1024x1024x1024 1.225302 s 1.752616 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

ysh329 commented 6 years ago

float16

$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float16, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        32.313083
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0        1.584943
>>> [INFO] skip first 1 time(s)
1        1.570398
2        1.582975
3        1.572269
4        1.585434
5        1.571658
6        1.585436
7        1.572811
8        1.583057
9        1.572311
10       1.585094
11       1.573029
12       1.588597
13       1.572048
14       1.585594
15       1.571509
16       1.588572
17       1.571195
18       1.588734
19       1.578679
20       1.586458
>>> [INFO] CL_GPU 1024x1024x1024 1.579293 s 1.359775 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

half16

$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half16, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        21.933312
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0        1.113329
>>> [INFO] skip first 1 time(s)
1        1.109171
2        1.112040
3        1.118125
4        1.108358
5        1.112967
6        1.115943
7        1.109845
8        1.109008
9        1.110358
10       1.110518
11       1.105643
12       1.113373
13       1.110546
14       1.109875
15       1.106689
16       1.116894
17       1.114391
18       1.115844
19       1.108951
20       1.108160
>>> [INFO] CL_GPU 1024x1024x1024 1.111335 s 1.932346 GFLOPS

int16

$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int16, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        44.288010
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0         1.560015
>>> [INFO] skip first 1 time(s)
1        1.549140
2        1.556859
3        1.546036
4        1.557132
5        1.550042
6        1.560190
7        1.551438
8        1.563459
9        1.548408
10       1.557484
11       1.551499
12       1.558829
13       1.551887
14       1.558873
15       1.549608
16       1.557619
17       1.553162
18       1.558806
19       1.553744
20       1.558716
>>> [INFO] CL_GPU 1024x1024x1024 1.554647 s 1.381332 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

ysh329 commented 6 years ago

Summary for the 1024 x 1024 x 1024 matrix multiplication, with the global work size equal to the task size (1024 x 1024) for every vector width. Values are average GPU execution times in seconds:

type	scalar	vec2	vec4	vec8	vec16
int	0.587161	0.675426	1.229085	1.248131	1.554647
float	0.590778	0.678064	1.248132	1.225302	1.579293
half	0.505549	0.599468	0.785227	1.064158	1.111335
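To make the trend easier to read, the timings above can be expressed as slowdown factors relative to the scalar kernel of the same base type. This is a small sketch over the numbers already posted; `slowdowns` is a hypothetical helper name.

```python
# Average GPU seconds copied from the summary above, ordered as
# scalar, vec2, vec4, vec8, vec16.
timings = {
    "int":   [0.587161, 0.675426, 1.229085, 1.248131, 1.554647],
    "float": [0.590778, 0.678064, 1.248132, 1.225302, 1.579293],
    "half":  [0.505549, 0.599468, 0.785227, 1.064158, 1.111335],
}
widths = [1, 2, 4, 8, 16]


def slowdowns(times):
    """Ratio of each timing to the scalar (first) timing."""
    return [t / times[0] for t in times]


for name, times in timings.items():
    cells = "  ".join(f"x{w}: {r:.2f}" for w, r in zip(widths, slowdowns(times)))
    print(f"{name:5s} {cells}")
```

In every row the vector variants come out slower than the scalar kernel in this setup, where the global work size is not reduced for the wider types.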

ysh329 / OpenCL-101

benchmark for various type (floatN, intN, halfN, doubleN, shortN) using naive implementation #11
