ysh329 / OpenCL-101

Learn OpenCL step by step.
123 stars 31 forks source link

Bug of floatN/halfN support: fail to copy data from device to host #6

Closed ysh329 closed 6 years ago

ysh329 commented 6 years ago

Log

>>> [ERROR] failed to copy data from device to host.-14

I lookuped error code -14 from this link, saying:

ysh329 commented 6 years ago

This problem is directly related to input-variable-type of OCL kernel function.

float2

Problem Code

__kernel void global_bandwidth_vec2(const int heightA, const int widthA, __global const CL_ELEM_TYPE *a, __global CL_ELEM_TYPE *b) {
    const int idx = get_global_id(0);
    const int step = idx << 1;

    CL_ELEM_TYPE value = *((__global CL_ELEM_TYPE *)(a + step));
    *((__global CL_ELEM_TYPE *)(b + step)) = value;
}

Execution log:

yuanshuai@firefly:~/code/OpenCL-101/bandwidth$ ./bandwidth 1024 1024 ./kernel.cl global_bandwidth_vec2 1 $[1024*1024/2] 1 1
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float2, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] cl_program_build_options: -D CL_ELEM_TYPE=float2
len: 1048576, data_size: 4194304, a_h: 0x7f84170010 a_h+1: 0x7f84170011 
============== CPU RESULT ==============
>>> [INFO] 1 times CPU starting...
>>> [INFO] CPU 1024 x 1024 0.001977 s 1060.774911 MFLOPS

>>> [INFO] bandwidth: 3.95 GB/s
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

============== GPU RESULT ==============
Mali-T86x MP4 r2p0 0x0860
>>> [INFO]program_file: ./kernel.cl
>>> [INFO] kernel_func: global_bandwidth_vec2
>>> [INFO] global_work_size[3]: { 524288, 1, 1 }
>>> [WARN] global work size (524288) is smaller than task size (1048576).

>>> [INFO] 1 times ./kernel.cl.global_bandwidth_vec2 starting...
>>> [INFO] skip first time.
>>> [INFO] CL_GPU 1024 x 1024 0.001234 s 1699.474878 MFLOPS ./kernel.cl

>>> [INFO] bandwidth: 6.33 GB/s
>>> [ERROR] failed to copy data from device to host.-14

No-Problem Code

__kernel void global_bandwidth_vec2(const int heightA, const int widthA, __global const float *a, __global float *b) {
    const int idx = get_global_id(0);
    const int step = idx << 1;

    CL_ELEM_TYPE value = *((__global CL_ELEM_TYPE *)(a + step));
    *((__global CL_ELEM_TYPE *)(b + step)) = value;
}

Execution log:

yuanshuai@firefly:~/code/OpenCL-101/bandwidth$ ./bandwidth 1024 1024 ./kernel.cl global_bandwidth_vec2 1 $[1024*1024/2] 1 1                                                                                                                                                       
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float2, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] cl_program_build_options: -D CL_ELEM_TYPE=float2
len: 1048576, data_size: 4194304, a_h: 0x7f9cf8f010 a_h+1: 0x7f9cf8f011 
============== CPU RESULT ==============
>>> [INFO] 1 times CPU starting...
>>> [INFO] CPU 1024 x 1024 0.001978 s 1060.238625 MFLOPS

>>> [INFO] bandwidth: 3.95 GB/s
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

============== GPU RESULT ==============
Mali-T86x MP4 r2p0 0x0860
>>> [INFO]program_file: ./kernel.cl
>>> [INFO] kernel_func: global_bandwidth_vec2
>>> [INFO] global_work_size[3]: { 524288, 1, 1 }
>>> [WARN] global work size (524288) is smaller than task size (1048576).

>>> [INFO] 1 times ./kernel.cl.global_bandwidth_vec2 starting...
>>> [INFO] skip first time.
>>> [INFO] CL_GPU 1024 x 1024 0.002157 s 972.254057 MFLOPS ./kernel.cl

>>> [INFO] bandwidth: 3.62 GB/s
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

half16

No-Problem Code

__kernel void global_bandwidth_vec16(const int heightA, const int widthA, __global const half *a, __global half *b) {
    const int idx = get_global_id(0);
    const int step = idx << 4;

    CL_ELEM_TYPE value = *((__global CL_ELEM_TYPE *)(a + step));
    *((__global CL_ELEM_TYPE *)(b + step)) = value;
}

Execution Log:

yuanshuai@firefly:~/code/OpenCL-101/bandwidth$ ./bandwidth 1024 1024 ./kernel.cl global_bandwidth_vec16 1 $[1024*1024/16] 1 1                                    
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half16, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] cl_program_build_options: -D CL_ELEM_TYPE=half16
len: 1048576, data_size: 2097152, a_h: 0x7f8ac7f010 a_h+1: 0x7f8ac7f011 
============== CPU RESULT ==============
>>> [INFO] 1 times CPU starting...
>>> [INFO] CPU 1024 x 1024 0.001030 s 2036.069903 MFLOPS

>>> [INFO] bandwidth: 3.79 GB/s
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

============== GPU RESULT ==============
Mali-T86x MP4 r2p0 0x0860
>>> [INFO]program_file: ./kernel.cl
>>> [INFO] kernel_func: global_bandwidth_vec16
>>> [INFO] global_work_size[3]: { 65536, 1, 1 }
>>> [WARN] global work size (65536) is smaller than task size (1048576).

>>> [INFO] 1 times ./kernel.cl.global_bandwidth_vec16 starting...
>>> [INFO] skip first time.
>>> [INFO] CL_GPU 1024 x 1024 0.001019 s 2058.049068 MFLOPS ./kernel.cl

>>> [INFO] bandwidth: 3.83 GB/s
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

Problem Code

__kernel void global_bandwidth_vec16_v2(const int heightA, const int widthA, __global half16 *a, __global half16 *b) {
    const int idx = get_global_id(0);
    const int step = idx << 4;

    half16 value = *(a + step);
    *(b + step) = value;
}

Execution log:

yuanshuai@firefly:~/code/OpenCL-101/bandwidth$ ./bandwidth 1024 1024 ./kernel.cl global_bandwidth_vec16_v2 1 $[1024*1024/16] 1 1
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half16, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] cl_program_build_options: -D CL_ELEM_TYPE=half16
len: 1048576, data_size: 2097152, a_h: 0x7f844ab010 a_h+1: 0x7f844ab011 
============== CPU RESULT ==============
>>> [INFO] 1 times CPU starting...
>>> [INFO] CPU 1024 x 1024 0.001205 s 1740.375104 MFLOPS

>>> [INFO] bandwidth: 3.24 GB/s
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

============== GPU RESULT ==============
Mali-T86x MP4 r2p0 0x0860
>>> [INFO]program_file: ./kernel.cl
>>> [INFO] kernel_func: global_bandwidth_vec16_v2
>>> [INFO] global_work_size[3]: { 65536, 1, 1 }
>>> [WARN] global work size (65536) is smaller than task size (1048576).

>>> [INFO] 1 times ./kernel.cl.global_bandwidth_vec16_v2 starting...
>>> [INFO] skip first time.
>>> [INFO] CL_GPU 1024 x 1024 0.001318 s 1591.162367 MFLOPS ./kernel.cl

>>> [INFO] bandwidth: 2.96 GB/s
>>> [ERROR] failed to copy data from device to host.-14
ysh329 commented 6 years ago

That is, if using typeN on device, the input-variable-type of kernel function must be type, not typeN.

ysh329 commented 6 years ago

half

/*===================  CHANGE TYPE HERE ===================*/
// Type on Host (CPU)
#define     ELEM_TYPE                       __fp16
#define     ELEM_TYPE_STR                   "__fp16"
// Type on Device (GPU)
#define     CL_ELEM_TYPE                    cl_half
#define     CL_ELEM_TYPE_STR                "half2