ysh329 / OpenCL-101

Learn OpenCL step by step.

common Error Q&A #28

Open ysh329 opened 4 years ago

ysh329 commented 4 years ago

GPU advantages

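When the GPU is throttled to match the CPU's highest frame rate, it consumes only half as much power as the CPU. This claim comes from An Independent Evaluation of Implementing Computer Vision Functions with OpenCL on the Qualcomm Adreno 420 | Berkeley Design Technology, Inc. July 2015. The original text reads: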

Qualcomm has reported that the GPU mode of the demo consumes half as much power as the CPU mode when throttling the frame rate of the GPU mode to match the highest frame rate achieved in the CPU mode.

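The main point of this Adreno 420 based article is that an algorithm implementation must maximize parallelism and fit the GPU's memory system and core architecture. The article discusses the following points: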

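  1. Minimize memory copies between the GPU and the CPU. The Snapdragon 855 uses an Adreno 640 GPU, for which OpenCL-Z reports:
    • Host to Device: 10.51 GByte/s
    • Device to Host: 4.54 GByte/s
    • Device to Device: 23.12 GByte/s
     In other words, avoid alternating between CPU and GPU when the layers of a model run back to back. Especially when the next layer's feature map is very large, keep the computation on the GPU, because downloading the data is slow and the switch is unlikely to pay off.
  2. Carefully manage the limited fast local memory (Local Memory).
  3. Even with a high-level language such as OpenCL, you must understand the GPU's core architectural features and write code that fits them. For example, code should minimize branching and use the most suitable SIMD data types.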
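As a rough illustration of the first point, the sketch below estimates how long it takes to bounce one feature map from the GPU to the CPU and back, using the OpenCL-Z bandwidth figures quoted above. The function names are hypothetical, 1 GByte is assumed to be 1e9 bytes, and fixed per-call latency is ignored, so real transfers will be slower.

```cpp
#include <cassert>
#include <cstddef>

// Bandwidths reported by OpenCL-Z for the Adreno 640 (quoted above), in GByte/s.
constexpr double kHostToDeviceGBps = 10.51;
constexpr double kDeviceToHostGBps = 4.54;

// Estimated time in milliseconds to move `bytes` at `gbytesPerSec`.
// Assumption: 1 GByte = 1e9 bytes; per-call latency is not modelled.
double transferMillis(std::size_t bytes, double gbytesPerSec) {
    assert(gbytesPerSec > 0.0);
    return static_cast<double>(bytes) / (gbytesPerSec * 1e9) * 1e3;
}

// Cost of moving one feature map GPU -> CPU -> GPU between two layers.
double roundTripMillis(std::size_t bytes) {
    return transferMillis(bytes, kDeviceToHostGBps) +
           transferMillis(bytes, kHostToDeviceGBps);
}
```

For a hypothetical 64x112x112 float32 feature map (3,211,264 bytes), `roundTripMillis` gives roughly 1 ms per CPU/GPU switch, which adds up quickly if several layers fall back to the CPU.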

CL_INVALID_KERNEL_ARGS

CL_INVALID_KERNEL_ARGS if the kernel argument values have not been specified.

clEnqueueNDRangeKernel https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html

printf_buffer_metadata corrupt!

printf("===== a:%d\n", a8x4[0].s0);

The variable a8x4 is a float vector type, but the %d format specifier was used. It should be corrected as follows:

printf("===== a:%f\n", a8x4[0].s0);

Debug

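A printf function is available and very convenient, and it can also print vectors. The khronos.org documentation for this printf function is the same in OpenCL 1.2 and 2.0. To print a vector, use printf("f4 = %2.2v4hlf\n", f);, where f is of type float4.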

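So far, the GPUs found to support printf are mainly those in Qualcomm Snapdragon SoCs, but there are exceptions within the Snapdragon family. On what appeared to be a Snapdragon 410 GPU, a kernel with printf hung when run from an ADB shell, and commenting out the printf removed the hang. Perhaps the 410 does not support printf, but this is not confirmed. Mali GPUs, however, cannot print.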


        #ifdef PRINT_KERNEL
        if (row == 0 && col == 0 && bidx == 0) {
            for (int i = 0; i < 8; ++i) {
                printf("row = col = bidx = 0 initialize c8x4[%d] = %2v4hlf\n", i, c8x4[i]);
            }
        }
        #endif
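The guarded block above only compiles in when PRINT_KERNEL is defined. One way to do that from the host is to add a `-D` macro definition to the options string passed to clBuildProgram. The helper below is a minimal sketch of building that string; the function name and the `baseOptions` parameter are hypothetical.

```cpp
#include <cassert>
#include <string>

// Builds the options string for clBuildProgram (or cl::Program::build).
// Appending "-D PRINT_KERNEL" defines the macro, so the
// #ifdef PRINT_KERNEL block in the kernel source is compiled in.
std::string kernelBuildOptions(const std::string& baseOptions, bool enablePrintf) {
    std::string options = baseOptions;
    if (enablePrintf) {
        if (!options.empty()) {
            options += ' ';  // separate from any existing options
        }
        options += "-D PRINT_KERNEL";
    }
    return options;
}
```

Keeping the switch on the host side means the printf calls can be turned off for devices where they hang or are unsupported, without editing the kernel source.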

For more approaches, see "How to debug" in the MACE documentation.

Performance

Buffer Vs. Image

ysh329 commented 4 years ago

Adreno GPU SDK - FAQs - Qualcomm Developer Network

https://developer.qualcomm.com/software/adreno-gpu-sdk/faq

What is included in the Adreno SDK for OpenCL?

This SDK includes usage examples for Qualcomm Technologies extensions to OpenCL including:

What is new in the Adreno SDK for OpenCL v1.2?

The OpenCL SDK version 1.2 contains many new examples, including:

OpenCL Optimization from Qualcomm

OpenCL Ref. from Qualcomm

Hardware Tutorials

Adreno GPU SDK - Tutorial Videos - Qualcomm Developer Network https://developer.qualcomm.com/software/adreno-gpu-sdk/tutorial-videos

Others

ysh329 commented 4 years ago

OpenCL Tips · yszheda/wiki Wiki https://github.com/yszheda/wiki/wiki/OpenCL-Tips

ysh329 commented 4 years ago

Sub-optimal performance on Qualcomm Adreno GPUs · Issue #228 · CNugteren/CLBlast https://github.com/CNugteren/CLBlast/issues/228

ysh329 commented 4 years ago

Float16 GEMM on Adreno 330 · Issue #181 · CNugteren/CLBlast https://github.com/CNugteren/CLBlast/issues/181

There is no conclusive result for float16 yet.

ysh329 commented 4 years ago

local work size and work-group size

Opencl global work size vs local work size

In both cases the global size is 1024. In case 1, the local size is 128 and this results in an execution partition that creates 8 work-groups, each of which will iterate through 128 work-items. In case 2, the local size is changed to 256 and this results in 4 work-groups, each with 256 work-items.

Understanding Kernels, Work-groups and Work-items — TI OpenCL User's Guide https://downloads.ti.com/mctools/esd/docs/opencl/execution/kernels-workgroups-workitems.html
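The arithmetic in the quoted example can be sketched in a few lines. The helper names below are hypothetical. The first function reproduces the 1024/128 and 1024/256 cases, and the second shows the common trick of rounding a problem size up to a multiple of the local size, since OpenCL 1.x requires the global size to be evenly divisible by the local size.

```cpp
#include <cassert>
#include <cstddef>

// Number of work-groups along one dimension.
// OpenCL 1.x requires globalSize to be a multiple of localSize.
std::size_t workGroupCount(std::size_t globalSize, std::size_t localSize) {
    assert(localSize > 0 && globalSize % localSize == 0);
    return globalSize / localSize;
}

// Rounds problemSize up to the next multiple of localSize so it can be
// used as a global size; the kernel must then bounds-check its index.
std::size_t roundUpGlobalSize(std::size_t problemSize, std::size_t localSize) {
    assert(localSize > 0);
    return (problemSize + localSize - 1) / localSize * localSize;
}
```

With a global size of 1024, `workGroupCount` returns 8 for a local size of 128 and 4 for a local size of 256, matching the two cases quoted above.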

ysh329 commented 4 years ago
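// All three helpers return milliseconds (OpenCL profiling timestamps are in
// nanoseconds) and need a queue created with CL_QUEUE_PROFILING_ENABLE.

// Kernel execution time: END - START.
double OpenCLRuntime::getCostTime(const cl::Event *event){
    mCommandQueuePtr->finish();  // wait until the event has completed
    mStartNanos = event->getProfilingInfo<CL_PROFILING_COMMAND_START>();
    mStopNanos = event->getProfilingInfo<CL_PROFILING_COMMAND_END>();
    return (mStopNanos - mStartNanos) / 1000000.0;
}

// Time from host enqueue to start of execution: START - QUEUED.
double OpenCLRuntime::getQueuedTime(const cl::Event *event){
    mCommandQueuePtr->finish();
    return (event->getProfilingInfo<CL_PROFILING_COMMAND_START>() - event->getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>()) / 1000000.0;
}

// Time from submission to the device to start of execution: START - SUBMIT.
double OpenCLRuntime::getSubmitTime(const cl::Event *event){
    mCommandQueuePtr->finish();
    return (event->getProfilingInfo<CL_PROFILING_COMMAND_START>() - event->getProfilingInfo<CL_PROFILING_COMMAND_SUBMIT>()) / 1000000.0;
}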
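The helpers above measure START minus QUEUED and START minus SUBMIT, so the two intervals overlap. As an alternative view, the sketch below splits the same four timestamps into three non-overlapping phases. It is plain C++ with no OpenCL dependency; the struct and function names are hypothetical, and the timestamps are assumed to have been read with getProfilingInfo beforehand.

```cpp
#include <cassert>
#include <cstdint>

// The four profiling timestamps of one event, in nanoseconds.
// Hypothetical helper type, not part of the OpenCL API.
struct EventTimestamps {
    std::uint64_t queued;  // CL_PROFILING_COMMAND_QUEUED
    std::uint64_t submit;  // CL_PROFILING_COMMAND_SUBMIT
    std::uint64_t start;   // CL_PROFILING_COMMAND_START
    std::uint64_t end;     // CL_PROFILING_COMMAND_END
};

// Non-overlapping phases of the event, in milliseconds.
struct EventBreakdownMs {
    double queueWait;   // QUEUED -> SUBMIT: waiting in the host queue
    double submitWait;  // SUBMIT -> START: waiting on the device
    double execution;   // START -> END: running on the device
};

inline double nanosToMillis(std::uint64_t nanos) {
    return nanos / 1000000.0;
}

EventBreakdownMs breakdown(const EventTimestamps& t) {
    // Timestamps of a completed event are expected to be ordered.
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);
    return EventBreakdownMs{
        nanosToMillis(t.submit - t.queued),
        nanosToMillis(t.start - t.submit),
        nanosToMillis(t.end - t.start),
    };
}
```

The sum of `queueWait` and `submitWait` equals what getQueuedTime reports, and `execution` equals what getCostTime reports.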