ysh329 opened 4 years ago
https://developer.qualcomm.com/software/adreno-gpu-sdk/faq
This SDK includes usage examples for Qualcomm Technologies extensions to OpenCL including:
The OpenCL SDK version 1.2 contains many new examples, including:
Adreno GPU SDK - Tutorial Videos - Qualcomm Developer Network https://developer.qualcomm.com/software/adreno-gpu-sdk/tutorial-videos
Adreno Hardware Tutorial 1: Graphics Pipeline Overview Watch an overview of the mobile Graphics Pipeline which describes the data flow and processing of graphics data from software to hardware. The APIs available on mobile devices are also covered.
Adreno Hardware Tutorial 2: Introduction to the Adreno GPU The Adreno Graphics Processing Unit (GPU) is designed to bring console and PC quality 3D games to mobile devices. This video introduces the viewer to the Adreno GPU, some of its key features, and how it relates to the Qualcomm Snapdragon processor.
Adreno Hardware Tutorial 3: Tile Based Rendering This video investigates the commonly used technique of Tile Based Rendering on mobile devices by focusing specifically on Adreno GPUs using a toon shading sample.
Adreno Hardware Tutorial 4: Detecting Adreno GPU This video demonstrates a simple way to detect Adreno GPUs using C++ (Native) code.
OpenCL Tips · yszheda/wiki Wiki https://github.com/yszheda/wiki/wiki/OpenCL-Tips
Sub-optimal performance on Qualcomm Adreno GPUs · Issue #228 · CNugteren/CLBlast https://github.com/CNugteren/CLBlast/issues/228
Float16 GEMM on Adreno 330 · Issue #181 · CNugteren/CLBlast https://github.com/CNugteren/CLBlast/issues/181
float16 (half precision) does not give a reliably exact result
local work size vs. work group size
OpenCL global work size vs local work size
In both cases the global size is 1024. In case 1, the local size is 128 and this results in an execution partition that creates 8 work-groups, each of which will iterate through 128 work-items. In case 2, the local size is changed to 256 and this results in 4 work-groups, each with 256 work-items.
Understanding Kernels, Work-groups and Work-items — TI OpenCL User's Guide https://downloads.ti.com/mctools/esd/docs/opencl/execution/kernels-workgroups-workitems.html
```cpp
double OpenCLRuntime::getCostTime(const cl::Event *event) {
    mCommandQueuePtr->finish();
    mStartNanos = event->getProfilingInfo<CL_PROFILING_COMMAND_START>();
    mStopNanos  = event->getProfilingInfo<CL_PROFILING_COMMAND_END>();
    return (mStopNanos - mStartNanos) / 1000000.0;  // ns -> ms
}

double OpenCLRuntime::getQueuedTime(const cl::Event *event) {
    mCommandQueuePtr->finish();
    return (event->getProfilingInfo<CL_PROFILING_COMMAND_START>() -
            event->getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>()) / 1000000.0;  // ns -> ms
}

double OpenCLRuntime::getSubmitTime(const cl::Event *event) {
    mCommandQueuePtr->finish();
    return (event->getProfilingInfo<CL_PROFILING_COMMAND_START>() -
            event->getProfilingInfo<CL_PROFILING_COMMAND_SUBMIT>()) / 1000000.0;  // ns -> ms
}
```
GPU advantages
When the GPU matches the CPU's peak frame rate, it consumes only half the power of the CPU. This claim comes from "An Independent Evaluation of Implementing Computer Vision Functions with OpenCL on the Qualcomm Adreno 420" | Berkeley Design Technology, Inc., July 2015; the original passage reads as follows:
The key takeaway of this Adreno 420-based article is that the algorithm implementation must maximize parallelism and fit the GPU's memory system and core architecture; the article discusses the following points:
CL_INVALID_KERNEL_ARGS
CL_INVALID_KERNEL_ARGS if the kernel argument values have not been specified.
clEnqueueNDRangeKernel https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html
printf_buffer_metadata corrupt! Var a8x4 is of float vector type, but format symbol %d used, which should be corrected as below:
Debug
There is a printf function available, which is very convenient, and it can also print vectors. The khronos.org OpenCL 1.2 and 2.0 documents describe this printf function identically. To print a vector: `printf("f4 = %2.2v4hlf\n", f4);`, where f4 is of type float4. So far the GPUs that can print are mainly Qualcomm Snapdragon SoCs, though even the Snapdragon line has exceptions: on what seemed to be a Snapdragon 410 GPU, adding a printf made the program hang when run from an ADB shell, and commenting the printf out avoided the hang. Perhaps that 410 does not support printf? Not certain. Mali, however, cannot print at all. Some debugging tips:

- Control printing with a PRINT_KERNEL macro. This effectively gives two modes, benchmark and debug, that are easy to switch between.
- Loop the problem size upward from 1 to find the smallest failing size. This is especially handy for kernels implemented with Image2D: besides making a small failing case easy to find, write a piece of CPU code that simulates Image2D so you can locate the Image2D-to-Buffer index mapping at that failing size better and faster. For example, if the first incorrect thread starts at global_work_size id (0, 0, 1), use the CPU simulation to find the corresponding Buffer-form index.
- Set the loop's upper bound based on the kernel's tiling. For example, if each thread of a gemm kernel computes a 4x8 tile of the C matrix, the unit test can run sizes up to (4 + 3) x (8 + 7), just past one tile, as the upper bound (covering both under- and over-boundary cases), incrementing by 1 each time. Such small sizes are also easy to reproduce and compute by hand when chasing a bug later.
- Print global_id and the parameter each of its components represents, such as row, col, and batch_idx.
- Use a LOOP_TEST define to sweep every size from the smallest to the largest; in the #else branch, set up a check for a single failing bad case.
- Use #if 0 / #else to bisect which part of the kernel is wrong; the same method also works for debugging CPU kernels.
- For c = a * b, when loading a and b under column-major storage, if a and b have fewer than 4 columns but each load fetches 4 elements with vload4, then for example in float4 b4x4[4] the first element b4x4[0] will load garbage into b4x4[0].s1, b4x4[0].s2, and b4x4[0].s3. Matrices b and c have the same issue, but in practice it does not matter, because the store side handles the boundary and never writes those values into the result.
- Add an if-else so that only one chosen thread prints, which makes the problem easier to isolate (and easier to check by hand).
- With the PRINT_KERNEL macro enabled the result was correct; with it disabled, the result was partially wrong. The guess was a memory overwrite, so the memory addresses were printed, which confirmed that two different threads wrote to the same address (two distinct threads printed the same address). The root cause was an insufficient upper-bound condition on the boundary for-loop when storing the result matrix.
- For more methods, see "How to debug — MACE documentation".
Performance
- local work size setting: debugging showed no real performance difference on MobileNetV1 between the default NullRange and my {16, 16}; needs a closer look (worth searching around).
- Try different orderings of the three global work size values and see how performance changes (also worth a search).
- Buffer vs. Image