ysh329 opened 3 years ago
We’ve designed and integrated a new OpenCL tuner in the coming 19.05 release.
In fact, in the last few months we’ve seen a huge amount of interest in this simple but effective method, which finds the optimal Local-Work-Group Size (LWS) for each OpenCL kernel configuration to deliver high performance on Mali GPUs.
Don’t worry if you don’t know what the LWS is or how the OpenCL tuner can be used in the Compute Library. I’ve got a couple of resources that will tell you all you need to know.
Resource #1 is a presentation that I gave at the Embedded Vision Summit 2018. The presentation looks at how Winograd convolution layers work, but also gives an overview of OpenCL, including the LWS. Resource #2 is documentation that explains how the OpenCL tuner can be used in the Compute Library, along with a few recommendations.
Dynamic Random Access Memory (DRAM) is a type of semiconductor memory whose basic operating principle is to represent a binary bit (1 or 0) by the amount of charge stored in a capacitor. Because real transistors leak current, the charge on the capacitor eventually becomes insufficient to correctly distinguish the data, corrupting it. Periodic recharging is therefore unavoidable for DRAM; this need for regular refresh is why it is called "dynamic" memory. By contrast, static RAM (SRAM) retains its data after it is written, even without refreshing.
Making the most of Arm NN for GPU inference: OpenCL Tuner https://community.arm.com/developer/ip-products/processors/b/ml-ip-blog/posts/arm-nn-gpu-inference-with-opencl-tuner
OpenCL tuner
ACL implements the so-called Local Work-group Size (LWS) tuner. The idea is to improve cache utilization at the L1 and L2 levels and to reduce accesses to global memory as much as possible.
Figure 2 shows a basic representation of the OpenCL architecture. The compute device can be a GPU, a CPU, or an accelerator. Inside the compute device we have several compute units (GPU core, CPU core, and so on). Each of them has its own L1 memory cache and can execute N threads in parallel, known as work-items. Each thread executes the same piece of code corresponding to an OpenCL kernel, where the thread ID is used to access different memory locations.
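The execution model above can be sketched with a toy simulation (not real OpenCL): every work-item runs the same kernel body and only its ID differs, so each thread touches a distinct memory location. The function names here are illustrative, not part of the OpenCL API.

```python
# Toy model of an OpenCL NDRange dispatch: work-items grouped into
# work-groups, each work-item selecting its data by global ID.

def kernel_body(global_id, a, b, out):
    """Same code for every work-item; only the ID differs."""
    out[global_id] = a[global_id] + b[global_id]

def run_ndrange(global_size, local_size, kernel, *buffers):
    """Dispatch `global_size` work-items in work-groups of `local_size`."""
    assert global_size % local_size == 0, "global size must be a multiple of LWS"
    for group in range(global_size // local_size):
        # In hardware, the work-items of one work-group run in parallel
        # on a single compute unit; here we just loop over them.
        for local_id in range(local_size):
            gid = group * local_size + local_id
            kernel(gid, *buffers)

a = list(range(8))
b = [10] * 8
out = [0] * 8
run_ndrange(8, 4, kernel_body, a, b, out)
print(out)  # [10, 11, 12, 13, 14, 15, 16, 17]
```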
Figure 2: OpenCL architecture and memory caches.
To improve L1 memory cache utilization we want the threads of the same work-group to access consecutive memory addresses (memory coalescing).
To optimize L2 cache utilization, we want the compute units to reuse the same memory block.
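The coalescing effect described above can be illustrated with a toy cache-line model: consecutive addresses from the threads of one work-group land on very few cache lines, while strided addresses pull in one line per thread. The 64-byte line size (16 floats) is an assumption for the sketch, not a Mali-specific figure.

```python
# Toy cache-line model: count how many distinct L1 lines a work-group's
# memory accesses touch, for coalesced vs strided access patterns.

LINE_FLOATS = 16  # assumed 64-byte cache line holding 16 4-byte floats

def lines_touched(addresses):
    """Number of distinct cache lines a set of float accesses pulls in."""
    return len({addr // LINE_FLOATS for addr in addresses})

work_group = range(16)                      # 16 work-items in the group
coalesced = [wi for wi in work_group]       # work-item i reads element i
strided = [wi * 16 for wi in work_group]    # work-item i reads element 16*i

print(lines_touched(coalesced))  # 1  -> one L1 line serves the whole group
print(lines_touched(strided))    # 16 -> one line fetched per work-item
```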
To achieve these optimizations for the L1 and L2 memory caches, ACL implements a Local Work-group Size (LWS) tuner that finds the optimal configuration for each OpenCL kernel type. For a more detailed explanation, you can read this blog and watch this presentation. The impact of the LWS tuner on inference performance can be huge: the speed-up ranges between 1.12x and 1.8x for different networks, as you can see in the picture below for the three different CL Tuner modes.
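Conceptually, an exhaustive LWS tuner times the same kernel under every candidate local-work-group size and keeps the fastest. The sketch below is a pure-Python stand-in, not ACL's actual implementation; the candidate list and the dummy workload are illustrative assumptions.

```python
# Conceptual sketch of exhaustive LWS tuning: measure each candidate,
# keep the one with the lowest runtime.
import timeit

def run_kernel(n, lws):
    # Stand-in workload; a real tuner would enqueue the OpenCL kernel
    # with this LWS and measure the device execution time instead.
    return sum(i % lws for i in range(n))

def tune_lws(n, candidates):
    """Return the candidate LWS with the lowest measured runtime."""
    timings = {}
    for lws in candidates:
        if n % lws != 0:  # the LWS must divide the global work size
            continue
        timings[lws] = timeit.timeit(lambda: run_kernel(n, lws), number=3)
    return min(timings, key=timings.get)

best = tune_lws(n=1024, candidates=[1, 2, 4, 8, 16, 32])
print("best LWS:", best)
```

ACL's three CL Tuner modes trade tuning time for quality along exactly this axis: a larger candidate set (exhaustive) finds a better LWS but takes longer to tune than a reduced one (rapid).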
The pictures above show Streamline captures before (top) and after (bottom) enabling the OpenCL tuner. Focus on the non-fragment queue activity (orange curve) in the GPU usage section: the highlighted interval marks the start and end of the ML inference process on the GPU. Note that with the tuner enabled, the inference interval is shorter (18 ms) than before enabling it (24 ms), which means a 25% improvement in inference performance. The improvement varies with the hardware and the network type. The screenshots correspond to the inference of a segmentation network running on a Mali-G72 MP12 GPU, inside a Unity application processing a smartphone video stream.