ysh329 / OpenCL-101

Learn OpenCL step by step.

[Competitive analysis] Making the most of Arm NN for GPU inference: OpenCL Tuner #35

Open ysh329 opened 3 years ago

ysh329 commented 3 years ago

Making the most of Arm NN for GPU inference: OpenCL Tuner https://community.arm.com/developer/ip-products/processors/b/ml-ip-blog/posts/arm-nn-gpu-inference-with-opencl-tuner

OpenCL tuner

ACL implements a so-called Local Work-group Size (LWS) tuner. The idea is to improve cache utilization at the L1 and L2 levels and to reduce global memory accesses as much as possible.

Figure 2 shows a basic representation of the OpenCL architecture. The compute device can be a GPU, a CPU, or an accelerator. Inside the compute device there are several compute units (a GPU core, a CPU core, and so on). Each of them has its own L1 memory cache and can execute N threads in parallel, known as work-items. Each thread executes the same piece of code, corresponding to an OpenCL kernel, where the thread ID is used to access different memory locations.
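As an illustrative sketch (not from the blog post), the relationship between work-groups, local IDs, and global IDs in a 1-D NDRange can be modelled in plain Python; the function name is my own, and the mapping mirrors OpenCL's `get_group_id` / `get_local_id` / `get_global_id` built-ins:

```python
# Sketch of how a 1-D OpenCL NDRange decomposes into work-groups and
# work-items. Each (group_id, local_id, global_id) triple corresponds to
# what one work-item would see via get_group_id(0), get_local_id(0),
# and get_global_id(0).
def enumerate_ndrange(global_size, local_size):
    assert global_size % local_size == 0, "global size must be a multiple of the LWS"
    ids = []
    for group_id in range(global_size // local_size):      # one iteration per work-group
        for local_id in range(local_size):                 # work-items inside the group
            global_id = group_id * local_size + local_id   # get_global_id(0)
            ids.append((group_id, local_id, global_id))
    return ids

# 8 work-items split into work-groups of 4: global IDs 0..7,
# group IDs 0 and 1, local IDs 0..3 within each group.
```

The work-items of one group share a compute unit (and hence its L1 cache), which is why the choice of local size affects cache behaviour.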

Figure 2: OpenCL architecture and memory caches.

To achieve these optimizations for the L1 and L2 memory caches, ACL implements a Local Work-group Size (LWS) tuner that finds the optimal configuration for each OpenCL kernel type. For a more detailed explanation, you can read this blog and watch this presentation. The impact of the LWS tuner on inference performance can be huge: speedups between 1.12x and 1.8x for different networks, as you can see in the picture below for the three different CL Tuner modes.
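The core idea of such a tuner can be sketched as a brute-force search over candidate local work-group sizes. Below is a minimal Python illustration, not ACL's actual implementation: the candidate list, the device limit, and the function names are assumptions, and the timing callback is a stand-in for launching the real kernel:

```python
import itertools

MAX_WORK_GROUP_SIZE = 256  # assumed device limit (cf. CL_DEVICE_MAX_WORK_GROUP_SIZE)

def valid_lws(global_size, lws):
    """A LWS candidate is usable if each dimension divides the global
    work size and the work-group fits the device limit."""
    fits = all(g % l == 0 for g, l in zip(global_size, lws))
    return fits and lws[0] * lws[1] <= MAX_WORK_GROUP_SIZE

def tune_lws(global_size, time_kernel):
    """Exhaustively time every valid (x, y) candidate and keep the
    fastest, mimicking what an exhaustive tuner mode does per kernel."""
    best_lws, best_time = None, float("inf")
    for lws in itertools.product([1, 2, 4, 8, 16, 32], repeat=2):
        if not valid_lws(global_size, lws):
            continue
        t = time_kernel(lws)  # a real tuner would launch and profile the kernel here
        if t < best_time:
            best_lws, best_time = lws, t
    return best_lws, best_time
```

A real tuner additionally caches the winning LWS per kernel configuration, so the search cost is paid only once and later runs simply reuse the stored result.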



The pictures above show Streamline captures before (top) and after (bottom) enabling the OpenCL Tuner. Focus on the non-fragment queue activity (orange curve) in the GPU usage section; the highlighted intervals mark the start and end of the ML inference on the GPU. Note that with the tuner enabled, the inference interval is shorter (18 ms) than before enabling it (24 ms). This means a 25% improvement in inference performance. The improvement varies with the hardware and the type of network. The screenshots correspond to inference of a segmentation network running on a Mali-G72 MP12 GPU, in a Unity application processing a smartphone video stream.
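The 25% figure follows directly from the two intervals reported in the capture (24 ms before tuning, 18 ms after); a quick arithmetic check:

```python
before_ms, after_ms = 24, 18                     # inference intervals from the Streamline capture
reduction = (before_ms - after_ms) / before_ms   # fraction of inference time saved
speedup = before_ms / after_ms                   # equivalent speedup factor
print(reduction, round(speedup, 2))              # 0.25 1.33
```

So a 25% shorter inference interval corresponds to roughly a 1.33x speedup, consistent with the 1.12x-1.8x range quoted earlier.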

ysh329 commented 3 years ago

https://www.youtube.com/watch?v=6lvzMB56Jnc&feature=youtu.be&ab_channel=EdgeAIandVisionAlliance

ysh329 commented 3 years ago

A new OpenCL tuner, in all flavors

We’ve designed and integrated a new OpenCL tuner in the coming 19.05 release.

In fact, in the last few months we’ve seen a huge amount of interest in this simple but effective method, which finds the optimal Local-Work-Group Size (LWS) for each OpenCL kernel configuration to deliver high performance on Mali GPU.

Don’t worry if you don’t know what the LWS is or how the OpenCL tuner can be used in the Compute Library. I’ve got a couple of resources that will tell you all you need to know.

Resource #1 is a presentation that I gave at the Embedded Vision Summit 2018. The presentation looks at how Winograd convolution layers work, but also gives an overview of OpenCL, including the LWS. Resource #2 is documentation that explains how the OpenCL tuner can be used in the Compute Library, along with a few recommendations.

ysh329 commented 3 years ago

Even Faster CNNs: Exploring the New Class of Winograd Algorithms

https://www.bilibili.com/video/av53072685/



ysh329 commented 3 years ago

Dynamic Random Access Memory (DRAM) is a semiconductor memory whose operating principle is to represent a binary bit (1 or 0) by the amount of charge stored in a capacitor. In practice, transistors exhibit leakage current, so the charge stored on the capacitor decays until it is no longer sufficient to reliably distinguish the data, corrupting it. Periodic recharging is therefore unavoidable for DRAM, and because of this need for regular refresh it is called "dynamic" memory. By contrast, static memory (SRAM) retains its contents once written, even without refresh.

https://baike.baidu.com/item/%E5%8A%A8%E6%80%81%E9%9A%8F%E6%9C%BA%E5%AD%98%E5%8F%96%E5%AD%98%E5%82%A8%E5%99%A8
