Understand cl / cuda memory architecture #3

Closed ytgui closed 5 years ago

ytgui commented 5 years ago

OpenCL Memory Model

Overview

(figures: OpenCL memory model diagrams)

Description

OpenCL defines a four-level memory hierarchy for the compute device:

global memory: shared by all processing elements, but has high access latency (__global);
read-only memory: smaller, low latency, writable by the host CPU but not the compute devices (__constant);
local memory: shared by a group of processing elements (__local);
private memory: private to each processing element (registers; __private).

Not every device needs to implement each level of this hierarchy in hardware. Consistency between the various levels in the hierarchy is relaxed, and only enforced by explicit synchronization constructs, notably barriers.

Devices may or may not share memory with the host CPU. The host API provides handles on device memory buffers and functions to transfer data back and forth between host and devices.
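For orientation, the same four levels exist in CUDA under different names (global memory, __constant__, __shared__, and registers). Below is a minimal, hypothetical CUDA sketch of where each level appears; the kernel name, buffer names, and the 256-thread block size are assumptions, and __syncthreads() plays the role of the explicit barrier mentioned above.

__constant__ float scale;                            // analogous to OpenCL __constant: read-only for kernels, set by the host via cudaMemcpyToSymbol

__global__ void scale_copy(const float *in, float *out, int n)   // in/out point into global memory (__global)
{
    __shared__ float tile[256];                      // analogous to OpenCL __local: one copy per block
    int gid = blockIdx.x * blockDim.x + threadIdx.x; // ordinary local variable, kept in a register (__private)

    if (gid < n)
        tile[threadIdx.x] = scale * in[gid];         // global -> shared (assumes the launch uses <= 256 threads per block)
    __syncthreads();                                 // barrier: makes the shared-memory writes visible to the whole block

    if (gid < n)
        out[gid] = tile[threadIdx.x];                // shared -> global
}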

ytgui commented 5 years ago

CUDA

Overview

(figure: CUDA memory overview diagram)

Description

shared memory

  1. The CUDA C compiler treats variables declared in shared memory differently from ordinary variables: it creates one copy of the variable for each block that you launch on the GPU.
  2. Every thread in a block shares that block's copy, but threads cannot see or modify the copies that belong to other blocks.
  3. This provides an excellent means by which threads within a block can communicate and collaborate on computations.
  4. Furthermore, shared memory buffers reside physically on the GPU, as opposed to off-chip DRAM. Because of this, the latency to access shared memory tends to be far lower than for typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad (see the sketch after this list).
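As a concrete illustration, below is a minimal per-block reduction sketch; the kernel name and the fixed 256-thread, power-of-two block size are assumptions. Each block stages its slice of global memory in __shared__ storage and its threads collaborate on a partial sum.

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float cache[256];                 // one copy per block, resident on-chip
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    cache[tid] = (gid < n) ? in[gid] : 0.0f;     // stage global data into the block's scratchpad
    __syncthreads();                             // every thread in the block must reach this point

    for (int s = blockDim.x / 2; s > 0; s >>= 1) // tree reduction within the block (assumes power-of-two block size)
    {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = cache[0];              // one partial sum per block, written back to global memory
}

// hypothetical launch: block_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);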

constant memory

  1. There are so many ALUs on graphics processors that sometimes we just can't keep the input coming to them fast enough to sustain such high rates of computation.
  2. Constant memory is one way to reduce the amount of memory traffic required for a given problem.
  3. NVIDIA hardware provides 64KB of constant memory that it treats differently than standard global memory. In some situations, using constant memory rather than global memory will reduce the required memory bandwidth (see the sketch after this list).
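A minimal sketch of the typical usage pattern follows; the array name, its size, and the kernel are assumptions. The __constant__ qualifier places the data in the 64KB constant space, and the host fills it with cudaMemcpyToSymbol rather than an ordinary cudaMemcpy.

__constant__ float coeff[64];                        // lives in the 64KB constant space: cached, read-only for kernels

__global__ void apply_coeff(const float *in, float *out, int n, int k)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = coeff[k] * in[gid];               // every thread in the warp reads the same element: a broadcast read
}

// host side (error checking omitted):
// float host_coeff[64] = { /* ... */ };
// cudaMemcpyToSymbol(coeff, host_coeff, sizeof(host_coeff));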

texture memory

  1. Read-only memory used by CUDA programs.
  2. Used in general-purpose computing for accuracy and efficiency.
  3. Originally designed for the DirectX and OpenGL rendering pipelines.

more

Texture memory is optimized for 2D spatial locality (which is where it gets its name). You can think of constant memory as taking advantage of temporal locality.

The benefits of texture memory over constant memory can be summarized as follows:

  1. Spatial locality.
  2. The addressing calculations can be performed by dedicated hardware outside of the kernel.
  3. Data can be accessed by different variables in a single operation.
  4. 8-bit and 16-bit data can be automatically converted to floating-point numbers between 0 and 1.

more 2

  1. Constant memory is optimized for broadcast, i.e. when the threads in a warp all read the same memory location. If they are reading different locations, it will work, but each different location referenced by a warp costs more time. When a read is being broadcast to the threads, constant memory is MUCH faster than texture memory.
  2. Texture memory has high latency, even for cache hits. You can think of it as a bandwidth aggregator - if there's reuse that can be serviced out of the texture cache, the GPU does not have to go out to external memory for those reads. For 2D and 3D textures, the addressing has 2D and 3D locality, so cache line fills pull in 2D and 3D blocks of memory instead of rows.
  3. Finally, the texture pipeline can perform "bonus" calculations: handling boundary conditions ("texture addressing") and converting 8- and 16-bit values to floats in the range [0, 1] are examples of operations that come "for free" (they are also part of the reason texture reads have high latency).
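For completeness, below is a minimal sketch of reading through the texture path with the texture-object API. The names, the 1D linear layout, and the launch configuration are assumptions; a real 2D use case would bind a cudaArray instead to benefit from the 2D locality discussed above.

__global__ void read_through_tex(cudaTextureObject_t tex, float *out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = tex1Dfetch<float>(tex, gid);      // the read is serviced through the texture cache
}

// host side: wrap an existing device buffer d_in (n floats) in a texture object
// cudaResourceDesc res = {};
// res.resType = cudaResourceTypeLinear;
// res.res.linear.devPtr = d_in;
// res.res.linear.desc = cudaCreateChannelDesc<float>();
// res.res.linear.sizeInBytes = n * sizeof(float);
// cudaTextureDesc td = {};
// td.readMode = cudaReadModeElementType;
// cudaTextureObject_t tex = 0;
// cudaCreateTextureObject(&tex, &res, &td, nullptr);
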
ytgui commented 5 years ago

https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
https://www.quora.com/What-is-a-warp-and-how-is-it-different-from-a-thread-block-or-wave-in-CUDA

ytgui commented 5 years ago
// -----
// cuda: grid -> block -> thread, __shared__
// cl:   nd_range -> work_group -> work_item, __local
// -----
// cl:   size_t tid = get_local_id(0);
// cuda: size_t tid = threadIdx.x;
// -----
// cl:   size_t gid = get_global_id(0);
// cuda: size_t gid = blockIdx.x * blockDim.x + threadIdx.x;
// -----
// cl:   size_t window = get_local_size(0);
// cuda: size_t window = blockDim.x;
// -----
// cl:   size_t stride = get_global_size(0);
// cuda: size_t stride = gridDim.x * blockDim.x;
// -----
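Putting the index mappings above together, a common pattern is the grid-stride loop; below is a minimal CUDA sketch (kernel and variable names are assumptions), with the OpenCL equivalents of the index expressions noted in the comments.

__global__ void saxpy_strided(float a, const float *x, float *y, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // get_global_id(0)
    int stride = gridDim.x * blockDim.x;              // get_global_size(0)

    for (int i = gid; i < n; i += stride)             // grid-stride loop: correct for any grid size
        y[i] = a * x[i] + y[i];
}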