Understand cl / cuda memory architecture #3

Closed ytgui closed 5 years ago

ytgui commented 5 years ago

OpenCL Memory Model

Overview

(figures: OpenCL memory model diagrams)

Description

OpenCL defines a four-level memory hierarchy for the compute device:

global memory: shared by all processing elements, but has high access latency (__global);
read-only memory: smaller, low latency, writable by the host CPU but not the compute devices (__constant);
local memory: shared by a group of processing elements (__local);
private memory: private to each processing element (registers; __private).

Not every device needs to implement each level of this hierarchy in hardware. Consistency between the various levels in the hierarchy is relaxed, and only enforced by explicit synchronization constructs, notably barriers.

Devices may or may not share memory with the host CPU. The host API provides handles on device memory buffers and functions to transfer data back and forth between host and devices.
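For orientation, the same four levels exist in CUDA under different names (global memory, __constant__, __shared__, and registers). Below is a minimal, hypothetical CUDA sketch of where each level appears; the kernel name, buffer names, and the 256-thread block size are assumptions, and __syncthreads() plays the role of the explicit barrier mentioned above.

__constant__ float scale;                            // analogous to OpenCL __constant: read-only for kernels, set by the host via cudaMemcpyToSymbol

__global__ void scale_copy(const float *in, float *out, int n)   // in/out point into global memory (__global)
{
    __shared__ float tile[256];                      // analogous to OpenCL __local: one copy per block
    int gid = blockIdx.x * blockDim.x + threadIdx.x; // ordinary local variable, kept in a register (__private)

    if (gid < n)
        tile[threadIdx.x] = scale * in[gid];         // global -> shared (assumes the launch uses <= 256 threads per block)
    __syncthreads();                                 // barrier: makes the shared-memory writes visible to the whole block

    if (gid < n)
        out[gid] = tile[threadIdx.x];                // shared -> global
}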

ytgui commented 5 years ago

CUDA

Overview

(figure: CUDA memory overview diagram)

Description

shared memory

  1. The CUDA C compiler treats variables declared in shared memory differently from ordinary variables: it creates one copy of the variable for each block that you launch on the GPU.
  2. Every thread in a block shares that block's copy, but threads cannot see or modify the copies that belong to other blocks.
  3. This provides an excellent means by which threads within a block can communicate and collaborate on computations.
  4. Furthermore, shared memory buffers reside physically on the GPU, as opposed to off-chip DRAM. Because of this, the latency to access shared memory tends to be far lower than for typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad (see the sketch after this list).
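As a concrete illustration, below is a minimal per-block reduction sketch; the kernel name and the fixed 256-thread, power-of-two block size are assumptions. Each block stages its slice of global memory in __shared__ storage and its threads collaborate on a partial sum.

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float cache[256];                 // one copy per block, resident on-chip
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    cache[tid] = (gid < n) ? in[gid] : 0.0f;     // stage global data into the block's scratchpad
    __syncthreads();                             // every thread in the block must reach this point

    for (int s = blockDim.x / 2; s > 0; s >>= 1) // tree reduction within the block (assumes power-of-two block size)
    {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = cache[0];              // one partial sum per block, written back to global memory
}

// hypothetical launch: block_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);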

constant memory

  1. There are so many ALUs on graphics processors that sometimes we just can't keep the input coming to them fast enough to sustain such high rates of computation.
  2. Constant memory is one way to reduce the amount of memory traffic required for a given problem.
  3. NVIDIA hardware provides 64KB of constant memory that it treats differently than standard global memory. In some situations, using constant memory rather than global memory will reduce the required memory bandwidth (see the sketch after this list).
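A minimal sketch of the typical usage pattern follows; the array name, its size, and the kernel are assumptions. The __constant__ qualifier places the data in the 64KB constant space, and the host fills it with cudaMemcpyToSymbol rather than an ordinary cudaMemcpy.

__constant__ float coeff[64];                        // lives in the 64KB constant space: cached, read-only for kernels

__global__ void apply_coeff(const float *in, float *out, int n, int k)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = coeff[k] * in[gid];               // every thread in the warp reads the same element: a broadcast read
}

// host side (error checking omitted):
// float host_coeff[64] = { /* ... */ };
// cudaMemcpyToSymbol(coeff, host_coeff, sizeof(host_coeff));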

texture memory

  1. Read-only memory used by CUDA programs.
  2. Used in general-purpose computing for accuracy and efficiency.
  3. Originally designed for the DirectX and OpenGL rendering pipelines.

more

Texture memory is optimized for 2D spatial locality (which is where it gets its name). You can think of constant memory as taking advantage of temporal locality.

The benefits of texture memory over constant memory can be summarized as follows:

  1. Spatial locality.
  2. The addressing calculations can be performed by dedicated hardware outside of the kernel.
  3. Data can be accessed by different variables in a single operation.
  4. 8-bit and 16-bit data can be automatically converted to floating-point numbers between 0 and 1.

more 2

  1. Constant memory is optimized for broadcast, i.e. when the threads in a warp all read the same memory location. If they are reading different locations, it will work, but each different location referenced by a warp costs more time. When a read is being broadcast to the threads, constant memory is MUCH faster than texture memory.
  2. Texture memory has high latency, even for cache hits. You can think of it as a bandwidth aggregator - if there's reuse that can be serviced out of the texture cache, the GPU does not have to go out to external memory for those reads. For 2D and 3D textures, the addressing has 2D and 3D locality, so cache line fills pull in 2D and 3D blocks of memory instead of rows.
  3. Finally, the texture pipeline can perform "bonus" calculations: handling boundary conditions ("texture addressing") and converting 8- and 16-bit values to floats in the range [0, 1] are examples of operations that come "for free" (they are also part of the reason texture reads have high latency).
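For completeness, below is a minimal sketch of reading through the texture path with the texture-object API. The names, the 1D linear layout, and the launch configuration are assumptions; a real 2D use case would bind a cudaArray instead to benefit from the 2D locality discussed above.

__global__ void read_through_tex(cudaTextureObject_t tex, float *out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = tex1Dfetch<float>(tex, gid);      // the read is serviced through the texture cache
}

// host side: wrap an existing device buffer d_in (n floats) in a texture object
// cudaResourceDesc res = {};
// res.resType = cudaResourceTypeLinear;
// res.res.linear.devPtr = d_in;
// res.res.linear.desc = cudaCreateChannelDesc<float>();
// res.res.linear.sizeInBytes = n * sizeof(float);
// cudaTextureDesc td = {};
// td.readMode = cudaReadModeElementType;
// cudaTextureObject_t tex = 0;
// cudaCreateTextureObject(&tex, &res, &td, nullptr);
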
ytgui commented 5 years ago

https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
https://www.quora.com/What-is-a-warp-and-how-is-it-different-from-a-thread-block-or-wave-in-CUDA

ytgui commented 5 years ago
// -----
// cuda: grid -> block -> thread, __shared__
// cl:   nd_range -> work_group -> work_item, __local
// -----
// cl:   size_t tid = get_local_id(0);
// cuda: size_t tid = threadIdx.x;
// -----
// cl:   size_t gid = get_global_id(0);
// cuda: size_t gid = blockIdx.x * blockDim.x + threadIdx.x;
// -----
// cl:   size_t window = get_local_size(0);
// cuda: size_t window = blockDim.x;
// -----
// cl:   size_t stride = get_global_size(0);
// cuda: size_t stride = gridDim.x * blockDim.x;
// -----
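Putting the index mappings above together, a common pattern is the grid-stride loop; below is a minimal CUDA sketch (kernel and variable names are assumptions), with the OpenCL equivalents of the index expressions noted in the comments.

__global__ void saxpy_strided(float a, const float *x, float *y, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // get_global_id(0)
    int stride = gridDim.x * blockDim.x;              // get_global_size(0)

    for (int i = gid; i < n; i += stride)             // grid-stride loop: correct for any grid size
        y[i] = a * x[i] + y[i];
}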