Open · ytgui opened 4 years ago
Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path.
If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path.
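As a rough illustration of the divergence described above, here is a minimal sketch that assumes 32-lane warps and CUDA 9 or later (for `__ballot_sync`). The kernel names, the helper `runDivergenceDemo`, and the even/odd split are made up for this example. For a branch this small the compiler may use predication instead of a real branch, so treat it as a picture of the idea rather than a benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: even and odd lanes take different paths, so each warp
// has to run both paths, with the lanes that are not on a path disabled.
__global__ void divergentBranch(int *out, unsigned *evenMask) {
    int tid  = threadIdx.x;
    int lane = tid % warpSize;
    // Called before the branch, while all 32 lanes are still together:
    // bit i of the result is set if lane i is going to take the "even" path.
    unsigned mask = __ballot_sync(0xffffffffu, lane % 2 == 0);
    if (lane % 2 == 0) {
        out[tid] = 2 * tid;   // path A: odd lanes are disabled here
    } else {
        out[tid] = tid + 1;   // path B: even lanes are disabled here
    }
    if (tid == 0) *evenMask = mask;
}

// Hypothetical kernel: the condition is the same for every lane of a warp,
// so each warp follows exactly one path and no lanes are disabled.
__global__ void warpUniformBranch(int *out) {
    int tid  = threadIdx.x;
    int warp = tid / warpSize;
    if (warp % 2 == 0) {
        out[tid] = 2 * tid;
    } else {
        out[tid] = tid + 1;
    }
}

// Launches both kernels on one block of 64 threads (two warps) and returns
// the ballot mask recorded by warp 0 of the divergent kernel.
unsigned runDivergenceDemo() {
    const int threads = 64;
    int *out = nullptr;
    unsigned *evenMask = nullptr;
    cudaMalloc(&out, threads * sizeof(int));
    cudaMalloc(&evenMask, sizeof(unsigned));

    divergentBranch<<<1, threads>>>(out, evenMask);
    warpUniformBranch<<<1, threads>>>(out);

    unsigned hostMask = 0;
    // cudaMemcpy on the default stream waits for the kernels to finish.
    cudaMemcpy(&hostMask, evenMask, sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(out);
    cudaFree(evenMask);
    return hostMask;
}

int main() {
    // With 32-lane warps, lanes 0, 2, 4, ... set their bits: 0x55555555.
    printf("lanes on the even path: 0x%08x\n", runDivergenceDemo());
    return 0;
}
```

The second kernel branches on the warp index rather than the lane index, which is one common way to keep all 32 threads of a warp on the same execution path.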
Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp.
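To make the lock scenario concrete, here is a hedged sketch of 32 threads of a single warp contending for one spin lock. The names `lockMutex`, `unlockMutex`, `lockedIncrement`, and `runLockedIncrement` are invented for this example. It assumes a device of compute capability 7.0 or higher and a build that targets it (for example `nvcc -arch=sm_70`). On older GPUs this pattern is the kind that can hang, so the host code skips the launch there.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Spin until the mutex changes from 0 (free) to 1 (held).
__device__ void lockMutex(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0) { }
    __threadfence();  // keep critical-section accesses after the acquire
}

__device__ void unlockMutex(int *mutex) {
    __threadfence();  // publish critical-section writes before the release
    atomicExch(mutex, 0);
}

// Every thread takes the same lock, including threads of the same warp.
// This relies on the holder being able to make progress while its
// warp-mates are still spinning.
__global__ void lockedIncrement(int *mutex, volatile int *counter) {
    lockMutex(mutex);
    *counter = *counter + 1;  // critical section
    unlockMutex(mutex);
}

// Returns the final counter value, or -1 if the device is pre-Volta
// (or cannot be queried), in which case the kernel is not launched.
int runLockedIncrement(int threads) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || prop.major < 7) {
        return -1;
    }
    int *mutex = nullptr;
    int *counter = nullptr;
    cudaMalloc(&mutex, sizeof(int));
    cudaMalloc(&counter, sizeof(int));
    cudaMemset(mutex, 0, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));

    lockedIncrement<<<1, threads>>>(mutex, counter);

    int result = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(mutex);
    cudaFree(counter);
    return result;
}

int main() {
    const int threads = 32;  // one full warp contending for one lock
    int result = runLockedIncrement(threads);
    if (result < 0) {
        printf("skipped: needs compute capability 7.0 or higher\n");
    } else {
        printf("counter = %d (expected %d)\n", result, threads);
    }
    return 0;
}
```

A single global lock serializes all 32 threads, so this only demonstrates forward progress. It is not a pattern to use for performance.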
https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)
https://stackoverflow.com/questions/3519598/streaming-multiprocessors-blocks-and-threads-cuda
Number of instruction dispatch units
https://docs.nvidia.com/pdf/CUDA_C_Programming_Guide.pdf