ysh329 closed this 7 years ago
A simple analysis shows that the block partitioning has to make full use of the cache. Beyond that, the concrete block sizes can be found by direct search. Because the inner kernel works for any for-loop length, the lengths of the outer for-loops (i.e. the block sizes) are free parameters that can be chosen arbitrarily.
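To make the point about free outer-loop lengths concrete, here is a minimal sketch in C, assuming column-major storage and double precision. It is not the 4x4_11 code itself: the names `gemm_blocked`, `micro_kernel` and `search_block_sizes`, the two block parameters `mc`/`kc`, and the candidate lists are all hypothetical. The 4x4 micro-kernel accepts any `kc`, so `mc` and `kc` only change how the outer loops tile the problem, and a brute-force search over candidate values (timed crudely with `clock()`) can pick them.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Computes an mr x nr (mr, nr <= 4) block of C += A * B over kc terms.
   Column-major: A(i,p) = A[p*lda + i], B(p,j) = B[j*ldb + p]. */
static void micro_kernel(int mr, int nr, int kc,
                         const double *A, int lda,
                         const double *B, int ldb,
                         double *C, int ldc)
{
    for (int j = 0; j < nr; ++j)
        for (int i = 0; i < mr; ++i) {
            double sum = 0.0;
            for (int p = 0; p < kc; ++p)
                sum += A[p * lda + i] * B[j * ldb + p];
            C[j * ldc + i] += sum;
        }
}

/* C += A * B with hypothetical block sizes mc (rows) and kc (depth).
   Any positive mc, kc gives the same result; only cache behaviour changes. */
void gemm_blocked(int m, int n, int k,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc,
                  int mc, int kc)
{
    if (mc <= 0 || kc <= 0)
        return; /* block sizes must be positive */
    for (int p = 0; p < k; p += kc) {
        int pb = (k - p < kc) ? k - p : kc;          /* depth of this panel */
        for (int i = 0; i < m; i += mc) {
            int ib = (m - i < mc) ? m - i : mc;      /* rows of this block */
            for (int j = 0; j < n; j += 4) {
                int nr = (n - j < 4) ? n - j : 4;    /* edge columns */
                for (int ii = 0; ii < ib; ii += 4) {
                    int mr = (ib - ii < 4) ? ib - ii : 4; /* edge rows */
                    micro_kernel(mr, nr, pb,
                                 &A[p * lda + (i + ii)], lda,
                                 &B[j * ldb + p], ldb,
                                 &C[j * ldc + (i + ii)], ldc);
                }
            }
        }
    }
}

/* CPU time of one configuration (crude: single run, clock() resolution). */
static double time_config(int m, int n, int k, const double *A,
                          const double *B, double *C, int mc, int kc)
{
    memset(C, 0, sizeof(double) * (size_t)m * (size_t)n);
    clock_t t0 = clock();
    gemm_blocked(m, n, k, A, m, B, k, C, m, mc, kc);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Exhaustive search over candidate (mc, kc) pairs; keeps the fastest. */
void search_block_sizes(int m, int n, int k, const double *A,
                        const double *B, double *C,
                        const int *mcs, int nmc, const int *kcs, int nkc,
                        int *best_mc, int *best_kc)
{
    double best = -1.0;
    for (int a = 0; a < nmc; ++a)
        for (int b = 0; b < nkc; ++b) {
            double t = time_config(m, n, k, A, B, C, mcs[a], kcs[b]);
            if (best < 0.0 || t < best) {
                best = t;
                *best_mc = mcs[a];
                *best_kc = kcs[b];
            }
        }
}
```

In practice the candidates would be chosen so that an `mc x kc` block of A fits in L2 and a `kc x 4` panel of B fits in L1, and each configuration would be timed over several runs on a realistically sized problem; a single `clock()` measurement is only a starting point.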
Below is a piece of the CPU GEMM (4x4_11) code.