[Open] zhangjun opened this issue 2 years ago
cutlass::conv::Conv2dProblemSize problem_size(
    {1, 23, 56, 98},   // input size  (NHWC)
    {128, 3, 3, 98},   // filter size (KRSC)
    {4, 0, 5, 0},      // padding     (pad_h, _, pad_w, _)
    {3, 3},            // stride      (stride_h, stride_w)
    {1, 1}             // dilation    (dilation_h, dilation_w)
);
There is also an overload that takes the output size directly instead of deriving it:

Conv2dProblemSize(
    cutlass::Tensor4DCoord input_size,   // NHWC
    cutlass::Tensor4DCoord filter_size,  // KRSC
    cutlass::Tensor4DCoord output_size,  // NPQK
    cutlass::conv::Mode mode = cutlass::conv::Mode::kCrossCorrelation,
    int split_k_slices = 1,
    int groups = 1
);
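For the problem size in the question, that overload could be used like this (a sketch, not from the original post; the NPQK extents {1, 10, 22, 128} are what the padding/stride/dilation overload computes for these inputs, as shown by the formulas quoted from the CUTLASS source):

```cpp
// Sketch: same problem, but with the caller supplying the output size.
// K = 128 output channels; P = 10 and Q = 22 follow from the P/Q formulas.
cutlass::conv::Conv2dProblemSize problem_size(
    {1, 23, 56, 98},    // input size  (NHWC)
    {128, 3, 3, 98},    // filter size (KRSC)
    {1, 10, 22, 128},   // output size (NPQK)
    cutlass::conv::Mode::kCrossCorrelation);
```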
When the output size is not supplied, the output extents P and Q are computed from the input size, padding, stride, and dilation (this is the code from the CUTLASS source; note the integer division):

// set output P and Q
P = ((H + pad_h * 2 - R * dilation_h) / stride_h) + 1;
Q = ((W + pad_w * 2 - S * dilation_w) / stride_w) + 1;
Related resources:

- Optimizing Applications for NVIDIA Ampere GPU Architecture
- Inside the NVIDIA Ampere Architecture
- Accelerating Sparsity in the NVIDIA Ampere Architecture
- CUTLASS: CUDA Template Library for Dense Linear Algebra at All Levels and Scales (GTC 2018)
- Programming Tensor Cores: Native Volta Tensor Cores with CUTLASS (GTC 2019)
- Tensor Core Performance on NVIDIA GPUs: The Ultimate Guide
- Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100 (GTC 2020)
- Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS (GTC 2022)
- Use CUTLASS to Fuse Multiple GEMMs to Extreme Performance (GTC 2022, in Chinese)
- Auto48: A General Framework for Automatic Model Compression and Acceleration using Int4/Int8 Mixed Precision (GTC 2022)
- Large Models are not Always Expensive: Large Scale Mixture of Expert Models with Efficient Inference Empowers Microsoft Translator with Best Models (GTC 2022)
- https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
- https://vccvisualization.org/CS380_GPU_and_GPGPU_Programming/