[Open] zhangjun opened this issue 2 years ago
cutlass::conv::Conv2dProblemSize problem_size(
    {1, 23, 56, 98},   // input size  (NHWC)
    {128, 3, 3, 98},   // filter size (KRSC)
    {4, 0, 5, 0},      // padding     (pad_h, _, pad_w, _)
    {3, 3},            // stride      (stride_h, stride_w)
    {1, 1}             // dilation    (dilation_h, dilation_w)
);
There is also an overload that takes the output size directly instead of deriving it:

Conv2dProblemSize(
    cutlass::Tensor4DCoord input_size,   // NHWC
    cutlass::Tensor4DCoord filter_size,  // KRSC
    cutlass::Tensor4DCoord output_size,  // NPQK
    cutlass::conv::Mode mode = cutlass::conv::Mode::kCrossCorrelation,
    int split_k_slices = 1,
    int groups = 1
);
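For the problem size in the question, that overload could be used like this (a sketch, not from the original post; the NPQK extents {1, 10, 22, 128} are what the padding/stride/dilation overload computes for these inputs, as shown by the formulas quoted from the CUTLASS source):

```cpp
// Sketch: same problem, but with the caller supplying the output size.
// K = 128 output channels; P = 10 and Q = 22 follow from the P/Q formulas.
cutlass::conv::Conv2dProblemSize problem_size(
    {1, 23, 56, 98},    // input size  (NHWC)
    {128, 3, 3, 98},    // filter size (KRSC)
    {1, 10, 22, 128},   // output size (NPQK)
    cutlass::conv::Mode::kCrossCorrelation);
```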
When the output size is not supplied, the output extents P and Q are computed from the input size, padding, stride, and dilation (this is the code from the CUTLASS source; note the integer division):

// set output P and Q
P = ((H + pad_h * 2 - R * dilation_h) / stride_h) + 1;
Q = ((W + pad_w * 2 - S * dilation_w) / stride_w) + 1;
Related resources:

- Optimizing Applications for NVIDIA Ampere GPU Architecture
- Inside the NVIDIA Ampere Architecture
- Accelerating Sparsity in the NVIDIA Ampere Architecture
- CUTLASS: CUDA Template Library for Dense Linear Algebra at All Levels and Scales (GTC 2018)
- Programming Tensor Cores: Native Volta Tensor Cores with CUTLASS (GTC 2019)
- Tensor Core Performance on NVIDIA GPUs: The Ultimate Guide
- Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100 (GTC 2020)
- Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS (GTC 2022)
- Use CUTLASS to Fuse Multiple GEMMs to Extreme Performance (GTC 2022, in Chinese)
- Auto48: A General Framework for Automatic Model Compression and Acceleration using Int4/Int8 Mixed Precision (GTC 2022)
- Large Models are not Always Expensive: Large Scale Mixture of Expert Models with Efficient Inference Empowers Microsoft Translator with Best Models (GTC 2022)
- https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
- https://vccvisualization.org/CS380_GPU_and_GPGPU_Programming/