
armv8.2 cpu conv #110

Open ysh329 opened 4 years ago

ysh329 commented 4 years ago

/source/backend/arm82

Version: 0df31a8667bdfdbdea084eef43b6812897e75db9, release 1.0.0. Date: Thu May 7 18:19:02 2020

Arm82Convolution.cpp
Arm82Convolution.hpp
Arm82Convolution3x3.cpp
Arm82Convolution3x3.hpp
Arm82ConvolutionDepthwise.cpp
Arm82ConvolutionDepthwise.hpp

Arm82Convolution.hpp

Arm82Convolution.cpp and its .hpp contain the registration and the methods of Conv. Below is a brief excerpt of the main methods, with the implementations omitted:

class Arm82Convolution : public Execution {
public:
    Arm82Convolution(const MNN::Convolution2D *convParam, Backend *bn);
    virtual ~Arm82Convolution();
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;

private:
    // plane tile number
    int mTileCount;
    int mThreadNums;
    bool mRelu;
    bool mRelu6;
    CPUConvolution::Im2ColParameter mIm2ColParamter;
    std::shared_ptr<Tensor> mWeightFp16;
    std::shared_ptr<Tensor> mBiasFp16;

    Tensor mIm2ColBuffer;
    Tensor mRemainBuffer;
    const Convolution2DCommon *mCommon;
};

Arm82Convolution.cpp

class Arm82ConvolutionCreator : public Arm82Backend::Arm82Creator {
    virtual Execution *onCreate(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs,
                                const MNN::Op *op, Backend *backend) const override {
        auto convParam = op->main_as_Convolution2D();
        // avoid other quantize method entry this creator
        if(convParam->quanParameter() && convParam->quanParameter()->type() != 3){
            return nullptr;
        }

#ifdef __aarch64__
        const auto param = convParam->common();
        if (param->kernelX() == 3 && param->kernelY() == 3 && param->strideX() == 1 && param->strideY() == 1 &&
            param->dilateX() == 1 && param->dilateY() == 1) {
            return new Arm82Convolution3x3(convParam, backend);
        }
#endif
        return new Arm82Convolution(convParam, backend);
    }
};

REGISTER_ARM82_OP_CREATOR(OpType_Convolution, Arm82ConvolutionCreator);
ysh329 commented 4 years ago

The two convolution Ops on armv8.2

  1. OpType_Convolution is registered in Arm82Convolution.cpp
  2. OpType_ConvolutionDepthwise is registered in Arm82ConvolutionDepthwise.cpp
REGISTER_ARM82_OP_CREATOR(OpType_Convolution, Arm82ConvolutionCreator);
REGISTER_ARM82_OP_CREATOR(OpType_ConvolutionDepthwise, Arm82ConvolutionDepthwiseCreator);

OpType_Convolution

The armv8.2 CPU backend implements OpType_Convolution in two ways:

  1. Arm82Convolution3x3: taken when the kernel is 3x3, the stride is 1, and the dilation is 1; the class is implemented in Arm82Convolution3x3.cpp
  2. Arm82Convolution: every other case takes the default convolution implementation, i.e. im2col + GEMM, which is what class Arm82Convolution in Arm82Convolution.cpp implements (a sketch of the idea follows below).
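To make the im2col + GEMM route concrete, here is a minimal hypothetical sketch (single batch, stride 1, no padding, oh = ih - kh + 1; all names are invented for illustration — the real Arm82Convolution additionally tiles the plane, packs data into a C8 FP16 layout, and runs the GEMM in assembly):

#include <vector>

// im2col: expand every kh*kw input patch into a column.
// Result layout: [ic*kh*kw] rows x [oh*ow] columns.
std::vector<float> im2col(const float* src, int ic, int ih, int iw,
                          int kh, int kw, int oh, int ow) {
    std::vector<float> col((size_t)ic * kh * kw * oh * ow);
    for (int c = 0; c < ic; ++c)
        for (int ky = 0; ky < kh; ++ky)
            for (int kx = 0; kx < kw; ++kx) {
                const int row = (c * kh + ky) * kw + kx;
                for (int oy = 0; oy < oh; ++oy)
                    for (int ox = 0; ox < ow; ++ox)
                        col[(size_t)row * oh * ow + oy * ow + ox] =
                            src[(c * ih + oy + ky) * iw + (ox + kx)];
            }
    return col;
}

// Naive GEMM: dst[oc x n] = weight[oc x k] * col[k x n],
// where k = ic*kh*kw and n = oh*ow, so convolution becomes one matrix multiply.
void gemm(const float* weight, const float* col, float* dst,
          int oc, int k, int n) {
    for (int m = 0; m < oc; ++m)
        for (int j = 0; j < n; ++j) {
            float acc = 0.f;
            for (int p = 0; p < k; ++p)
                acc += weight[m * k + p] * col[p * n + j];
            dst[m * n + j] = acc;
        }
}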

Arm82Convolution3x3

The implementation, located in Arm82Convolution3x3.cpp, is a Winograd convolution (an illustrative sketch follows the list below).

  1. The input/filter/output transforms are implemented with NEON intrinsics (kernelTransform_wino_4x4_3x3 / sourceTransform_wino_4x4_3x3 / dstTransform_wino_4x4_3x3);
  2. The compute kernels have corresponding assembly code in source/backend/arm82/asm/arm64/, including but not limited to:
    1. MNNShuffleChannelC8.S;
    2. MNNGemmFP16C8_UNIT.S.
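As an illustration of what the Winograd transforms buy, here is the much simpler 1D F(2,3) case (a hypothetical sketch; Arm82Convolution3x3 uses the 2D F(4x4, 3x3) variant with larger transform matrices, but the structure — input/filter transform, element-wise multiply, output transform — is the same). Two outputs of a 3-tap filter cost 4 multiplications instead of 6:

// Winograd F(2,3): y = A^T [ (G g) . (B^T d) ].
void winograd_f2_3(const float d[4], const float g[3], float y[2]) {
    // Filter transform G*g (in practice precomputed once per kernel).
    const float u0 = g[0];
    const float u1 = 0.5f * (g[0] + g[1] + g[2]);
    const float u2 = 0.5f * (g[0] - g[1] + g[2]);
    const float u3 = g[2];
    // Input transform B^T*d.
    const float v0 = d[0] - d[2];
    const float v1 = d[1] + d[2];
    const float v2 = d[2] - d[1];
    const float v3 = d[1] - d[3];
    // Element-wise products: 4 multiplications total.
    const float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
    // Output transform A^T*m.
    y[0] = m0 + m1 + m2; // == d0*g0 + d1*g1 + d2*g2
    y[1] = m1 - m2 - m3; // == d1*g0 + d2*g1 + d3*g2
}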
ysh329 commented 4 years ago

Every kernel implementation in MNN has the same four basic methods, e.g. class Arm82Convolution3x3 : public Execution:

  1. Arm82Convolution3x3, the constructor;
  2. ~Arm82Convolution3x3, the destructor;
  3. onResize, which handles shape changes and the related setup;
  4. onExecute, which performs the actual computation.
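A minimal skeleton of this pattern (the signatures are copied from the Arm82Convolution excerpt above; MyKernel and the comments are illustrative, not MNN source):

class MyKernel : public Execution {
public:
    MyKernel(const MNN::Op *op, Backend *bn) : Execution(bn) {
        // Constructor: parse the op's parameters, convert/pack weights once.
    }
    virtual ~MyKernel() {
        // Destructor: release buffers acquired in the constructor.
    }
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs,
                               const std::vector<Tensor *> &outputs) override {
        // Shapes are known here: recompute tiling/thread split,
        // allocate scratch buffers.
        return NO_ERROR;
    }
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs,
                                const std::vector<Tensor *> &outputs) override {
        // Hot path: only the computation itself, no allocation.
        return NO_ERROR;
    }
};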
ysh329 commented 4 years ago

OpType_ConvolutionDepthwise

Arm82ConvolutionDepthwise

Arm82ConvolutionDepthwise.cpp

During construction, Arm82ConvolutionDepthwise calls a method MNNQuantizeFP16 that converts the weights to FP16. It comes in several implementations, including an assembly one in asm/arm64/MNNQuantizeFP16_UNIT4.S.
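A guess at what such an FP32 to FP16 weight conversion does, sketched with NEON intrinsics (not MNN's actual code — that lives in the assembly file above):

#include <arm_neon.h>

void fp32_to_fp16(const float* src, float16_t* dst, int count) {
    int i = 0;
    for (; i + 4 <= count; i += 4) {
        float32x4_t v = vld1q_f32(src + i);  // load 4 FP32 lanes
        vst1_f16(dst + i, vcvt_f16_f32(v));  // narrow to 4 FP16 lanes
    }
    for (; i < count; ++i)                   // scalar tail
        dst[i] = (float16_t)src[i];
}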

onResize defines and implements, but does not itself call, two methods, runBasic and mThreadFunction, which respectively call the following:

  1. MNNLineDepthWiseFp16C8Unit: besides the assembly implementation in asm/arm64/MNNLineDepthWiseFp16C8Unit.S, there is also a pure C++ implementation (used when NEON is not enabled, i.e. #ifndef MNN_USE_NEON, presumably for verification);
  2. MNNDepthWiseFp16C8Unit: only one implementation, a mix of C++ and intrinsics, e.g. the FP16 intrinsics vld1q_f16 and vfmaq_f16 (a hypothetical flavor of such a loop follows).
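A hypothetical sketch of such a C8 FP16 inner loop, built from the two intrinsics named above plus loads/stores (requires __ARM_FEATURE_FP16_VECTOR_ARITHMETIC; not MNN's actual kernel):

#include <arm_neon.h>

// One output point of a depthwise conv, 8 channels packed together (C8).
void dw_fp16_c8_point(const float16_t* src, const float16_t* weight,
                      float16_t* dst, int kh, int kw, int srcRowStep) {
    float16x8_t acc = vdupq_n_f16(0);
    for (int ky = 0; ky < kh; ++ky)
        for (int kx = 0; kx < kw; ++kx) {
            float16x8_t s = vld1q_f16(src + ky * srcRowStep + kx * 8);
            float16x8_t w = vld1q_f16(weight + (ky * kw + kx) * 8);
            acc = vfmaq_f16(acc, s, w); // fused multiply-add over 8 fp16 lanes
        }
    vst1q_f16(dst, acc);
}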

The computation in onExecute calls mThreadFunction, which in turn calls runBasic to do the actual work.


The main advantages of Armv8.2, the FP16 extensions and the Dot Product instructions, can be used to accelerate floating-point computation and quantized computation respectively.

MNN appears to have written assembly implementations targeting these, namely MNNLineDepthWiseFp16C8Unit and asm/arm64/MNNQuantizeFP16_UNIT4.S.
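For the Dot Product side, a generic illustration (not MNN code) of how the armv8.2 SDOT instruction is reached from the vdotq_s32 intrinsic; compile with -march=armv8.2-a+dotprod:

#include <arm_neon.h>

#if defined(__ARM_FEATURE_DOTPROD)
// int8 dot product of two n-element vectors, n a multiple of 16.
int32_t dot_s8(const int8_t* a, const int8_t* b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        acc = vdotq_s32(acc, va, vb); // each lane accumulates a 4x int8 dot
    }
    return vaddvq_s32(acc); // horizontal sum across the 4 lanes
}
#endif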

ysh329 commented 4 years ago

ArmComputeLibrary

ARM CPU FP16 Targets · Issue #704 · ARM-software/ComputeLibrary https://github.com/ARM-software/ComputeLibrary/issues/704

The armv8.2 CPU FP16 convolution in ComputeLibrary is NEConvolutionLayer; the declaration is in arm_compute/runtime/NEON/functions/NEConvolutionLayer.h.

See the excerpt of the comment describing the input parameter of one of its methods:

    /** Set the input and output tensors.
     *
     * @param[in]  input            Source tensor. 3 lower dimensions represent a single input [width, height, IFM],
     *                              while every optional dimension from 4 and above represent a batch of inputs.
     *                              Data types supported: QASYMM8/QASYMM8_SIGNED/F16/F32.

The implementation is in src/runtime/NEON/functions/NEConvolutionLayer.cpp.

The build script SConstruct also gives a hint:

# Add architecture specific flags
prefix = ""
if 'v7a' in env['arch']:
    env.Append(CXXFLAGS = ['-march=armv7-a', '-mthumb', '-mfpu=neon'])
    if env['os'] == 'android':
        env.Append(CXXFLAGS = ['-mfloat-abi=softfp'])
    else:
        env.Append(CXXFLAGS = ['-mfloat-abi=hard'])
elif 'v8' in env['arch']:
    if 'sve' in env['arch']:
        env.Append(CXXFLAGS = ['-march=armv8.2-a+sve+fp16+dotprod'])
    elif 'v8.2-a' in env['arch']:
        env.Append(CXXFLAGS = ['-march=armv8.2-a+fp16']) # explicitly enable fp16 extension otherwise __ARM_FEATURE_FP16_VECTOR_ARITHMETIC is undefined
    else:

depthwise

src/runtime/NEON/functions/assembly/NEDepthwiseConvolutionAssemblyDispatch.cpp:

#ifdef __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
std::unique_ptr<depthwise::IDepthwiseConvolution> get_fp16_convolver(int kernel_size, int stride_x,
                                                                     int n_batches, int in_rows, int in_cols, int n_channels,
                                                                     int dilation_factor, neon_convolution_kernels::ActivationFunction activation,
                                                                     int padding_top, int padding_left, int padding_bottom, int padding_right)
{
    switch(kernel_size)
    {
        case 3:
        {
            switch(stride_x)
            {
                case 1:
                    return arm_compute::support::cpp14::make_unique<depthwise::DilatedDepthwiseConvolution<3, 3, 3, 3, 1, 1, float16_t, float16_t, float16_t>>(
                               n_batches, in_rows, in_cols, n_channels, dilation_factor, activation, padding_top, padding_left, padding_bottom, padding_right);
                case 2:
                    return arm_compute::support::cpp14::make_unique<depthwise::DilatedDepthwiseConvolution<3, 3, 3, 3, 2, 2, float16_t, float16_t, float16_t>>(
                               n_batches, in_rows, in_cols, n_channels, dilation_factor, activation, padding_top, padding_left, padding_bottom, padding_right);
                default:
                    return nullptr;
            }
        }
        case 5:
        {
            switch(stride_x)
            {
                case 1:
                    return arm_compute::support::cpp14::make_unique<depthwise::DilatedDepthwiseConvolution<3, 3, 5, 5, 1, 1, float16_t, float16_t, float16_t>>(
                               n_batches, in_rows, in_cols, n_channels, dilation_factor, activation, padding_top, padding_left, padding_bottom, padding_right);
                case 2:
                    return arm_compute::support::cpp14::make_unique<depthwise::DilatedDepthwiseConvolution<3, 3, 5, 5, 2, 2, float16_t, float16_t, float16_t>>(
                               n_batches, in_rows, in_cols, n_channels, dilation_factor, activation, padding_top, padding_left, padding_bottom, padding_right);
                default:
                    return nullptr;
            }
        }
        default:
            return nullptr;
    }
}
#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
ysh329 commented 4 years ago

armv8, ACL and MNN

  1. the conv1x1 approach, and GEMM performance;
  2. the dw3x3 approach, and its performance

MNN

conv1x1

ACL

Happened to notice that the file src/runtime/NEON/functions/NEConvolution.cpp includes a header named arm_compute/core/NEON/kernels/NEConvolutionKernel.h.