zuoqing1988 / ZQCNN-MTCNN-vs-libfacedetection

Comparing ZQCNN-MTCNN with libfacedetection

Question about performance on ARM #1

Open pxlong opened 5 years ago

pxlong commented 5 years ago

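Hello, I tested this model on both a PC and an ARM Cortex-A9 (32-bit). The results on the PC are great, but for some reason the results on ARM are poor. Is this a build problem, or does the network need special configuration on ARM?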

The ARM cross-compilation command I used is:

$ cmake .. -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_TOOLCHAIN_FILE=../Toolchain-arm-linux-gnueabihf.cmake -DSIMD_ARCH_TYPE=arm

The output on ARM is as follows (the input image is 4.jpg from images; no face was found):

$ ./SampleMTCNN 4.jpg 
rnet = 0.5 M, onet = 2.1 M
convert cost: 32.968 ms
Pnet [0]: resolution [640x480], resize:0.002 ms, cost:433.564 ms
Pnet [1]: resolution [454x341], resize:45.522 ms, cost:219.581 ms
Pnet [2]: resolution [322x242], resize:23.582 ms, cost:106.983 ms
Pnet [3]: resolution [229x172], resize:14.352 ms, cost:52.163 ms
Pnet [4]: resolution [162x122], resize:8.938 ms, cost:24.974 ms
Pnet [5]: resolution [115x86], resize:4.032 ms, cost:11.189 ms
Pnet [6]: resolution [82x61], resize:3.582 ms, cost:5.784 ms
Pnet [7]: resolution [58x44], resize:1.166 ms, cost:2.544 ms
Pnet [8]: resolution [41x31], resize:0.615 ms, cost:1.205 ms
Pnet [9]: resolution [29x22], resize:0.294 ms, cost:0.560 ms
Pnet [10]: resolution [27x20], resize:0.239 ms, cost:0.457 ms
nms cost: 0.494 ms, (20-->2)
nms cost: 0.254 ms, (17-->0)
nms cost: 0.197 ms, (35-->5)
nms cost: 0.190 ms, (45-->5)
nms cost: 0.123 ms, (37-->4)
nms cost: 0.059 ms, (16-->4)
nms cost: 0.015 ms, (2-->1)
nms cost: 0.004 ms, (0-->0)
nms cost: 0.003 ms, (0-->0)
nms cost: 0.002 ms, (0-->0)
nms cost: 0.011 ms, (2-->1)
nms cost: 0.060 ms
first stage candidate count: 22
stage 1: cost 965.940 ms
run Rnet [12] times, candidate after nms: 0 
stage 2: cost 45.866 ms
run Onet [0] times, candidate before nms: 0 
stage 3: cost 0.013 ms
final found num: 0
total cost: 1044.997 ms (P: 998.948 ms, R: 45.991 ms, O: 0.058 ms)
total 1.045 s / 1 = 1045.050 ms
num face: 0

pxlong commented 5 years ago

The compile configuration file is as follows:

#ifndef _ZQ_CNN_COMPILE_CONFIG_H_
#define _ZQ_CNN_COMPILE_CONFIG_H_
#include <stdlib.h>
#include <stdio.h>
#include <malloc.h>
#include <string.h>

#define ZQ_CNN_SSETYPE_NONE 0
#define ZQ_CNN_SSETYPE_SSE 1
#define ZQ_CNN_SSETYPE_AVX 2
#define ZQ_CNN_SSETYPE_AVX2 3

#if defined(_WIN32)

#define ZQ_DECLSPEC_ALIGN32 __declspec(align(32))
#define ZQ_DECLSPEC_ALIGN16 __declspec(align(16))

// your settings
#define ZQ_CNN_USE_SSETYPE ZQ_CNN_SSETYPE_AVX2
#define ZQ_CNN_USE_BLAS_GEMM 1 // if you want to use openblas, set to 1
#if ZQ_CNN_USE_BLAS_GEMM == 0
#define ZQ_CNN_USE_MKL_GEMM 1
#endif
#if (ZQ_CNN_USE_BLAS_GEMM == 0 && ZQ_CNN_USE_MKL_GEMM == 0)
#define ZQ_CNN_USE_ZQ_GEMM 1
#endif

#if ZQ_CNN_USE_SSETYPE >= ZQ_CNN_SSETYPE_AVX2
#define ZQ_CNN_USE_FMADD128 1 
#define ZQ_CNN_USE_FMADD256 1 
#else
#define ZQ_CNN_USE_FMADD128 0
#define ZQ_CNN_USE_FMADD256 0 
#endif

/**   for linux system      **/
#else //#if !defined(_WIN32)

#define ZQ_DECLSPEC_ALIGN32 __attribute__((aligned(32)))
#define ZQ_DECLSPEC_ALIGN16 __attribute__((aligned(16)))

#if defined(ZQ_CNN_USE_ARM_NEON)
#define __ARM_NEON 1
#else
#define __ARM_NEON 0
#endif

#if defined(ZQ_CNN_USE_ARM_NEON_ARMV8)
#define __ARM_NEON_ARMV8 1
#else
#define __ARM_NEON_ARMV8 0
#endif

#if defined(ZQ_CNN_USE_ARM_NEON_FP16)
#define __ARM_NEON_FP16 1
#else
#define __ARM_NEON_FP16 0
#endif

#if __ARM_NEON
//#define ZQ_CNN_USE_FMADD128 1
#define ZQ_CNN_USE_SSETYPE ZQ_CNN_SSETYPE_NONE
//#if defined(ZQ_CNN_USE_BOTH_BLAS_ZQ_GEMM)
#define ZQ_CNN_USE_ZQ_GEMM 1
//#define ZQ_CNN_USE_BLAS_GEMM 1
//#endif
#else
// your settings
#define ZQ_CNN_USE_SSETYPE ZQ_CNN_SSETYPE_AVX
#define ZQ_CNN_USE_BLAS_GEMM 0 // if you want to use openblas, set to 1
#if ZQ_CNN_USE_BLAS_GEMM == 0
#define ZQ_CNN_USE_MKL_GEMM 0
#endif
#if (ZQ_CNN_USE_BLAS_GEMM == 0 && ZQ_CNN_USE_MKL_GEMM == 0)
#define ZQ_CNN_USE_ZQ_GEMM 1
#endif

#if ZQ_CNN_USE_SSETYPE >= ZQ_CNN_SSETYPE_AVX2
#define ZQ_CNN_USE_FMADD128 1 
#define ZQ_CNN_USE_FMADD256 1 
#else
#define ZQ_CNN_USE_FMADD128 0
#define ZQ_CNN_USE_FMADD256 0 
#endif

#endif //__ARM_NEON

#ifndef __int64 
#define __int64 long long
#endif

#ifndef __min
#define __min(a,b) ((a)<(b)?(a):(b))
#endif

#ifndef __max
#define __max(a,b) ((a)>(b)?(a):(b))
#endif

#ifndef _aligned_malloc
#define _aligned_malloc(x,y) memalign(y,x)
#endif

#ifndef _aligned_free
#define _aligned_free free
#endif

#ifndef fread_s
#define fread_s(a,b,c,d,e) fread(a,c,d,e)
#endif

#endif// defined(WIN32) || defined(_WINDOWS_)

#endif// _ZQ_CNN_COMPILE_CONFIG_H_

zuoqing1988 commented 5 years ago

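1. I have successfully run the 32-bit ARM build on a hi6620oem chip (reportedly a Kirin 910).
2. I always build for ARM natively on the ARM device itself. For 32-bit, use the following command; no part of the code needs to be changed.
   cmake .. -DSIMD_ARCH_TYPE=arm -DBLAS_TYPE=openblas
3. ZQCNN/3rdparty/lib/libopenblas.a is a 32-bit OpenBLAS binary downloaded from elsewhere, so it is doubtful whether it can reach peak performance on your machine.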