yuenshome / yuenshome.github.io


MACE's tuning strategy for models on OpenCL #107

Open ysh329 opened 4 years ago

ysh329 commented 4 years ago

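Overview: MACE is Xiaomi's on-device deep learning inference framework. It supports ARM CPUs, Adreno GPUs, and Mali GPUs; its most distinctive features are its GPU (OpenCL) backend and its support for Qualcomm DSPs. The recently published release v0.13.0 keeps pace with TFLite and also quietly adds TinyML support. Here is an excerpt from the release notes: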

Mace adds micro-controller support to fully support ultra-low-power inference scenarios of mobile phones and IoT devices. Mace's micro-controller engine does not rely on any OS, heap memory allocation, C++ library or other third-party libraries except the math library.

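That was a bit of a digression. There is a lot to learn from the MACE documentation. The chapters "Advanced usage for Bazel users" and "Advanced usage for CMake users" each contain a subsection titled "Tuning for specific SoC's GPU". Let's see how they do it.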

ysh329 commented 4 years ago

Tuning for specific SoC's GPU

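According to the documentation, tuning for a specific GPU model generally yields a 1~10% performance improvement. Tuning is run with the --tune option. That option appears in the CMake advanced-usage chapter but not in the corresponding Bazel subsection, so the Bazel version of "Tuning for specific SoC's GPU" may be written incorrectly: its command is followed by the validate parameter, which appears only to check the results against TensorFlow for correctness without doing any tuning, and the bazel log shows * Run 'mobilenet_v1' with round=1, restart_round=1, tuning=False

The CMake tuning documentation seems more reliable, so I ran the bazel commands below.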

set -e
# First argument: path to the model deployment YAML file
yml_path=$1
echo "###########################################"
echo "yml_path:${yml_path}"
echo "###########################################"
sleep 1
##############
# bazel convert
##############
bazel clean --expunge
python tools/converter.py convert --config="${yml_path}"
##############
# bazel tune
##############
python tools/converter.py \
    run \
    --config="${yml_path}" \
    --target_socs="msmnile,sdm845,msm8998,msm8953" \
    --vlog_level=0 \
    --round=100

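One more note: during execution, both the bazel and the CMake flows can fail at some command. The run prints each ADB command it executes after a CMD> prompt. Repeating those commands by hand showed that the model's pb and data files were missing on the phone. After pushing them manually one by one with adb push, the rerun worked.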
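
The manual fix above can be scripted. This is a minimal sketch, assuming the converted model sits under build/<model>/model/ on the host and that the device directory is /data/local/tmp/mace_run/<model>; both paths are taken from the run log later in this thread. The helper name build_push_commands is hypothetical, and it only builds the adb command lines so they can be inspected before anything is executed.

```python
import shlex


def build_push_commands(model_name, serial,
                        host_dir_tmpl="build/{model}/model",
                        device_dir_tmpl="/data/local/tmp/mace_run/{model}"):
    """Return adb command lines that push <model>.pb and <model>.data to a device.

    The directory layouts are assumptions copied from the paths in the run log.
    """
    host_dir = host_dir_tmpl.format(model=model_name)
    device_dir = device_dir_tmpl.format(model=model_name)
    # Make sure the target directory exists before pushing.
    commands = [["adb", "-s", serial, "shell", "mkdir", "-p", device_dir]]
    for ext in ("pb", "data"):
        host_file = f"{host_dir}/{model_name}.{ext}"
        commands.append(["adb", "-s", serial, "push", host_file, device_dir])
    return commands


if __name__ == "__main__":
    for cmd in build_push_commands("mobilenet_v1", "CUY0219604010390"):
        # Print only; pass each list to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell quoting problems if a model name or path ever contains spaces.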

=================

When the command finishes, the OpenCL tuning result parameters are written to the corresponding build/mobilenet_v1/opencl directory.

└── mobilenet_v1_tuned_opencl_parameter.MIX2S.sdm845.bin

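The parameter file's name encodes the model under test and the SoC. Loading this file when deploying to production gives you the performance gain from tuning. I will skip the deployment steps (see the documentation); the point here is to see which parameters are tuned and how the 1~10% gain is obtained.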
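
The naming pattern of these files can be made explicit with a small parser. This is a sketch under the assumption that the names always follow <model>_<kind>.<device>.<soc>.bin, which is inferred only from the file names that appear in this thread. The OpenCLArtifact type and parse_artifact_name function are hypothetical helpers, not part of MACE.

```python
from typing import NamedTuple


class OpenCLArtifact(NamedTuple):
    model: str   # e.g. "mobilenet_v1"
    kind: str    # "tuned_opencl_parameter" or "compiled_opencl_kernel"
    device: str  # e.g. "MIX2S"
    soc: str     # e.g. "sdm845"


# The two kinds of OpenCL files seen in the build directory listings.
KINDS = ("tuned_opencl_parameter", "compiled_opencl_kernel")


def parse_artifact_name(filename):
    """Split '<model>_<kind>.<device>.<soc>.bin' into its parts."""
    # Split from the right so that dots inside the model name are preserved.
    stem, device, soc, ext = filename.rsplit(".", 3)
    if ext != "bin":
        raise ValueError(f"unexpected extension in {filename!r}")
    for kind in KINDS:
        suffix = "_" + kind
        if stem.endswith(suffix):
            return OpenCLArtifact(stem[:-len(suffix)], kind, device, soc)
    raise ValueError(f"unknown artifact kind in {filename!r}")


if __name__ == "__main__":
    print(parse_artifact_name("mobilenet_v1_tuned_opencl_parameter.MIX2S.sdm845.bin"))
```

A deployment script could use the parsed soc field to pick the parameter file that matches the device it is running on.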

ysh329 commented 4 years ago

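As before, because of network problems, the code repository was downloaded from the Gitee mirror: MACE: Mobile AI Compute Engine (MACE) is a neural network computing framework optimized for mobile heterogeneous computing platforms https://gitee.com/mirrors/MACE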

ysh329 commented 4 years ago

CMD> bazel version
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
Build label: 0.16.0
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jul 31 17:01:24 2018 (1533056484)
Build timestamp: 1533056484
Build timestamp as int: 1533056484

CMD> bazel build //mace/proto:mace_py
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
Loading:
Loading: 0 packages loaded
Analyzing: target //mace/proto:mace_py (1 packages loaded)
Analyzing: target //mace/proto:mace_py (3 packages loaded)
Analyzing: target //mace/proto:mace_py (6 packages loaded)
Analyzing: target //mace/proto:mace_py (12 packages loaded)
INFO: Analysed target //mace/proto:mace_py (17 packages loaded).
INFO: Found 1 target...
[1 / 12] [-----] BazelWorkspaceStatusAction stable-status.txt
[115 / 127] no action
Target //mace/proto:mace_py up-to-date:
bazel-genfiles/mace/proto/mace_pb2.py
INFO: Elapsed time: 8.321s, Critical Path: 0.35s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Build completed successfully, 1 total action

CMD> cp -f bazel-genfiles/mace/proto/mace_pb2.py /home/yuens/code/mace/tools/python/py_proto

CMD> bazel build //mace/proto:micro_mem_py
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
Loading:
Loading: 0 packages loaded
Analyzing: target //mace/proto:micro_mem_py (1 packages loaded)
Analyzing: target //mace/proto:micro_mem_py (3 packages loaded)
Analyzing: target //mace/proto:micro_mem_py (6 packages loaded)
Analyzing: target //mace/proto:micro_mem_py (12 packages loaded)
INFO: Analysed target //mace/proto:micro_mem_py (17 packages loaded).
INFO: Found 1 target...
[0 / 2] [-----] BazelWorkspaceStatusAction stable-status.txt
[110 / 122] no action
Target //mace/proto:micro_mem_py up-to-date:
bazel-genfiles/mace/proto/micro_mem_pb2.py
INFO: Elapsed time: 8.382s, Critical Path: 0.34s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Build completed successfully, 1 total action

CMD> cp -f bazel-genfiles/mace/proto/micro_mem_pb2.py /home/yuens/code/mace/tools/python/py_proto

CMD> bazel build //third_party/caffe:caffe_py
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
Loading:
Loading: 0 packages loaded
Analyzing: target //third_party/caffe:caffe_py (1 packages loaded)
Analyzing: target //third_party/caffe:caffe_py (3 packages loaded)
Analyzing: target //third_party/caffe:caffe_py (6 packages loaded)

Analyzing: target //third_party/caffe:caffe_py (12 packages loaded)
INFO: Analysed target //third_party/caffe:caffe_py (17 packages loaded).
INFO: Found 1 target...
[0 / 4] [-----] BazelWorkspaceStatusAction stable-status.txt
[109 / 121] no action
Target //third_party/caffe:caffe_py up-to-date:
bazel-genfiles/third_party/caffe/caffe_pb2.py
INFO: Elapsed time: 8.379s, Critical Path: 0.29s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Build completed successfully, 1 total action

CMD> cp -f bazel-genfiles/third_party/caffe/caffe_pb2.py /home/yuens/code/mace/tools/python/py_proto

CMD> adb devices
List of devices attached
CUY0219604010390        device

tools/python/run_model.py:73: Run on devices: ['CUY0219604010390']
tools/python/run_model.py:84: Run model mobilenet_v1
{'output_data_types': [1], 'model_sha256_checksum': '71b10f540ece33c49a7b51f5d4095fc9bd78ce46ebf0300487b2ee23d71294e6', 'input_tensors': ['input'], 'data_type': 3, 'limit_opencl_kernel_time': 0, 'input_data_formats': [<DataFormat.NHWC: 1>], 'output_shapes': [[1, 1001]], 'winograd': 0, 'nnlib_graph_mode': 0, 'input_shapes': [[1, 224, 224, 3]], 'platform': <Platform.TENSORFLOW: 0>, 'obfuscate': 0, 'output_data_formats': [<DataFormat.NHWC: 1>], 'model_file_path': 'https://cnbj1.fds.api.xiaomi.com/mace/miai-models/mobilenet-v1/mobilenet-v1-1.0.pb', 'output_tensors': ['MobilenetV1/Predictions/Reshape_1'], 'input_data_types': [1], 'validation_inputs_data': ['https://cnbj1.fds.api.xiaomi.com/mace/inputs/dog.npy'], 'runtime': <DeviceType.CPU_GPU: 100>, 'input_ranges': [[-1.0, 1.0]]}
CMD> adb -s CUY0219604010390 shell mkdir -p /data/local/tmp/mace_run/mobilenet_v1/interior

WARNING: tools/python/run_model.py:114: No models exist in build/mobilenet_v1/model/mobilenet_v1.pb, use --model_file and --model_data_file specified in args
/home/yuens/code/mace/tools/python/utils/util.py:156: Downloading file https://cnbj1.fds.api.xiaomi.com/mace/inputs/dog.npy to /tmp/tmpqiXokE/mobilenet_v1_input, please wait ...
/home/yuens/code/mace/tools/python/utils/util.py:166: Model downloaded successfully.
CMD> adb -s CUY0219604010390 shell mkdir -p /data/local/tmp/mace_run/mobilenet_v1/validate_in

CMD> adb -s CUY0219604010390 push /tmp/tmpqiXokE/* /data/local/tmp/mace_run/mobilenet_v1/validate_in
CMD> adb -s CUY0219604010390 shell mkdir -p /data/local/tmp/mace_run/mobilenet_v1/validate_out

CMD> adb devices
List of devices attached
CUY0219604010390        device

Run on devices: ['CUY0219604010390']
Install target from build/cmake-build/armeabi-v7a/install/bin/mace_run to /data/local/tmp/mace_run/mobilenet_v1
CMD> adb -s CUY0219604010390 shell mkdir -p /data/local/tmp/mace_run/mobilenet_v1

CMD> adb -s CUY0219604010390 push build/cmake-build/armeabi-v7a/install/bin/mace_run /data/local/tmp/mace_run/mobilenet_v1
CMD> /opt/android-ndk-r15c/ndk-depends build/cmake-build/armeabi-v7a/install/bin/mace_run
WARNING: Could not find library: libc++_shared.so
mace_run
libm.so
liblog.so
libdl.so
libc.so
libc++_shared.so

CMD> adb -s CUY0219604010390 push /opt/android-ndk-r15c/sources/cxx-stl/llvm-libc++/libs/armeabi-v7a/libc++_shared.so /data/local/tmp/mace_run/mobilenet_v1
['', 'MACE_INTERNAL_STORAGE_PATH=/data/local/tmp/mace_run/mobilenet_v1/interior', 'MACE_TUNING=1', 'MACE_RUN_PARAMETER_PATH=/data/local/tmp/mace_run/mobilenet_v1/interior/tune_params', 'LD_LIBRARY_PATH=/data/local/tmp/mace_run/mobilenet_v1', '/data/local/tmp/mace_run/mobilenet_v1/mace_run', '--device=GPU', '--input_data_format=NHWC', '--output_node=MobilenetV1/Predictions/Reshape_1', '--output_shape=1,1001', '--model_data_file=/data/local/tmp/mace_run/mobilenet_v1/mobilenet_v1.data', '--input_node=input', '--model_file=/data/local/tmp/mace_run/mobilenet_v1/mobilenet_v1.pb', '--model_name=mobilenet_v1', '--output_data_format=NHWC', '--input_shape=1,224,224,3', '--input_file=/data/local/tmp/mace_run/mobilenet_v1/validate_in/mobilenet_v1', '--output_file=/data/local/tmp/mace_run/mobilenet_v1/validate_out/mobilenet_v1', '--round=0']

MACE_INTERNAL_STORAGE_PATH=/data/local/tmp/mace_run/mobilenet_v1/interior MACE_TUNING=1 MACE_RUN_PARAMETER_PATH=/data/local/tmp/mace_run/mobilenet_v1/interior/tune_params LD_LIBRARY_PATH=/data/local/tmp/mace_run/mobilenet_v1 /data/local/tmp/mace_run/mobilenet_v1/mace_run --device=GPU --input_data_format=NHWC --output_node=MobilenetV1/Predictions/Reshape_1 --output_shape=1,1001 --model_data_file=/data/local/tmp/mace_run/mobilenet_v1/mobilenet_v1.data --input_node=input --model_file=/data/local/tmp/mace_run/mobilenet_v1/mobilenet_v1.pb --model_name=mobilenet_v1 --output_data_format=NHWC --input_shape=1,224,224,3 --input_file=/data/local/tmp/mace_run/mobilenet_v1/validate_in/mobilenet_v1 --output_file=/data/local/tmp/mace_run/mobilenet_v1/validate_out/mobilenet_v1 --round=0
Runing ...
['', 'MACE_INTERNAL_STORAGE_PATH=/data/local/tmp/mace_run/mobilenet_v1/interior', 'MACE_TUNING=1', 'MACE_RUN_PARAMETER_PATH=/data/local/tmp/mace_run/mobilenet_v1/interior/tune_params', 'LD_LIBRARY_PATH=/data/local/tmp/mace_run/mobilenet_v1', '/data/local/tmp/mace_run/mobilenet_v1/mace_run', '--device=GPU', '--input_data_format=NHWC', '--output_node=MobilenetV1/Predictions/Reshape_1', '--output_shape=1,1001', '--model_data_file=/data/local/tmp/mace_run/mobilenet_v1/mobilenet_v1.data', '--input_node=input', '--model_file=/data/local/tmp/mace_run/mobilenet_v1/mobilenet_v1.pb', '--model_name=mobilenet_v1', '--output_data_format=NHWC', '--input_shape=1,224,224,3', '--input_file=/data/local/tmp/mace_run/mobilenet_v1/validate_in/mobilenet_v1', '--output_file=/data/local/tmp/mace_run/mobilenet_v1/validate_out/mobilenet_v1', '--round=0']
CMD> adb -s CUY0219604010390 push /tmp/tmprA6N4N/cmd.sh /data/local/tmp/mace_run/mobilenet_v1
18 KB/s (782 bytes in 0.040s)

CMD> adb -s CUY0219604010390 shell sh /data/local/tmp/mace_run/mobilenet_v1/cmd.sh
I /home/yuens/code/mace/mace/tools/mace_run.cc:530] model name: mobilenet_v1
I /home/yuens/code/mace/mace/tools/mace_run.cc:531] mace version: v0.13.0-15-gd763bc2
I /home/yuens/code/mace/mace/tools/mace_run.cc:532] input node: input
I /home/yuens/code/mace/mace/tools/mace_run.cc:533] input shape: 1,224,224,3
I /home/yuens/code/mace/mace/tools/mace_run.cc:534] output node: MobilenetV1/Predictions/Reshape_1
I /home/yuens/code/mace/mace/tools/mace_run.cc:535] output shape: 1,1001
I /home/yuens/code/mace/mace/tools/mace_run.cc:536] input_file: /data/local/tmp/mace_run/mobilenet_v1/validate_in/mobilenet_v1
I /home/yuens/code/mace/mace/tools/mace_run.cc:537] output_file: /data/local/tmp/mace_run/mobilenet_v1/validate_out/mobilenet_v1
I /home/yuens/code/mace/mace/tools/mace_run.cc:538] input dir:
I /home/yuens/code/mace/mace/tools/mace_run.cc:539] output dir: output
I /home/yuens/code/mace/mace/tools/mace_run.cc:540] model_data_file: /data/local/tmp/mace_run/mobilenet_v1/mobilenet_v1.data
I /home/yuens/code/mace/mace/tools/mace_run.cc:541] model_file: /data/local/tmp/mace_run/mobilenet_v1/mobilenet_v1.pb
I /home/yuens/code/mace/mace/tools/mace_run.cc:542] device: GPU
I /home/yuens/code/mace/mace/tools/mace_run.cc:543] round: 0
I /home/yuens/code/mace/mace/tools/mace_run.cc:544] restart_round: 1
I /home/yuens/code/mace/mace/tools/mace_run.cc:545] gpu_perf_hint: 3
I /home/yuens/code/mace/mace/tools/mace_run.cc:546] gpu_priority_hint: 3
I /home/yuens/code/mace/mace/tools/mace_run.cc:547] omp_num_threads: -1
I /home/yuens/code/mace/mace/tools/mace_run.cc:548] cpu_affinity_policy: 1
I /home/yuens/code/mace/mace/libmace/mace.cc:506] Creating MaceEngine, MACE version: v0.13.0-15-gd763bc2
I /home/yuens/code/mace/mace/libmace/mace.cc:561] Initializing MaceEngine
I /home/yuens/code/mace/mace/libmace/mace.cc:710] Destroying MaceEngine
I /home/yuens/code/mace/mace/tools/mace_run.cc:599] restart round 0
I /home/yuens/code/mace/mace/libmace/mace.cc:1024] Create MaceEngine from model graph proto and weights data
I /home/yuens/code/mace/mace/libmace/mace.cc:506] Creating MaceEngine, MACE version: v0.13.0-15-gd763bc2
I /home/yuens/code/mace/mace/libmace/mace.cc:561] Initializing MaceEngine
I /home/yuens/code/mace/mace/tools/mace_run.cc:272] Create Mace Engine latency: 106.116 ms
I /home/yuens/code/mace/mace/tools/mace_run.cc:279] Total init latency: 106.238 ms
I /home/yuens/code/mace/mace/tools/mace_run.cc:373] Warm up run
I /home/yuens/code/mace/mace/tools/mace_run.cc:409] 1st warm up run latency: 4405.56 ms
I /home/yuens/code/mace/mace/tools/mace_run.cc:494] Write output file /data/local/tmp/mace_run/mobilenet_v1/validate_out/mobilenet_v1_MobilenetV1_Predictions_Reshape_1 with size 4004 done.
========================================================
capability(CPU)        init      warmup     run_avg
========================================================
time          15.995     106.238    4405.564      -1.000
I /home/yuens/code/mace/mace/libmace/mace.cc:710] Destroying MaceEngine

CMD> adb -s CUY0219604010390 shell getprop
CMD> adb -s CUY0219604010390 shell getprop
CMD> adb -s CUY0219604010390 pull /data/local/tmp/mace_run/mobilenet_v1/interior/mace_cl_compiled_program.bin build/mobilenet_v1/opencl/mobilenet_v1_compiled_opencl_kernel.YAL-AL00.kirin980.bin
CMD> adb -s CUY0219604010390 shell getprop
CMD> adb -s CUY0219604010390 shell getprop
CMD> adb -s CUY0219604010390 pull /data/local/tmp/mace_run/mobilenet_v1/interior/tune_params build/mobilenet_v1/opencl/mobilenet_v1_tuned_opencl_parameter.YAL-AL00.kirin980.bin
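
In the log above, the on-device command is prefixed with environment variables, among them MACE_TUNING=1 and MACE_RUN_PARAMETER_PATH, and the file at that parameter path (tune_params) is what gets pulled back as the tuned_opencl_parameter file. This suggests that tuning mode is switched on through the environment rather than through a mace_run flag. The sketch below separates the environment assignments from the program arguments so they are easier to inspect. The helper name split_env_and_argv is hypothetical, and the command string is abbreviated from the log.

```python
import shlex


def split_env_and_argv(command_line):
    """Separate leading NAME=value assignments from the program and its arguments."""
    tokens = shlex.split(command_line)
    env = {}
    index = 0
    for index, token in enumerate(tokens):
        name, sep, value = token.partition("=")
        if sep and name.isidentifier():
            env[name] = value
        else:
            # First token that is not an assignment: the program itself.
            break
    else:
        # Every token was an assignment (or there were no tokens at all).
        index = len(tokens)
    return env, tokens[index:]


if __name__ == "__main__":
    cmd = (
        "MACE_INTERNAL_STORAGE_PATH=/data/local/tmp/mace_run/mobilenet_v1/interior "
        "MACE_TUNING=1 "
        "MACE_RUN_PARAMETER_PATH=/data/local/tmp/mace_run/mobilenet_v1/interior/tune_params "
        "LD_LIBRARY_PATH=/data/local/tmp/mace_run/mobilenet_v1 "
        "/data/local/tmp/mace_run/mobilenet_v1/mace_run --device=GPU --round=0"
    )
    env, argv = split_env_and_argv(cmd)
    for name, value in env.items():
        print(f"{name} = {value}")
    print("argv:", argv)
```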
ysh329 commented 4 years ago

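mace

Values printed by mace_run in the log above: gpu_perf_hint: 3, gpu_priority_hint: 3, cpu_affinity_policy: 1, omp_num_threads: -1. The matching option table entries (name, type, default, command, description):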

    * - --omp_num_threads
      - int
      - -1
      - ``run``
      - number of threads
    * - --cpu_affinity_policy
      - int
      - 1
      - ``run``
      - 0:AFFINITY_NONE/1:AFFINITY_BIG_ONLY/2:AFFINITY_LITTLE_ONLY
    * - --gpu_perf_hint
      - int
      - 3
      - ``run``
      - 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
    * - --gpu_priority_hint
      - int
      - 3
      - ``run``/``benchmark``
      - 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
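
To read the numeric values in the log more easily, the codes from the table can be mapped back to their names. This is only an illustrative sketch: the Python enums and the describe_run_options helper are hypothetical and mirror the table text, not MACE's own API.

```python
from enum import IntEnum


class CPUAffinityPolicy(IntEnum):
    AFFINITY_NONE = 0
    AFFINITY_BIG_ONLY = 1
    AFFINITY_LITTLE_ONLY = 2


class GPUHint(IntEnum):
    """Value names shared by --gpu_perf_hint and --gpu_priority_hint in the table."""
    DEFAULT = 0
    LOW = 1
    NORMAL = 2
    HIGH = 3


def describe_run_options(gpu_perf_hint, gpu_priority_hint,
                         cpu_affinity_policy, omp_num_threads):
    """Translate the integer option values into the names listed in the table."""
    return {
        "gpu_perf_hint": GPUHint(gpu_perf_hint).name,
        "gpu_priority_hint": GPUHint(gpu_priority_hint).name,
        "cpu_affinity_policy": CPUAffinityPolicy(cpu_affinity_policy).name,
        # The table only describes this as "number of threads", so keep it numeric.
        "omp_num_threads": omp_num_threads,
    }


if __name__ == "__main__":
    # Values taken from the mace_run log in this thread.
    print(describe_run_options(3, 3, 1, -1))
```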
ysh329 commented 4 years ago

mace benchmark/armv7

caffe-mobilenetv1

phone serialno soc gpu un-tuned(ms) tuned(ms)
Redmi6Pro / 625 577396399905 msm8953 adreno506 187.0 101.9
MI9 / 855 17c3cc34 msmnile adreno640 53.4 9.84
MI8 / 845 7f1446bd sdm845 adreno630 28.5 14.3
NX563J / 835 93fe992b msm8998 adreno540 28.3 13.2
YAL-AL00 / 980 CUY0219604010390 kirin980 G76 17.6
ALP-TL00 / 970 NDF7N18818005475 kirin970 G72 30.4

caffe-mobilenetv2

phone serialno soc gpu un-tuned(ms) tuned(ms)
Redmi6Pro / 625 577396399905 msm8953 adreno506 85.6
MI9 / 855 17c3cc34 msmnile adreno640 113.0 9.8
MI8 / 845 7f1446bd sdm845 adreno630 38.4 11.7
NX563J / 835 93fe992b msm8998 adreno540 42.8 11.9
YAL-AL00 / 980 CUY0219604010390 kirin980 G76 20.3
ALP-TL00 / 970 NDF7N18818005475 kirin970 G72 23.0
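
To compare these numbers with the 1~10% figure quoted from the documentation, the speedup and latency reduction can be computed for the rows that list both timings. The sketch below copies those rows from the two lists above; rows with a single timing are left out because the thread does not say which column the value belongs to. The thread also does not state whether both columns were measured under identical conditions, so the ratios should be treated as indicative only.

```python
# (un-tuned ms, tuned ms) for rows that list both values, keyed by phone / SoC.
RESULTS_MS = {
    "caffe-mobilenetv1": {
        "Redmi6Pro / msm8953": (187.0, 101.9),
        "MI9 / msmnile": (53.4, 9.84),
        "MI8 / sdm845": (28.5, 14.3),
        "NX563J / msm8998": (28.3, 13.2),
    },
    "caffe-mobilenetv2": {
        "MI9 / msmnile": (113.0, 9.8),
        "MI8 / sdm845": (38.4, 11.7),
        "NX563J / msm8998": (42.8, 11.9),
    },
}


def speedup(untuned_ms, tuned_ms):
    """How many times faster the tuned run is."""
    return untuned_ms / tuned_ms


def latency_reduction_percent(untuned_ms, tuned_ms):
    """Percentage of the un-tuned latency removed by tuning."""
    return (untuned_ms - tuned_ms) / untuned_ms * 100.0


if __name__ == "__main__":
    for model, rows in RESULTS_MS.items():
        print(model)
        for device, (untuned, tuned) in rows.items():
            print(f"  {device}: {speedup(untuned, tuned):.2f}x, "
                  f"{latency_reduction_percent(untuned, tuned):.1f}% lower latency")
```

For every row with both timings, the reduction comes out well above the 1~10% range quoted from the documentation.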

tuned

mace/build/caffe_mobilenetv1# tree
.
|-- model
|   |-- caffe_mobilenetv1.data
|   |-- caffe_mobilenetv1.pb
|   |-- caffe_mobilenetv1.pb_txt
|   `-- caffe_mobilenetv1_index.html
|-- opencl
|   |-- caffe_mobilenetv1_compiled_opencl_kernel.MI8.sdm845.bin
|   |-- caffe_mobilenetv1_compiled_opencl_kernel.MI9.msmnile.bin
|   |-- caffe_mobilenetv1_compiled_opencl_kernel.NX563J.msm8998.bin
|   |-- caffe_mobilenetv1_tuned_opencl_parameter.MI8.sdm845.bin
|   |-- caffe_mobilenetv1_tuned_opencl_parameter.MI9.msmnile.bin
|   `-- caffe_mobilenetv1_tuned_opencl_parameter.NX563J.msm8998.bin
`-- org_model
    `-- mobilenet_deploy.prototxt.new.prototxt-68ae854a329ffd99893d54fb734a716c51ac3d08379acc19c1c69a61a20ea24b.pb

3 directories, 11 files
ysh329 commented 4 years ago
mace/utils/tuner.h

mace/core/device_context.h
mace/core/device_context.cc

mace/core/runtime/opencl/gpu_device.h
mace/core/runtime/opencl/gpu_device.cc

mace/core/runtime/opencl/opencl_helper.cc

mace/core/runtime/opencl/opencl_runtime.h
mace/core/runtime/opencl/opencl_runtime.cc
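
The thread stops at this list of source files without walking through the tuner itself. As rough orientation only: parameter tuning of this kind is commonly done by running a kernel under several candidate launch parameters, timing each run, and caching the fastest choice per kernel key so that later runs can look it up. The sketch below illustrates that general idea in Python. It is not MACE's code, and every name in it (tune_or_run, param_table, the candidate work-group sizes, the fake cost function) is hypothetical.

```python
def tune_or_run(key, default_params, candidates, run, param_table, tuning):
    """Run a kernel with the best known parameters, searching for them if tuning is on.

    run(params) must execute the kernel and return its measured time.
    param_table maps a kernel key to the best parameters found so far.
    """
    if tuning:
        best_params, best_time = None, float("inf")
        # Always include the default so tuning can never pick something slower.
        for params in [default_params, *candidates]:
            elapsed = run(params)
            if elapsed < best_time:
                best_params, best_time = params, elapsed
        param_table[key] = best_params
        return best_params, best_time
    # Normal run: use the cached result if present, else fall back to the default.
    params = param_table.get(key, default_params)
    return params, run(params)


def fake_kernel_time(local_work_size):
    """Stand-in for an on-device measurement; fastest at (8, 4) by construction."""
    x, y = local_work_size
    return abs(x - 8) + abs(y - 4) + 1.0


if __name__ == "__main__":
    table = {}
    print(tune_or_run("conv2d_example", (4, 4), [(8, 4), (16, 16)],
                      fake_kernel_time, table, tuning=True))
    # A later, non-tuning run reuses the cached parameters.
    print(tune_or_run("conv2d_example", (4, 4), [], fake_kernel_time, table, tuning=False))
```

In a real setup the table would be serialized to a file after tuning and loaded at startup, which is the role the tuned_opencl_parameter file appears to play in this thread.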