xboot / libonnx

A lightweight, portable pure C99 onnx inference engine for embedded devices with hardware acceleration support.
MIT License

[Conv] cache friendly optimization #3

Closed. ReinForce-II closed this 3 years ago.

ReinForce-II commented 3 years ago

Share data across kernels so that Conv is more cache friendly.
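To illustrate what "sharing data across kernels" can mean for a convolution, here is a minimal sketch. It is not the code from this pull request: the names `conv_naive` and `conv_shared`, the `[C][H][W]` / `[M][C][K][K]` layouts, stride 1, no padding, and the caller-supplied `patch` buffer are all assumptions made for illustration. The baseline walks the whole input once per kernel; the second version gathers each input window once and reuses it for every kernel while it is still in cache.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical layouts (assumed, not taken from libonnx):
 *   x: [C][H][W] input, w: [M][C][K][K] kernels, y: [M][OH][OW] output,
 *   with OH = H - K + 1 and OW = W - K + 1 (stride 1, no padding).
 */

/* Baseline: kernel loop outermost, so every kernel re-reads the input. */
void conv_naive(const float *x, const float *w, float *y,
                int C, int H, int W, int M, int K)
{
    int OH = H - K + 1, OW = W - K + 1;
    for (int m = 0; m < M; m++)
        for (int oh = 0; oh < OH; oh++)
            for (int ow = 0; ow < OW; ow++) {
                float sum = 0.0f;
                for (int c = 0; c < C; c++)
                    for (int kh = 0; kh < K; kh++)
                        for (int kw = 0; kw < K; kw++)
                            sum += x[(c * H + oh + kh) * W + ow + kw]
                                 * w[((m * C + c) * K + kh) * K + kw];
                y[(m * OH + oh) * OW + ow] = sum;
            }
}

/*
 * Shared: for each output position, copy the input window into `patch`
 * (C*K*K floats, provided by the caller) once, then apply all M kernels
 * to that same small buffer.
 */
void conv_shared(const float *x, const float *w, float *y,
                 int C, int H, int W, int M, int K, float *patch)
{
    int OH = H - K + 1, OW = W - K + 1, P = C * K * K;
    assert(patch != NULL);
    for (int oh = 0; oh < OH; oh++)
        for (int ow = 0; ow < OW; ow++) {
            int p = 0;
            /* Gather the window in [c][kh][kw] order, matching one kernel. */
            for (int c = 0; c < C; c++)
                for (int kh = 0; kh < K; kh++)
                    for (int kw = 0; kw < K; kw++)
                        patch[p++] = x[(c * H + oh + kh) * W + ow + kw];
            /* Reuse the gathered window for every kernel. */
            for (int m = 0; m < M; m++) {
                const float *wm = w + (size_t)m * P;
                float sum = 0.0f;
                for (int i = 0; i < P; i++)
                    sum += patch[i] * wm[i];
                y[(m * OH + oh) * OW + ow] = sum;
            }
        }
}
```

Both functions accumulate in the same order, so they should produce the same output; only the memory access pattern differs. How much this helps depends on cache sizes and tensor shapes.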

ReinForce-II commented 3 years ago

Benchmark: super_resolution_10

5900X:

before:
Constant-10   default  10  0.000(us)
Conv-10       default  20  1496819.700(us)
Relu-10       default  15  964.000(us)
Reshape-10    default  10  124.800(us)
Transpose-10  default  5   6619.800(us)

after:
Constant-10   default  10  0.200(us)
Conv-10       default  20  692877.650(us)
Relu-10       default  15  795.733(us)
Reshape-10    default  10  129.200(us)
Transpose-10  default  5   6785.200(us)

M1:

before:
Constant-10   default  10  0.000(us)
Conv-10       default  20  650797.950(us)
Relu-10       default  15  411.600(us)
Reshape-10    default  10  37.600(us)
Transpose-10  default  5   3822.400(us)

after:
Profiler analysis:
Constant-10   default  10  0.000(us)
Conv-10       default  20  640919.200(us)
Relu-10       default  15  368.800(us)
Reshape-10    default  10  37.200(us)
Transpose-10  default  5   3711.400(us)

RK3399, A72:

before:
Constant-10   default  2  0.500(us)
Conv-10       default  4  4831419.250(us)
Relu-10       default  3  2834.000(us)
Reshape-10    default  2  504.000(us)
Transpose-10  default  1  18659.000(us)

after:
Constant-10   default  2  1.000(us)
Conv-10       default  4  2608700.000(us)
Relu-10       default  3  2887.333(us)
Reshape-10    default  2  525.000(us)
Transpose-10  default  1  18725.000(us)

RK3399, A53:

before:
Constant-10   default  2  1.000(us)
Conv-10       default  4  15421956.500(us)
Relu-10       default  3  6136.667(us)
Reshape-10    default  2  971.000(us)
Transpose-10  default  1  47123.000(us)

after:
Constant-10   default  2  0.500(us)
Conv-10       default  4  7815257.500(us)
Relu-10       default  3  6194.667(us)
Reshape-10    default  2  964.500(us)
Transpose-10  default  1  47420.000(us)
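To compare the Conv-10 rows, the before/after times can be turned into speedup ratios. The sketch below assumes the last column is a time in microseconds measured the same way in both runs; the `conv_result` struct and the `speedup` and `print_conv_speedups` helpers are hypothetical names, and the numbers are copied from the Conv-10 rows above. By this measure Conv-10 time roughly halves on the 5900X, A72 and A53, while M1 is nearly unchanged.

```c
#include <assert.h>
#include <stdio.h>

/* One Conv-10 measurement pair, copied from the benchmark rows. */
struct conv_result {
    const char *platform;
    double before_us;
    double after_us;
};

static const struct conv_result conv_results[] = {
    { "5900X",      1496819.700,  692877.650 },
    { "M1",          650797.950,  640919.200 },
    { "RK3399 A72", 4831419.250, 2608700.000 },
    { "RK3399 A53", 15421956.500, 7815257.500 },
};

/* Ratio of before to after; a value above 1 means the change is faster. */
double speedup(double before_us, double after_us)
{
    assert(after_us > 0.0);
    return before_us / after_us;
}

/* Print one line per platform with its Conv-10 speedup. */
void print_conv_speedups(void)
{
    size_t n = sizeof(conv_results) / sizeof(conv_results[0]);
    for (size_t i = 0; i < n; i++)
        printf("%-12s %.2fx\n", conv_results[i].platform,
               speedup(conv_results[i].before_us, conv_results[i].after_us));
}
```

Calling `print_conv_speedups()` from a `main` function prints the ratio for each platform.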