Crash at Mobilenet's first run at conv13 layer when only Conv layer was ported

zhaofenqiang commented 6 years ago

Following the installation instructions, I ran MobileNet inference on a Raspberry Pi. Program output:

zfq@zfq-pi:~/mobile$ ./inference --merged_model ./mobilenet_flowers102.paddle --input_size 150528
I0411 04:58:02.978574 12090 Util.cpp:166] commandline:
Time of init paddle 6.09545 ms.
Time of create from merged model file 792.855 ms.
image
conv_0 running ExpandConvLayer forward... running on acl...
batch_norm_0
conv_1 running ExpandConvLayer forward... running on acl...
batch_norm_1
conv_2 running ExpandConvLayer forward... running on acl...
batch_norm_2
conv_3 running ExpandConvLayer forward... running on acl...
batch_norm_3
conv_4 running ExpandConvLayer forward... running on acl...
batch_norm_4
conv_5 running ExpandConvLayer forward... running on acl...
batch_norm_5
conv_6 running ExpandConvLayer forward... running on acl...
batch_norm_6
conv_7 running ExpandConvLayer forward... running on acl...
batch_norm_7
conv_8 running ExpandConvLayer forward... running on acl...
batch_norm_8
conv_9 running ExpandConvLayer forward... running on acl...
batch_norm_9
conv_10 running ExpandConvLayer forward... running on acl...
batch_norm_10
conv_11 running ExpandConvLayer forward... running on acl...
batch_norm_11
conv_12 running ExpandConvLayer forward... running on acl...
batch_norm_12
conv_13 running ExpandConvLayer forward... running on acl...
Thread [1995785216] Forwarding __conv_13__,
Aborted at 1523437090 (unix time) try "date -d @1523437090" if you are using GNU date
PC: @ 0x0 (unknown)
SIGSEGV (@0x930000) received by PID 12090 (TID 0x76f54400) from PID 9633792; stack trace:
@ 0x767f8270 (unknown)
Segmentation fault

zhaofenqiang commented 6 years ago

I ran inference under gdb and found that the program crashes at line 493: https://github.com/zhaofenqiang/PaddleOnACL/blob/345ff226df5640533708ad6a5d23aa7a32eb3803/paddle/gserver/layers/ACLOperator.hpp#L491-L497
The op is NEConvolutionLayer, but ACL was not built in debug mode, so I cannot step into the configure function to see the details. The configure function itself should be fine, because the other conv layers are configured and executed successfully, which makes this very strange. It may be related to the C++14 unique pointer at https://github.com/zhaofenqiang/PaddleOnACL/blob/345ff226df5640533708ad6a5d23aa7a32eb3803/paddle/gserver/layers/ACLBaseBaseTensor.hpp#L22-L42, because I am not clear about what it is for.
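For reference, here is a minimal sketch of what a unique pointer does in general. It is not the code from the linked header: `DummyTensor`, `make_tensor`, `take_ownership` and `element_count` are hypothetical names used only for illustration. A `std::unique_ptr` is the single owner of an object and frees it automatically. After ownership is moved away, the old pointer is empty, and dereferencing it is undefined behavior that can show up as a segmentation fault.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a tensor wrapper (not an ACL or PaddleOnACL type).
struct DummyTensor {
    explicit DummyTensor(int n) : elements(n) {}
    int elements;
};

// Creates a tensor that is owned by exactly one std::unique_ptr.
// std::make_unique is the C++14 helper for this.
std::unique_ptr<DummyTensor> make_tensor(int elements) {
    return std::make_unique<DummyTensor>(elements);
}

// Transfers ownership out of `src`; afterwards `src` holds nullptr.
std::unique_ptr<DummyTensor> take_ownership(std::unique_ptr<DummyTensor>& src) {
    return std::move(src);
}

// Guards the dereference: an empty unique_ptr must not be dereferenced.
int element_count(const std::unique_ptr<DummyTensor>& p) {
    assert(p != nullptr && "tensor pointer is empty");
    return p->elements;
}
```

If a layer kept using a pointer after its ownership had been moved elsewhere, a check like the one in `element_count` would catch it before the crash. Whether anything like that happens in PaddleOnACL is not established here.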

zhaofenqiang commented 6 years ago

I just thought about group convolution. PaddleOnACL does not implement group convolution yet, or at least has not paid much attention to it, so there could be potential problems there.
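To illustrate why an unhandled group convolution could end in a segmentation fault, the sketch below compares how many weight elements a grouped convolution stores with how many a backend would expect if it ignored the group count. The struct and function names are hypothetical and not taken from PaddleOnACL or ACL, and this only illustrates the hypothesis. It is not a confirmed diagnosis of the conv_13 crash.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a convolution layer (illustrative names only).
struct ConvShape {
    std::size_t in_channels;
    std::size_t out_channels;
    std::size_t kernel_h;
    std::size_t kernel_w;
    std::size_t groups;
};

// Weight elements actually stored for a grouped convolution:
// each output channel only sees in_channels / groups input channels.
std::size_t grouped_weight_count(const ConvShape& s) {
    assert(s.groups > 0 && s.in_channels % s.groups == 0);
    return s.out_channels * (s.in_channels / s.groups) * s.kernel_h * s.kernel_w;
}

// Weight elements a backend would expect if it treated the layer as groups == 1.
std::size_t dense_weight_count(const ConvShape& s) {
    return s.out_channels * s.in_channels * s.kernel_h * s.kernel_w;
}
```

For a depthwise 3x3 layer with 32 input and 32 output channels (groups = 32), the stored weights have 288 elements, while a dense interpretation expects 9216. A backend that configures the layer as dense would read far past the end of the weight buffer.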
