ggml-qnn: refine Android command line UT program

ggml-qnn.cpp should/might be bug-free now(fix a stupid bug --- which could case crash or memory leak (comment a special LoC as a workaround to fix crash) --- in ggml-qnn.cpp)
bug free in ggml-jni(bug free in ggml-jni was scheduled/should be done in the previous commits but I found other bugs during development of this PR. these bugs are both caused by the assert failure in ggml.c. Issue/Bug report are greatly welcomed and appreciated.)
refine the codes for purpose of follow the coding style in upstream more strictly, so update PR in upstream is more easily and quickly
add a more complicated/proven Android command line UT(whisper.cpp using QNN backend in Android command line mode --- without third-party Termux) and it also works very fine as expected on Xiaomi14(Qualcomm SM8650-AB Snapdragon 8 Gen 3) which similar to UT(whisper.cpp using QNN backend) in Android APK before PR in upstream on 04/24/2024.
```
[ggml_qnn_can_handle_op, 2089]: src0      tensor_737: type = 8 ( q8_0) ne =   384 x  1536 x     1, nb = (   34,   408, 626688)
```

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7718: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2100]: tensor_7719: type = 0 ( f32) ne = 1536 x 1 x 1, nb = ( 4, 6144, 6144)

[ggml_qnn_can_handle_op, 2089]: src0 tensor_739: type = 8 ( q8_0) ne = 1536 x 384 x 1, nb = ( 34, 1632, 626688)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7721: type = 0 ( f32) ne = 1536 x 1 x 1, nb = ( 4, 6144, 6144)

[ggml_qnn_can_handle_op, 2100]: tensor_7722: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2056]: op name:ADD, tensor type:f32 [ggml_qnn_can_handle_op, 2080]: src0 type:f32 [ggml_qnn_can_handle_op, 2081]: src1 type:f32 [ggml_qnn_can_handle_op, 2056]: op name:ADD, tensor type:f32 [ggml_qnn_can_handle_op, 2080]: src0 type:f32 [ggml_qnn_can_handle_op, 2081]: src1 type:f32 [ggml_qnn_can_handle_op, 2056]: op name:NORM, tensor type:f32 [ggml_qnn_can_handle_op, 2056]: op name:MUL, tensor type:f32 [ggml_qnn_can_handle_op, 2080]: src0 type:f32 [ggml_qnn_can_handle_op, 2081]: src1 type:f32 [ggml_qnn_can_handle_op, 2056]: op name:ADD, tensor type:f32 [ggml_qnn_can_handle_op, 2080]: src0 type:f32 [ggml_qnn_can_handle_op, 2081]: src1 type:f32 [ggml_qnn_can_handle_op, 2056]: op name:MUL_MAT, tensor type:f32 [ggml_qnn_can_handle_op, 2080]: src0 type:q8_0 [ggml_qnn_can_handle_op, 2081]: src1 type:f32

[ggml_qnn_can_handle_op, 2089]: src0 tensor_769: type = 8 ( q8_0) ne = 384 x 384 x 1, nb = ( 34, 408, 156672)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7727: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2100]: tensor_7731: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2056]: op name:SCALE, tensor type:f32 [ggml_qnn_can_handle_op, 2056]: op name:CPY, tensor type:f16 [ggml_qnn_can_handle_op, 2056]: op name:MUL_MAT, tensor type:f32 [ggml_qnn_can_handle_op, 2080]: src0 type:q8_0 [ggml_qnn_can_handle_op, 2081]: src1 type:f32

[ggml_qnn_can_handle_op, 2089]: src0 tensor_770: type = 8 ( q8_0) ne = 384 x 384 x 1, nb = ( 34, 408, 156672)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7727: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2100]: tensor_7733: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2089]: src0 tensor_767: type = 8 ( q8_0) ne = 384 x 384 x 1, nb = ( 34, 408, 156672)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7727: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2100]: tensor_7728: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2089]: src0 tensor_783 (view): type = 1 ( f16) ne = 64 x 27 x 6, nb = ( 2, 768, 128)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7730 (reshaped) (permuted): type = 0 ( f32) ne = 64 x 1 x 6, nb = ( 4, 1536, 256)

[ggml_qnn_can_handle_op, 2100]: tensor_7744: type = 0 ( f32) ne = 27 x 1 x 6, nb = ( 4, 108, 108)

[ggml_qnn_can_handle_op, 2056]: op name:SOFT_MAX, tensor type:f32 [ggml_qnn_can_handle_op, 2056]: op name:MUL_MAT, tensor type:f32 [ggml_qnn_can_handle_op, 2080]: src0 type:f16 [ggml_qnn_can_handle_op, 2081]: src1 type:f32

[ggml_qnn_can_handle_op, 2089]: src0 tensor_784 (view): type = 1 ( f16) ne = 27 x 64 x 6, nb = ( 2, 3072, 196608)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7745: type = 0 ( f32) ne = 27 x 1 x 6, nb = ( 4, 108, 108)

[ggml_qnn_can_handle_op, 2100]: tensor_7747: type = 0 ( f32) ne = 64 x 1 x 6, nb = ( 4, 256, 256)

[ggml_qnn_can_handle_op, 2056]: op name:CPY, tensor type:f32 [ggml_qnn_can_handle_op, 2056]: op name:MUL_MAT, tensor type:f32 [ggml_qnn_can_handle_op, 2080]: src0 type:q8_0 [ggml_qnn_can_handle_op, 2081]: src1 type:f32

[ggml_qnn_can_handle_op, 2089]: src0 tensor_772: type = 8 ( q8_0) ne = 384 x 384 x 1, nb = ( 34, 408, 156672)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7749 (copy of tensor_7747 (permuted)): type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2100]: tensor_7751: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2089]: src0 tensor_776: type = 8 ( q8_0) ne = 384 x 384 x 1, nb = ( 34, 408, 156672)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7756: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2100]: tensor_7757: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2089]: src0 tensor_785 (view): type = 1 ( f16) ne = 64 x 1500 x 6, nb = ( 2, 768, 128)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7758 (reshaped) (permuted): type = 0 ( f32) ne = 64 x 1 x 6, nb = ( 4, 1536, 256)

[ggml_qnn_can_handle_op, 2100]: tensor_7763: type = 0 ( f32) ne = 1500 x 1 x 6, nb = ( 4, 6000, 6000)

[ggml_qnn_can_handle_op, 2089]: src0 tensor_786 (view): type = 1 ( f16) ne = 1500 x 64 x 6, nb = ( 2, 3000, 192000)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7764: type = 0 ( f32) ne = 1500 x 1 x 6, nb = ( 4, 6000, 6000)

[ggml_qnn_can_handle_op, 2100]: tensor_7765: type = 0 ( f32) ne = 64 x 1 x 6, nb = ( 4, 256, 256)

[ggml_qnn_can_handle_op, 2089]: src0 tensor_781: type = 8 ( q8_0) ne = 384 x 384 x 1, nb = ( 34, 408, 156672)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7767 (copy of tensor_7765 (permuted)): type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2100]: tensor_7769: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2089]: src0 tensor_761: type = 8 ( q8_0) ne = 384 x 1536 x 1, nb = ( 34, 408, 626688)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7774: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2100]: tensor_7775: type = 0 ( f32) ne = 1536 x 1 x 1, nb = ( 4, 6144, 6144)

[ggml_qnn_can_handle_op, 2089]: src0 tensor_763: type = 8 ( q8_0) ne = 1536 x 384 x 1, nb = ( 34, 1632, 626688)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7777: type = 0 ( f32) ne = 1536 x 1 x 1, nb = ( 4, 6144, 6144)

[ggml_qnn_can_handle_op, 2100]: tensor_7778: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2089]: src0 tensor_684: type = 8 ( q8_0) ne = 384 x 51864 x 1, nb = ( 34, 408, 21160512)

[ggml_qnn_can_handle_op, 2094]: src1 tensor_7783: type = 0 ( f32) ne = 384 x 1 x 1, nb = ( 4, 1536, 1536)

[ggml_qnn_can_handle_op, 2100]: tensor_7784: type = 0 ( f32) ne = 51864 x 1 x 1, nb = ( 4, 207456, 207456)

[00:00:00.000 --> 00:00:08.000] And so my fellow Americans ask not what your country can do for you [00:00:08.000 --> 00:00:11.000] ask what you can do for your country. [qnn_test_whispercpp, 1435]: whispercpp inference successfully

[qnn_test_whispercpp, 1453]: text[0]:[00:00:00.000 --> 00:00:08.000] And so my fellow Americans ask not what your country can do for you [qnn_test_whispercpp, 1458]: asr result: [ 00:00:00.000 ---> 00:00:08.000 ] And so my fellow Americans ask not what your country can do for you [qnn_test_whispercpp, 1453]: text[1]:[00:00:08.000 --> 00:00:11.000] ask what you can do for your country. [qnn_test_whispercpp, 1458]: asr result: [ 00:00:08.000 ---> 00:00:11.000 ] ask what you can do for your country. [qnn_test_whispercpp, 1461]: inference cost 1952 ms [qnn_test_whispercpp, 1462]: after calling whisper_full [ggml_backend_qnn_free, 3350]: enter ggml_backend_qnn_free [ggml_backend_qnn_free, 3352]: idx 0, name:qnn-cpu [ggml_backend_qnn_free, 3361]: graph type:ADD [qnn_finalize, 1935]: succeed to close rpcmem lib

[ggml_backend_qnn_free, 3375]: leave ggml_backend_qnn_free [whisper_asr_finalize, 1301]: enter whisper_asr_finalize [whisper_asr_finalize, 1320]: leave whisper_asr_finalize [qnn_test_whispercpp, 1485]: whisper ASR result: [ 00:00:00.000 ---> 00:00:08.000 ] And so my fellow Americans ask not what your country can do for you [ 00:00:08.000 ---> 00:00:11.000 ] ask what you can do for your country. [main, 1576]: exit main Bus error 135|houji:/ $ ^C 130|houji:/ $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/local/tmp
houji:/ $ /data/local/tmp/ggml-qnn-test -t 2 -b 0


The "Bus error" is no relevant to QNN backend pls just ignore it, it's a problem in mixed usage between different Android native dynamic libraries which generated by different cross compilation methods.

mulmat's performance comparion:

[ggml_qnn_mul_mat, 2407]: tensor_41: type = 0 ( f32) ne = 4096 x 4096 x 1, nb = ( 4, 16384, 67108864)

[ggml_qnn_mul_mat, 2408]: 4096, 4096, 1, 1 [ggml_qnn_mul_mat, 2409]: tensor0 name tensor_40 [ggml_qnn_mul_mat, 2410]: tensor1 name tensor_39 [ggml_qnn_mul_mat, 2411]: tensor2 name tensor_41 [ggml_qnn_mul_mat, 2499]: 4096, 4096, 1, 1 [ggml_qnn_logcallback, 1782]: 4557.5ms [ DEBUG ] Setting data pointer for tensor ID: 40

[ggml_qnn_logcallback, 1782]: 4561.7ms [ DEBUG ] Setting data pointer for tensor ID: 41

[ggml_qnn_logcallback, 1782]: 4567.2ms [ DEBUG ] Setting data pointer for tensor ID: 42

[ggml_qnn_logcallback, 1782]: 4567.3ms [ INFO ] CpuGraph::execute

[ggml_qnn_mul_mat, 2537]: duration of ggml_qnn_mul_mat : 1411 milliseconds

[ggml_qnn_mul_mat, 2538]: call ggml_qnn_mul_mat done

[ggml_backend_qnn_free, 3350]: enter ggml_backend_qnn_free [ggml_backend_qnn_free, 3352]: idx 0, name:qnn-cpu [ggml_backend_qnn_free, 3361]: graph type:ADD [qnn_finalize, 1935]: succeed to close rpcmem lib

[ggml_backend_qnn_free, 3375]: leave ggml_backend_qnn_free [qnn_op_ut_automation, 572]: 64 x 64: F16 1.1 GFLOPS (128 runs) | F32 1.4 GFLOPS (128 runs) 128 x 128: F16 6.6 GFLOPS (128 runs) | F32 8.3 GFLOPS (128 runs) 256 x 256: F16 24.4 GFLOPS (128 runs) | F32 70.4 GFLOPS (128 runs) 512 x 512: F16 76.0 GFLOPS (128 runs) | F32 133.3 GFLOPS (128 runs) 1024 x 1024: F16 88.1 GFLOPS ( 42 runs) | F32 160.1 GFLOPS ( 75 runs) 2048 x 2048: F16 101.7 GFLOPS ( 6 runs) | F32 135.4 GFLOPS ( 8 runs) 4096 x 4096: F16 93.7 GFLOPS ( 3 runs) | F32 93.7 GFLOPS ( 3 runs)

[qnn_op_ut_automation, 579]: duration of qnn_ggml_op_automation_ut GGML_OP_MUL_MAT with backend 0(QNN-CPU) is: 18067 milliseconds [qnn_op_ut_automation, 580]: leave qnn_ggml_op_automation_ut(automation unit test)

[qnn_op_ut_automation, 531]: j= 6(matrix dimension = 4096,n_max=128),k=5(ggml quant type=GGML_TYPE_F16),i=0

[qnn_op_ut_automation, 531]: j= 6(matrix dimension = 4096,n_max=128),k=5(ggml quant type=GGML_TYPE_F16),i=1

[qnn_op_ut_automation, 531]: j= 6(matrix dimension = 4096,n_max=128),k=5(ggml quant type=GGML_TYPE_F16),i=2

[qnn_op_ut_automation, 531]: j= 6(matrix dimension = 4096,n_max=128),k=6(ggml quant type=GGML_TYPE_F32),i=0

[qnn_op_ut_automation, 531]: j= 6(matrix dimension = 4096,n_max=128),k=6(ggml quant type=GGML_TYPE_F32),i=1

[qnn_op_ut_automation, 531]: j= 6(matrix dimension = 4096,n_max=128),k=6(ggml quant type=GGML_TYPE_F32),i=2

[qnn_op_ut_automation, 572]: 64 x 64: F16 9.6 GFLOPS (128 runs) | F32 10.1 GFLOPS (128 runs) 128 x 128: F16 14.6 GFLOPS (128 runs) | F32 14.0 GFLOPS (128 runs) 256 x 256: F16 16.4 GFLOPS (128 runs) | F32 16.5 GFLOPS (128 runs) 512 x 512: F16 17.2 GFLOPS ( 65 runs) | F32 17.2 GFLOPS ( 65 runs) 1024 x 1024: F16 16.7 GFLOPS ( 8 runs) | F32 16.7 GFLOPS ( 8 runs) 2048 x 2048: F16 16.2 GFLOPS ( 3 runs) | F32 16.2 GFLOPS ( 3 runs) 4096 x 4096: F16 16.1 GFLOPS ( 3 runs) | F32 16.1 GFLOPS ( 3 runs)

[qnn_op_ut_automation, 579]: duration of qnn_ggml_op_automation_ut GGML_OP_MUL_MAT with backend 3(ggml) is: 82084 milliseconds [qnn_op_ut_automation, 580]: leave qnn_ggml_op_automation_ut(automation unit test)

zhouwg / kantv

ggml-qnn: refine Android command line UT program #229