ggml-qnn: refine code and keep ggml-qnn.cpp and ggml-qnn.h in sync between the local tree and the upstream PR

Validated on a Xiaomi 14 (an Android phone based on the Qualcomm Snapdragon 8 Gen 3 mobile SoC) with the following cases:

mulmat with the QNN CPU/GPU/NPU backends across different thread counts (1-8)

qnn-auto-ut (add, mulmat) with the QNN CPU/GPU/NPU backends across different thread counts (1-8). There is a minor bug in the automated test of the GGML mul OP; I'll fix it in the next commit.

whispercpp inference with the QNN CPU/GPU/NPU backends across different thread counts (1-8)

llamacpp inference with the QNN CPU/GPU/NPU backends across different thread counts (1-8)

All of the above test cases work as expected.
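
The validation matrix above (three QNN backends, thread counts 1 through 8) can be enumerated with a short sketch. The names `TestCase` and `build_matrix` are hypothetical and do not come from ggml-qnn.cpp; this only illustrates which combinations were covered, not how the tests were actually driven.

```cpp
#include <string>
#include <vector>

// Hypothetical illustration only: these names are not part of ggml-qnn.
struct TestCase {
    std::string backend;  // "CPU", "GPU", or "NPU"
    int n_threads;        // thread count used for the run
};

// Enumerate every backend/thread-count combination
// for thread counts in [min_threads, max_threads].
std::vector<TestCase> build_matrix(int min_threads = 1, int max_threads = 8) {
    const std::vector<std::string> backends = {"CPU", "GPU", "NPU"};
    std::vector<TestCase> cases;
    for (const auto& backend : backends) {
        for (int t = min_threads; t <= max_threads; ++t) {
            cases.push_back({backend, t});
        }
    }
    return cases;
}
```

With the defaults this yields 24 combinations, each of which would be exercised once per test case listed above.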

This PR was reverted; the work has moved to https://github.com/zhouwg/kantv/pull/215
