ggml-qnn: refine code and keep ggml-qnn.cpp and ggml-qnn.h in sync between the local tree and the upstream PR

validated on a Xiaomi 14 (an Android phone based on the Qualcomm Snapdragon 8 Gen 3 mobile SoC) with the following cases:

mulmat with the QNN CPU/GPU/NPU backends across different thread counts (1-8)

qnn-auto-ut (add, mulmat) with the QNN CPU/GPU/NPU backends across different thread counts (1-8); note that the automated test of the GGML mul op has a bug

whisper.cpp inference with the QNN CPU/GPU/NPU backends across different thread counts (1-8)

llama.cpp inference with the QNN CPU/GPU/NPU backends across different thread counts (1-8)

all of the above test cases work as expected

zhouwg / kantv