Closed: zhouwg closed this 1 month ago
Whisper, LLM, and MiniCPM-V inference all work fine with this PR (mixed inference between Qualcomm's CPU & GPU / CPU & NPU).
Next steps: (1) bug fixes in the JNI layer; this is not critical at the development stage (there are three known bugs in the JNI layer);
(2) QNN performance fine-tuning: focus on matmul, because QNN's matmul is 2x - 10x slower than the original GGML matmul. All the other GGML ops are computed by the original GGML on the CPU side; only matmul is offloaded to the QNN side.
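The offload policy described above can be sketched as a simple predicate. This is an illustrative toy, not the real ggml backend API: the `demo_op` enum and `offload_to_qnn` function are assumed names that mirror the idea of ggml's `GGML_OP_*` codes and a backend's supports-op check.

```c
#include <stdbool.h>

/* Hypothetical op codes, loosely mirroring ggml's GGML_OP_* enum
 * (names here are illustrative, not the real ggml identifiers). */
enum demo_op {
    DEMO_OP_ADD,
    DEMO_OP_MUL_MAT,
    DEMO_OP_SOFT_MAX,
};

/* Sketch of the policy in this PR: only matmul is routed to the QNN
 * backend; every other op stays on the original GGML CPU path. */
bool offload_to_qnn(enum demo_op op) {
    return op == DEMO_OP_MUL_MAT;
}
```

In the real code this decision would live in the backend's supports-op hook, so the rest of the graph falls through to the CPU backend unchanged.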
This PR is equivalent to a PR in the upstream GGML community:
https://github.com/ggerganov/llama.cpp/pull/7641
Unfortunately, the upstream PR was closed by the maintainer of the ggml backend subsystem almost immediately, less than 1 minute after I submitted it to upstream llama.cpp.
I totally disagree with what the maintainer of the ggml backend subsystem said (because some special backends only need system memory):
"There are too many things wrong here to list. At the most basic level, this approach will not work because backends typically have a memory that is not accessible from other backends, and when switching to a different backend it is necessary to ensure that all the tensors required to evaluate the graph are available in the backend memory. This is the main job of ggml_backend_sched."
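The disagreement above can be made concrete with a toy model. The maintainer's point is that before a backend runs an op, `ggml_backend_sched` must ensure the tensors are resident in that backend's memory; my counterpoint is that a backend operating directly on system memory needs no such transfer. The structs and `needs_transfer` function below are assumed, illustrative names, not the real `ggml_backend` API.

```c
#include <stdbool.h>

/* Toy model of where a tensor lives and what memory a backend uses. */
enum mem_kind { MEM_HOST, MEM_DEVICE };

struct toy_backend { enum mem_kind mem;   };  /* memory the backend computes in */
struct toy_tensor  { enum mem_kind where; };  /* memory the tensor currently sits in */

/* Returns true if a host<->device copy is required before this backend
 * can evaluate an op on this tensor. A backend that reads system (host)
 * memory directly never needs one - the case of "special backends" that
 * only use system memory. */
bool needs_transfer(const struct toy_backend *b, const struct toy_tensor *t) {
    if (b->mem == MEM_HOST) {
        return false;              /* backend works on system memory in place */
    }
    return t->where != MEM_DEVICE; /* device backend needs the tensor on device */
}
```

Under this model, a system-memory backend sidesteps the tensor-availability problem that `ggml_backend_sched` exists to solve, which is the basis of my objection.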