zhouwg / kantv

Workbench for learning & practising AI tech in real scenarios on Android devices, powered by GGML (Georgi Gerganov Machine Learning), NCNN (Tencent NCNN) and FFmpeg
Apache License 2.0

ggml-qnn: refine ggml backend subsystem #216

Closed · zhouwg closed this 1 month ago

zhouwg commented 1 month ago

This PR is equivalent to a PR submitted to the upstream GGML community:

https://github.com/ggerganov/llama.cpp/pull/7641

Unfortunately, the upstream PR was closed by the maintainer of the ggml backend subsystem almost immediately, less than 1 minute after I submitted it to upstream llama.cpp.

I totally disagree with what the maintainer of the ggml backend subsystem said (because some special backends only need system memory):

"There are too many things wrong here to list. At the most basic level, this approach will not work because backends typically have a memory that is not accessible from other backends, and when switching to a different backend it is necessary to ensure that all the tensors required to evaluate the graph are available in the backend memory. This is the main job of ggml_backend_sched."

zhouwg commented 1 month ago

Whisper, LLM, and MiniCPM-V inference all work fine with this PR (mixed inference between Qualcomm's CPU & GPU / CPU & NPU).

Next steps: (1) bug fixes in the JNI layer; these are not critical at this development stage (there are three known bugs in the JNI layer);

(2) QNN performance fine-tuning, focused on matmul: QNN's matmul is 2x-10x faster than the original GGML matmul, so only matmul is offloaded to the QNN side while all other GGML ops are still computed by the original GGML code on the CPU side (see the sketch below).
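A minimal C sketch of this dispatch policy follows. It assumes the real ggml types (`struct ggml_tensor`, whose `op` field holds the op enum, and `GGML_OP_MUL_MAT`); the function name is hypothetical and not taken from this PR's actual code.

```c
/* Sketch of the dispatch policy described above, assuming the real
 * ggml API (struct ggml_tensor, enum ggml_op, GGML_OP_MUL_MAT); the
 * helper name is hypothetical, not this PR's actual code. */
#include <stdbool.h>
#include "ggml.h"

/* Offload only matrix multiplication to the QNN backend, where it is
 * reported to be 2x-10x faster; every other op stays on the original
 * GGML CPU path. */
static bool ggml_qnn_can_handle_op(const struct ggml_tensor *node) {
    return node->op == GGML_OP_MUL_MAT;
}
```

Restricting the offload to a single op keeps the backend boundary simple: the CPU remains the fallback for everything, and the accelerator is used only where it has a clear measured win.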