PoC: Add Qualcomm mobile SoC native backend for GGML

zhouwg commented 3 months ago

Background of this PoC:

GGML is a very compact, highly optimized machine learning library written in pure C/C++. It is also the solid cornerstone of whisper.cpp and llama.cpp. Compared with some well-known machine learning frameworks and libraries (e.g. Google TensorFlow, Microsoft ONNX, Meta PyTorch, Baidu PaddlePaddle), GGML has very little encapsulation, so it is very useful and educational for an AI beginner such as me. In general, GGML has the following features:

Written in C
16-bit float support
Integer quantization support (4-bit, 5-bit, 8-bit, etc.)
Automatic differentiation
ADAM and L-BFGS optimizers
Optimized for Apple Silicon
On x86 architectures utilizes AVX / AVX2 intrinsics
On ppc64 architectures utilizes VSX intrinsics
No third-party dependencies
Zero memory allocations during runtime
All in one source file, similar to imgui (this is just my personal opinion, and I really like it: it is a NEW coding style that demands a lot from the programmer and is very helpful for experienced programmers, although it may not be acceptable in large commercial IT companies because it violates some principles of modern software engineering)
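
To make the "integer quantization" item more concrete, here is a minimal sketch of 8-bit block quantization: each block of floats is stored as one scale plus small integers. The block size, the `BlockQ8` type, and the function names are assumptions for illustration only; GGML's real quantized formats differ in layout and bit width.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical block type for illustration; NOT ggml's actual data layout.
constexpr std::size_t kBlockSize = 32;

struct BlockQ8 {
    float scale;                            // one f32 scale per block
    std::array<std::int8_t, kBlockSize> q;  // quantized values in [-127, 127]
};

// Quantize one block: scale = max|x| / 127, q = round(x / scale).
inline BlockQ8 quantize_block(const std::array<float, kBlockSize>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));

    BlockQ8 out{};
    out.scale = amax / 127.0f;
    // An all-zero block has scale 0; avoid dividing by it.
    const float inv = (out.scale != 0.0f) ? 1.0f / out.scale : 0.0f;
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        out.q[i] = static_cast<std::int8_t>(std::lround(x[i] * inv));
    }
    return out;
}

// Recover approximate floats: x is roughly q * scale.
inline std::array<float, kBlockSize> dequantize_block(const BlockQ8& b) {
    std::array<float, kBlockSize> x{};
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        x[i] = static_cast<float>(b.q[i]) * b.scale;
    }
    return x;
}
```

With this scheme the round-trip error of each element is at most about half of the block's scale, which is why smaller blocks (with their own scales) tend to preserve more precision.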

There are four "killer/heavyweight" AI applications based on GGML:
Audio2Text (aka ASR): whisper.cpp
Running LLMs locally: llama.cpp
Text2Image: stable-diffusion.cpp
Text2Speech (aka TTS): bark.cpp

There are also some open-source C/C++ AI projects/examples based on GGML:
[X] Example of GPT-2 inference examples/gpt-2
[X] Example of GPT-J inference examples/gpt-j
[X] Example of LLaMA training ggerganov/llama.cpp/examples/baby-llama
[X] Example of Falcon inference cmp-nct/ggllm.cpp
[X] Example of BLOOM inference NouamaneTazi/bloomz.cpp
[X] Example of RWKV inference saharNooby/rwkv.cpp
[X] Example of SAM inference examples/sam
[X] Example of BERT inference skeskinen/bert.cpp
[X] Example of BioGPT inference PABannier/biogpt.cpp
[X] Example of Encodec inference PABannier/encodec.cpp
[X] Example of CLIP inference monatis/clip.cpp
[X] Example of MiniGPT4 inference Maknee/minigpt4.cpp
[X] Example of ChatGLM inference li-plus/chatglm.cpp
[X] Example of Qwen inference QwenLM/qwen.cpp
[X] Example of YOLO inference examples/yolo
[X] Example of ViT inference staghado/vit.cpp
[X] Example of multiple LLMs inference foldl/chatllm.cpp

Xiaomi 14 was released in China on 10-26-2023 by Xiaomi, one of China's largest mobile phone makers, and has been available in Europe since 02-25-2024. It contains a very powerful mobile SoC: the Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm).
Qualcomm is currently the No.1 mobile SoC semiconductor company in the world (MediaTek had the largest market share in Q1 2024, but I personally think Qualcomm is the real No.1 mobile SoC vendor). The QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:

TensorFlow: tf-1.15.0, or tf-2.10.1
TFLite: tflite-2.3.0
PyTorch: torch-1.13.1
ONNX: onnx-1.11.0

As is well known, Apple's dedicated machine learning acceleration library (ANE) is very important for the performance of ggml/llama.cpp on iOS/Mac.

After successfully finishing PoC #64 between 03-05-2024 and 03-16-2024 (I planned to complete it within a week but failed; that PoC also showed that GGML is a powerful and compact machine learning library that can be used in real applications and complicated scenarios on mobile/edge devices), and after spending 3 days (03-26-2024 to 03-28-2024) on llama.cpp, I want to add a Qualcomm mobile SoC native backend for GGML, out of personal interest and to study AI/machine learning (and the internal mechanism of GGML).

This PoC is a much more difficult task for me because it's my first time using the Qualcomm QNN SDK and I don't know anything about real/hardcore AI/machine learning technology.

I'm not sure I can do it this time, but I want to try (and practice my C/C++ programming and troubleshooting skills), so there is NO timeline for this PoC (it might stop at any point in the future).

This PoC is similar to an open issue in upstream GGML (https://github.com/ggerganov/ggml/issues/771):

Adding Native Support of SYCL for Intel GPUs https://github.com/ggerganov/llama.cpp/issues/4749

SYCL backend support Multi-card https://github.com/ggerganov/llama.cpp/issues/5282

So the SYCL integration work provides a significant reference for this PoC: we can learn from what the Intel R&D team has done with ggml-sycl (ggml-sycl.h, ggml-sycl.cpp).

PRs and any other help (from AI experts, the upstream GGML community, Qualcomm, Xiaomi (Qualcomm's most important customer in China), and others) are greatly welcomed. Guidance from domain experts is greatly appreciated if there is a problem with the path/direction of this PoC.

All code in this PoC will be open-sourced in this project, and I would like to submit it to the upstream GGML community as ggml-qnn.h and ggml-qnn.cpp if it is considered or accepted.

zhouwg commented 3 months ago

Task breakdown (refined gradually in an Agile manner; any help from AI experts, the upstream GGML community, Qualcomm, and others is greatly welcomed, and guidance from domain experts is greatly appreciated if there is a problem with the path/direction of this breakdown):

PoC-S1:
- PoC-S11: background study of Qualcomm QNN SDK
- PoC-S12: model conversion tool in Python: GGUF ---> Qualcomm's dedicated format (I'm not sure whether this step is essential for the PoC or whether it's my misunderstanding). Updated on 03-31-2024, 17:00: **could be skipped** because: ![Screenshot from 2024-03-31 17-24-38](https://github.com/zhouwg/kantv/assets/6889919/6c3f4569-7692-416e-a313-92a283adb3cd)
- PoC-S13: integrate GStreamer into this project, because GStreamer is not only a powerful multimedia framework on Linux but is also officially supported by many semiconductor companies (such as Intel, Qualcomm, NVIDIA, NXP) (updated on 03-29-2024, 21:48: this was a **direction problem** caused by my streaming-media background. GStreamer might not be suitable for this PoC, so this step **could be removed**)
PoC-S2&S3: "play" with the QNN SDK (S2) and study the internal details of GGML (S3)
- PoC-S21: set up the initial development environment for "playing" with the QNN SDK on Xiaomi 14 (done)
- PoC-S22: integrate QNN sample code into ggml-jni and make the data path work as expected (data path: UI <---> Java <---> ggml-jni <---> QNN SDK <---> CPU) (done)
- PoC-S23: integrate QNN sample code into ggml-jni and make the data path work as expected (data path: UI <---> Java <---> ggml-jni <---> QNN SDK <---> GPU) (done)
- PoC-S24: integrate QNN sample code into ggml-jni and make the data path work as expected (data path: UI <---> Java <---> ggml-jni <---> QNN SDK <---> DSP) (updated on 04-19-2024: done)
- PoC-S25: build the code skeleton (Java layer and native layer) of stage 2 of the PoC (done)
- PoC-S26: offload a simple f32 2x2 matrix addition to the QNN CPU backend (milestone, done on 04-07-2024, 17:30)
- PoC-S27: offload a simple f32 2x2 matrix addition to the QNN GPU backend (done on 04-08-2024)
- PoC-S28: offload a simple f32 2x2 matrix addition to the QNN DSP (HTA) backend (skipped on 04-08-2024 because it relies heavily on Qualcomm (undocumented docs/libs?) and there are more valuable things in the next steps; this problem will be solved in the next stage. Updated on 04-19-2024: done)
- PoC-S29: map ggml_tensor and a simple GGML OP (matrix addition) to a QNN tensor and computation graph on the CPU backend and get the correct result (done on 04-08-2024)
- PoC-S30: map ggml_tensor and a simple GGML OP (matrix addition) to a QNN tensor and computation graph on the GPU backend and get the correct result (done on 04-08-2024)
- PoC-S31: map ggml_tensor and a simple GGML OP (matrix addition) to a QNN tensor and computation graph on the DSP backend and get the correct result (skipped on 04-08-2024 because it relies heavily on Qualcomm and there are more valuable things in the next steps; this problem will be solved in the next stage. Updated on 04-19-2024: done)
- PoC-S32: map ggml_tensor and GGML mulmat to a QNN tensor and computation graph on the CPU backend and get the correct result (done on 04-08-2024)
- PoC-S33: map GGML mulmat to the QNN GPU backend (done on 04-08-2024)
- PoC-S34: map GGML mulmat to the QNN DSP backend (skipped on 04-08-2024 because it relies heavily on Qualcomm and there are more valuable things in the next steps; this problem will be solved in the next stage. Updated on 04-19-2024: done)
- PoC-S35: map a complicated GGML computation graph to QNN's computation graph (CPU/GPU backend) and get the correct result
- PoC-S36: map a complicated GGML computation graph to the QNN DSP backend and get the correct result (skipped on 04-08-2024 because it relies heavily on Qualcomm and there are more valuable things in the next steps; this problem will be solved in the next stage. Updated on 04-19-2024: done)
- PoC-S37: study the online AI courses from Andrew Ng and Mu Li (I prefer the Chinese version for faster reading; the English version: https://github.com/d2l-ai/d2l-en ), study what a neural network is, study how to implement a simple neural network in C/C++, then map it to GGML and then to the QNN CPU/GPU backend and get the correct result (this step is actually equivalent to PoC-S35. Updated on 04-10-2024, 23:25: after reading the online AI courses from Andrew Ng and Mu Li very roughly and studying more of ggml's internal mechanism, this step could actually be skipped, although I already know how to do it (I'll submit the source code of PoC-S37 later). Thanks again to the well-designed QNN SDK; it matches the well-designed ggml really well)
PoC-S4: the real challenge in this PoC. Current data path of GGML inference on a Qualcomm mobile SoC based Android phone: UI <---> Java <---> ggml-jni <---> whisper.cpp <---> ggml <---> CPU
- PoC-S41: HLD (high-level design) of the data path: UI <---> Java <---> ggml-jni <---> whisper.cpp <---> ggml <---> ggml-qnn.cpp <---> QNN CPU backend <---> CPU (updated on 04-10-2024, 23:24: there is not much HLD work to do because the well-designed ggml already handles it internally. Updated on 04-11-2024, 23:53: done, thanks to the Intel SYCL backend from a small R&D team in Intel's Shanghai branch. I'd like to thank Qualcomm's well-designed QNN SDK again; it's really ONE SDK)
- PoC-S42: implementation of the data path using the QNN CPU backend: UI <---> Java <---> ggml-jni <---> whisper.cpp <---> ggml <---> ggml-qnn.cpp <---> QNN (CPU backend) <---> CPU (milestone. Updated on 04-13-2024: done, the data path works as expected with whisper.cpp)
- PoC-S43: implementation of the major GGML OP (mulmat) using the QNN API in ggml-qnn.cpp (04-15-2024: works, but with a minor unknown bug. Updated on 04-16-2024: the issue has been fixed)
- PoC-S44: implementation of the data path using the QNN GPU backend: UI <---> Java <---> ggml-jni <---> whisper.cpp <---> ggml <---> ggml-qnn.cpp <---> QNN (GPU backend) <---> CPU (updated on 04-16-2024: done, works as expected with whisper.cpp for the first time, but only through a workaround; the root cause of the crash was not found. Updated on 04-17-2024: done with a better method)
- PoC-S45: validation of the major GGML OP (mulmat) using the QNN GPU backend (done, updated on 04-17-2024)
- PoC-S46: implementation of the data path using the QNN DSP backend: UI <---> Java <---> ggml-jni <---> whisper.cpp <---> ggml <---> ggml-qnn.cpp <---> QNN (DSP backend) <---> CPU. This step will fix the unsolved problems in PoC-S28, PoC-S31, PoC-S34, PoC-S36 (updated on 04-17-2024: skipped, because the QNN HTP (aka DSP) backend depends heavily on the vendor's Android OS (Xiaomi's or Vivo's customized OS based on Qualcomm's BSP). Updated on 04-19-2024: done)
- PoC-S47: validation of the major GGML OP (mulmat) using the QNN DSP backend (updated on 04-17-2024: skipped, because the QNN HTP (aka DSP) backend depends heavily on the vendor's Android OS (Xiaomi's or Vivo's customized OS based on Qualcomm's BSP). Updated on 04-19-2024: done)
- PoC-S48: validate PoC-S42/PoC-S44/PoC-S46 (aka the QNN backend) with the whisper.cpp ASR benchmark or real-time subtitles (updated on 04-13-2024: the data path works as expected, but the ASR result is not correct because other GGML OPs are not yet implemented using the QNN API. Updated on 04-17-2024: done, the QNN backend (CPU & GPU) works well with the whisper.cpp ASR benchmark for the first time)
- PoC-S49: implementation of other GGML OPs using the QNN API
PoC-S5: improve the quality of ggml-qnn.cpp (added on 04-21-2024)
- PoC-S51: resource management of internal QNN resources in ggml-qnn.cpp (done on 04-21-2024)
- PoC-S52: multithreading is not supported with the QNN GPU/DSP backend in ggml-qnn.cpp (updated on 04-22-2024: multithreading works fine with the QNN CPU backend)
- PoC-S53: stability issue when toggling between different backends (QNN CPU/GPU/DSP backend, ggml, ...) (updated on 04-22-2024: found the root cause, done)
- PoC-S54: validate with llama.cpp using the QNN backend (updated on 04-22-2024: works as expected)
PoC-S6: PR to upstream GGML
- PoC-S61: refine the code (remove dependencies, etc.) and prepare to submit to upstream GGML (done on 04-23-2024)
- PoC-S62: merge the code into the master branch (done on 04-23-2024)
- PoC-S63: PR aux-code to upstream whisper.cpp (done)
- PoC-S64: PR aux-code to upstream llama.cpp (done)
- PoC-S65: merge source code from upstream llama.cpp (merge upstream code, validate with the whisper ASR benchmark, validate with llama inference) and refine the code (done on 04-24-2024: whisper.cpp and llama.cpp work well using the QNN CPU/GPU/HTP (aka DSP) backend on Xiaomi 14, a high-end Android phone based on Qualcomm's state-of-the-art mobile SoC)
- PoC-S66: PR ggml-qnn.cpp and ggml-qnn.h to upstream llama.cpp, because llama.cpp is the main playground of ggml (done on 04-24-2024)
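
PoC-S48 notes that ASR results were wrong while most GGML OPs had no QNN implementation. One common way to handle a backend that implements only some operations is per-op fallback to the CPU. The sketch below illustrates that idea only; `Op`, `Backend`, and `pick_backend` are hypothetical names, and this is not a description of how ggml or ggml-qnn.cpp is actually structured.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical names for illustration; not the ggml or QNN API.
enum class Op { Add, Mul, MulMat, Softmax };

struct Backend {
    std::string name;
    std::function<bool(Op)> supports;  // can this backend run the op?
};

// Return the name of the first preferred backend that supports `op`,
// or the fallback's name when none does.
inline std::string pick_backend(Op op,
                                const std::vector<Backend>& preferred,
                                const Backend& fallback) {
    for (const Backend& b : preferred) {
        if (b.supports(op)) return b.name;
    }
    return fallback.name;
}

// Example: an accelerator that implements only ADD, MUL and MUL_MAT.
inline Backend make_accel_backend() {
    return Backend{"accel", [](Op op) {
        return op == Op::Add || op == Op::Mul || op == Op::MulMat;
    }};
}

// Example: a CPU backend that accepts every op.
inline Backend make_cpu_backend() {
    return Backend{"cpu", [](Op) { return true; }};
}
```

With these definitions, `pick_backend(Op::MulMat, {make_accel_backend()}, make_cpu_backend())` selects the accelerator, while an unsupported op such as `Op::Softmax` is routed to the CPU.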

zhouwg commented 3 months ago

Background study:

QNN SDK:

Qualcomm provides two types of AI SDK; the QNN (aka Qualcomm® AI Engine Direct) SDK will be used in this PoC (guidance from domain experts is greatly appreciated if there is a problem with the path/direction here).

[Figure: Qualcomm AI software stack]

[Figure: QNN software stack] https://developer.qualcomm.com/sites/default/files/attachments/qnn_software_stack.png

https://qpm.qualcomm.com/#/main/tools/details/qualcomm_ai_engine_direct

https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools

https://developer.qualcomm.com/forum/qdn-forums/software/hexagon-dsp-sdk/toolsinstallation/70818

https://www.qualcomm.com/agreements

https://developer.qualcomm.com/forum/qdn-forums/software/qualcomm-neural-processing-sdk/71578 (the difference between SNPE and QNN)

https://docs.qualcomm.com/bundle/publicresource/topics/80-64748-1

QNN SDK online SDM(software developer's manual): https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/introduction.html

Inference at the edge: https://github.com/ggerganov/llama.cpp/discussions/205

backend : offload large batches to GPU https://github.com/ggerganov/llama.cpp/pull/6083

explanation of k-quants https://github.com/ggerganov/llama.cpp/pull/1684

zhouwg commented 3 months ago

Updated on 03-30-2024, 19:14: PoC-S21 & PoC-S22 done.

This commit did NOT touch any core code; it was just integration/troubleshooting work. The real challenge may come in the next step.

zhouwg commented 3 months ago

Updated on 03-31-2024, 20:21: PoC-S23 & PoC-S25 done.

This commit did NOT touch any core code; it was just integration work. The real challenge may come in the next step.
But after studying the SDM (software developer's manual) of the QNN SDK, I suddenly got an idea from my streaming-media background: is the scenario here similar to hardware video decoding via OpenMAX IL directly or via Android MediaCodec indirectly? Or similar to hardware video decryption in Widevine L1?

zhouwg commented 3 months ago

Updated on 04-03-2024, 20:38

Blocked on PoC-S26: offload a simple f32 2x2 matrix addition to the QNN CPU backend.

External help from domain experts is greatly welcomed and appreciated.

Updated on 04-02-2024, domain technical experts from Qualcomm:

@quic, @chunit-quic, @haowhsu-quic, @chiwwang, @shewu-quic, @quic-zhanweiw

Updated on 04-03-2024, domain technical experts from Qualcomm:

@quic-ppant, @mynameistechno

I'm sorry to interrupt you; could you help take a look? This is an Android turn-key project for Qualcomm SoC based Android phones (Xiaomi 14 with Snapdragon 8 Gen 3 is preferred), and the issue is easy to reproduce. Thanks so much.

My output/progress from 04/01/2024 to 04/03/2024 can be found here, FYI:

https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/ggml-qnn.cpp
https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp

chiwwang commented 3 months ago

In general I suggest asking questions on the Qualcomm forum. Nonetheless, it's very possible that some fields in the OpConfig structure are wrong, so it's rejected by QNN CPU.

Instead of using QNN APIs directly, I suggest considering SNPE, QNN TFLiteDelegate, onnxruntime-qnn-ep, or ExecuTorch. They provide higher-level APIs and should be friendlier.

zhouwg commented 3 months ago

Thank you very much for your help and guidance.

I'm an individual freelance Android system software programmer (not a company employee) and want to add a Qualcomm backend for GGML out of personal interest and for study. The background of this PoC can be found at the beginning of this page.

GGML is a compact (without much encapsulation) and powerful machine learning library from the open-source community, well suited to machine learning beginners and C/C++ programmers. There is currently no official Qualcomm backend for GGML, so the QNN SDK is preferred for this PoC because it is a very low-level userspace API without much encapsulation and its API is stable.

This PoC already references a lot from ExecuTorch and the QNN samples:

https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/ggml-qnn.cpp#L8

chiwwang commented 3 months ago

I see. Thank you for your interest in QNN.

I assume you have the QNN SDK and documents. One way to debug this kind of error is to compare what we filled into OpConfig with what we know works. For example, you can create a single-layer matmul and use ExecuTorch to delegate it. After making sure it works, we can turn on QNN Saver by setting this option: https://github.com/pytorch/executorch/blob/399482c333dea26500bd79956f0e1299f803a056/backends/qualcomm/utils/utils.py#L196

Instead of compiling a binary, you will get a saver_output.c, which shows how the QNN C API is used to create the working binary. Then we can check how OpConfig is filled.

However, I don't really recommend referencing ExecuTorch code here. ExecuTorch is in Python and has quite a few interesting properties that bury QNN quite deeply. (At the least, they can distract you from the real QNN part.)

Instead of ExecuTorch, I recommend learning the QNN APIs with the QNN Converter plus the Saver backend mentioned above. You can find how to use them in the QNN docs. I recommend qnn-onnx-converter, since ONNX is also an easy-to-read format. Then we can compare the source model, the converted .cpp, and the saver_output.c, which should help a lot in understanding the QNN APIs.
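
The "compare against a known-good configuration" step described above can be sketched as a field-by-field diff. `OpConfigSummary` is a hypothetical, simplified stand-in, not the QNN SDK's actual OpConfig type, and its field names are assumptions; in practice the reference values would be read off a working saver_output.c by hand.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified summary of an op configuration.
// This is NOT the QNN SDK's OpConfig type; field names are assumptions.
struct OpConfigSummary {
    std::string name;          // instance name, allowed to differ
    std::string package_name;
    std::string type_name;
    unsigned num_params;
    unsigned num_inputs;
    unsigned num_outputs;
};

// Return one human-readable line per field that differs between the
// configuration under test and a known-good reference.
inline std::vector<std::string> diff_op_config(const OpConfigSummary& mine,
                                               const OpConfigSummary& known_good) {
    std::vector<std::string> diffs;
    auto check_str = [&diffs](const char* field, const std::string& a,
                              const std::string& b) {
        if (a != b) {
            diffs.push_back(std::string(field) + ": '" + a + "' vs '" + b + "'");
        }
    };
    auto check_num = [&diffs](const char* field, unsigned a, unsigned b) {
        if (a != b) {
            diffs.push_back(std::string(field) + ": " + std::to_string(a) +
                            " vs " + std::to_string(b));
        }
    };
    // The instance name is deliberately not compared.
    check_str("package_name", mine.package_name, known_good.package_name);
    check_str("type_name", mine.type_name, known_good.type_name);
    check_num("num_params", mine.num_params, known_good.num_params);
    check_num("num_inputs", mine.num_inputs, known_good.num_inputs);
    check_num("num_outputs", mine.num_outputs, known_good.num_outputs);
    return diffs;
}
```

An empty result means the summarized fields match; any returned line points at a field worth inspecting first.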

zhouwg commented 3 months ago

@chiwwang, thanks for your time, and I really appreciate your guidance; it was very helpful. I got the point, and now the QNN pipeline works.

Thanks so much. Words cannot express my sincere thanks for your great help.

There is another issue: I don't know why the compute result is incorrect.

https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/ggml-qnn.cpp#L2650

zhouwg commented 3 months ago

@chiwwang, thanks so much again for your great help and guidance. I think I now understand a little more about the QNN SDK, what a tensor is, and what a computation graph is. Your guidance was very important for this progress.

PoC-S26 is now finished: offload a simple f32 2x2 matrix addition to the QNN CPU backend. It's a real milestone in this PoC.

This is a workaround, not a perfect method. It is a very direct approach (just like GGML, without much encapsulation compared with other well-known, huge machine learning frameworks/libraries) and will be used in PoC-S42 through PoC-S44, but there are some unsolved problems in the function qnn_matrix, so I think this commit is NOT perfect.
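
A CPU reference for the 2x2 f32 addition in PoC-S26 is handy for checking an offloaded result, for example when a backend returns incorrect values. The sketch below is an assumption about how such a check could look; `Mat2x2`, `add_2x2`, and `all_close` are hypothetical helpers, not part of ggml or the QNN SDK.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using Mat2x2 = std::array<float, 4>;  // row-major 2x2 matrix

// CPU reference: element-wise addition of two 2x2 matrices.
inline Mat2x2 add_2x2(const Mat2x2& a, const Mat2x2& b) {
    Mat2x2 c{};
    for (std::size_t i = 0; i < c.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// True when every element of `got` is within `tol` of `expected`.
// Written so that a NaN in either input makes the check fail.
inline bool all_close(const Mat2x2& got, const Mat2x2& expected,
                      float tol = 1e-6f) {
    for (std::size_t i = 0; i < got.size(); ++i) {
        if (!(std::fabs(got[i] - expected[i]) <= tol)) return false;
    }
    return true;
}
```

The buffer returned by a backend would be copied into a `Mat2x2` and compared with `add_2x2(a, b)` through `all_close`; a tolerance is used because different hardware may round floating-point results differently.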

zhouwg commented 3 months ago

Updated on 04-08-2024 with this commit:

PoC-S27 finished: offload a simple f32 2x2 matrix addition to the QNN GPU backend (thanks to the well-designed QNN SDK from Qualcomm).

PoC-S29, S30, S32, and S33 finished: mapping ggml_tensor with add/mul/mulmat operations to QNN tensors on the QNN CPU/GPU backend.
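
For reference, the mul and mulmat operations mentioned above can be written as plain CPU loops, which gives a baseline to compare backend output against. This is a sketch using a hypothetical `Matrix` type and the textbook definition of the matrix product; `ggml_mul_mat` has its own operand-layout convention, which this sketch does not try to reproduce.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical dense matrix type for illustration; not a ggml_tensor.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;  // row-major, rows * cols elements
};

// Element-wise product; both operands must have the same shape.
inline Matrix mul_elementwise(const Matrix& a, const Matrix& b) {
    if (a.rows != b.rows || a.cols != b.cols) {
        throw std::invalid_argument("mul_elementwise: shape mismatch");
    }
    Matrix c{a.rows, a.cols, std::vector<float>(a.data.size())};
    for (std::size_t i = 0; i < c.data.size(); ++i) {
        c.data[i] = a.data[i] * b.data[i];
    }
    return c;
}

// Textbook matrix product: (M x K) * (K x N) -> (M x N).
inline Matrix matmul(const Matrix& a, const Matrix& b) {
    if (a.cols != b.rows) {
        throw std::invalid_argument("matmul: inner dimensions differ");
    }
    Matrix c{a.rows, b.cols, std::vector<float>(a.rows * b.cols, 0.0f)};
    for (std::size_t i = 0; i < a.rows; ++i) {
        for (std::size_t k = 0; k < a.cols; ++k) {
            const float aik = a.data[i * a.cols + k];
            for (std::size_t j = 0; j < b.cols; ++j) {
                c.data[i * c.cols + j] += aik * b.data[k * b.cols + j];
            }
        }
    }
    return c;
}
```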

zhouwg commented 2 months ago

Updated on 04-09-2024: a minor refinement.

zhouwg commented 2 months ago

Updated on 04-10-2024, a minor refinement: the APK now runs well on any mainstream Qualcomm mobile SoC based Android phone, and is NOT limited to Xiaomi 14 or other Qualcomm Snapdragon 8 Gen 3 based Android phones.

(The test phone below is a Xiaomi 14, which contains Qualcomm's high-end Snapdragon 8 Gen 3 mobile SoC.)

[Screenshot]

(The test phone below costs about RMB 800-1200 (or USD 120-166; it was RMB 1200 when I purchased it) and contains a low-end Qualcomm mobile SoC. There is a minor UI adjustment because this phone's screen width and height are not sufficient.)

[Screenshots]

tongji1907 commented 2 months ago

I am interested in your project, and I am based in Shanghai. How can I reach you?

zhouwg commented 2 months ago

Thanks so much for your interest in this AI learning project.

This AI learning project is powered by GGML (please refer to the background introduction of this PoC for the reason) and focuses on three heavyweight AI applications on Android devices: whisper.cpp, llama.cpp, and stable-diffusion.cpp. It tries to unify them into ONE AI application (this project).

The to-do list currently contains the following tasks:

submit a PR for this PoC accordingly
improve the quality of real-time subtitles (#84)
fix stable-diffusion.cpp on Xiaomi 14 (it does not work currently; please refer to this open issue in the upstream project: https://github.com/leejet/stable-diffusion.cpp/issues/220)
fix llama.cpp on Xiaomi 14 (#116)
participate in bug fixing in the Java layer/native layer

A Xiaomi 14 or another Qualcomm Snapdragon 8 Gen 3 based Android phone is preferred for this PoC and this AI learning project.

Thanks again.

zhouwg commented 2 months ago

Summary of this PoC

The core implementation (data path works as expected with whisper.cpp using the QNN CPU backend) was completed on 04/13/2024; with that, the main part of this PoC (Proof of Concept, not a commercial product) was complete
The core implementation (data path works as expected with whisper.cpp using the QNN CPU/GPU backend) was completed on 04-17-2024
The core implementation (data path works as expected with whisper.cpp using the QNN HTP (aka DSP) backend) was completed on 04-19-2024
Validated with llama.cpp using the QNN CPU/GPU/HTP (aka DSP) backend on Xiaomi 14; works as expected as of 04/24/2024
The implementation of GGML_OP_ADD / GGML_OP_MUL / GGML_OP_MUL_MAT using the QNN API can be found in ggml-qnn.cpp
(Updated on 04-17-2024) The code in the UI layer and JNI layer was refined; the UI now makes more sense and is clearer to the user (I hope there is no more confusion for users from now on)
A simple UT framework was added for "PoC-S49: implementation of GGML OPs using QNN API"

Todo (improve the quality of the Qualcomm QNN backend for GGML)

1. Resource management of internal QNN resources in ggml-qnn.cpp is lacking (done on 04-21-2024, though the method is not perfect)
1. Multithreading support in ggml-qnn.cpp is lacking (as of 04-22-2024, multithreading works with the QNN CPU backend but not with the QNN GPU and HTP (aka DSP) backends)
1. Stability is lacking (the APK could sometimes crash when toggling between QNN backends (CPU, GPU, DSP)) (done on 04-22-2024 after finding the root cause)
1. Validate with llama.cpp using the QNN backend (done: works as expected on 04-22-2024, works well on 04-24-2024)
1. Using multiple QNN backends (CPU/GPU/DSP) simultaneously is not supported (I'll try it after the PR to upstream)
1. QNN's RPC feature (useful for the QNN HTP (aka DSP) backend) is not used (I'll try it after the PR to upstream)
1. Only FP32 / FP16 are supported, and the input and output tensors must be of the same data type. This limitation could be improved by the community in upstream whisper.cpp/llama.cpp if the PR is accepted by the upstream GGML community
1. Other GGML OPs are not yet implemented using the QNN API. This work is very similar to GGML_OP_ADD / GGML_OP_MUL / GGML_OP_MUL_MAT in ggml-qnn.cpp and could be done by the community in upstream whisper.cpp/llama.cpp if the PR is accepted by the upstream GGML community
1. Merge branch "kantv-poc-with-qnn" into the master branch (done on 04-23-2024)
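
The limits listed above (only ADD/MUL/MUL_MAT, only FP32/FP16, and inputs and output of the same data type) can be expressed as a single predicate. The enums and the `can_offload` function below are hypothetical and only mirror the stated constraints; they are not taken from ggml-qnn.cpp.

```cpp
// Hypothetical enums for illustration; not ggml's or QNN's actual types.
enum class DType { F32, F16, Q4, Q8 };
enum class OpKind { Add, Mul, MulMat, Other };

inline bool is_float_type(DType t) {
    return t == DType::F32 || t == DType::F16;
}

// Mirror the stated limits: only ADD/MUL/MUL_MAT, only FP32/FP16,
// and both inputs and the output must share one data type.
inline bool can_offload(OpKind op, DType src0, DType src1, DType dst) {
    const bool op_ok =
        op == OpKind::Add || op == OpKind::Mul || op == OpKind::MulMat;
    const bool same_type = (src0 == src1) && (src1 == dst);
    return op_ok && same_type && is_float_type(dst);
}
```

A caller could evaluate such a predicate before handing an operation to the backend and keep anything that fails the check on the default CPU path.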

Highlights in 2nd milestone

A complicated computation graph using the QNN API (a reverse-engineered implementation by Project KanTV)

![Screenshot_2024_0414_113706](https://github.com/zhouwg/kantv/assets/6889919/9dc327f1-f4ea-4d1d-ad18-61e891c00cb6)

(1) The data path of GGML's QNN (CPU & GPU) backend (supporting only GGML_OP_ADD, GGML_OP_MUL, GGML_OP_MUL_MAT) works as expected with whisper.cpp on Qualcomm SoC based Android phones, from low-end to high-end (RMB 800 to RMB 8000, or USD 110 to USD 1100).

(2) The data path of GGML's QNN (CPU, GPU & HTP (aka DSP)) backend (supporting only GGML_OP_ADD, GGML_OP_MUL, GGML_OP_MUL_MAT) works as expected with whisper.cpp on a Qualcomm SoC based high-end phone (Xiaomi 14).

(3) The data path of GGML's QNN (CPU, GPU & HTP (aka DSP)) backend (supporting only GGML_OP_ADD, GGML_OP_MUL, GGML_OP_MUL_MAT) works as expected with llama.cpp on a Qualcomm SoC based high-end phone (Xiaomi 14).

In other words, it's a real Qualcomm QNN backend for GGML, although it's not perfect.

![Screenshot_2024_0417_133635](https://github.com/zhouwg/kantv/assets/6889919/38ae8121-7425-49c1-94dd-a912032682b8) ![Screenshot_2024_0417_133531](https://github.com/zhouwg/kantv/assets/6889919/5cb594a4-038e-4c15-8b11-b19c1f54be16) ![Screenshot_2024_0417_133240](https://github.com/zhouwg/kantv/assets/6889919/b9fe8775-04da-44d4-8e8a-342301fb119c) ![Screenshot from 2024-04-17 19-52-58](https://github.com/zhouwg/kantv/assets/6889919/80be637b-c1b9-45dc-a1c8-477fe864aea7) ![426090730](https://github.com/zhouwg/kantv/assets/6889919/dcffc309-a4c3-4606-8536-d8ad167da76f) ![504893116](https://github.com/zhouwg/kantv/assets/6889919/51f0b277-eca4-4938-86f5-415dbf5897e7)

A simple (so-called) UT framework was added for PoC-S49: implementation of other GGML OPs using the QNN API

![Screenshot_2024_0418_131905](https://github.com/zhouwg/kantv/assets/6889919/b53d5309-8c25-437e-9e7f-71c12d61f013) ![Screenshot_2024_0418_131835](https://github.com/zhouwg/kantv/assets/6889919/b8f4bee6-01ba-4ea4-8734-396b18017425) ![Screenshot_2024_0418_131803](https://github.com/zhouwg/kantv/assets/6889919/e998f61a-6e33-412a-bde9-432bf183dcb4) ![Screenshot_2024_0418_131723](https://github.com/zhouwg/kantv/assets/6889919/e1dde47d-cdae-4ece-88ec-653383712670) ![Screenshot_2024_0418_131215](https://github.com/zhouwg/kantv/assets/6889919/8e6d532e-7c7a-4265-b41a-1a7e7f51fc8a) ![Screenshot_2024_0418_131018](https://github.com/zhouwg/kantv/assets/6889919/eeb9e72b-71b4-4932-809b-8b49f21ec29f) ![940276724](https://github.com/zhouwg/kantv/assets/6889919/ad097129-35c5-4118-b8c2-c420f4910888)

4x performance gain for GGML_OP_MUL_MAT using the QNN CPU backend with 1 thread on a Qualcomm mobile SoC based high-end Android phone (Xiaomi 14)

![1922265373](https://github.com/zhouwg/kantv/assets/6889919/d83ed630-d105-4818-9870-8cf539446c75) ![250505401](https://github.com/zhouwg/kantv/assets/6889919/cc3f450a-baac-4bce-be8c-0b5d1ea76efa)

Acknowledgements

Thanks to the breakthrough help from @chiwwang (a technical expert from Qualcomm), thanks to Qualcomm's well-designed QNN SDK (I personally think it matches GGML really well), and thanks to the excellent implementation of the Intel SYCL backend of GGML (I learned a lot from this implementation, which comes from a great open-minded company). Because of them, this PoC could be completed between 03-29-2024 and 04-22-2024, something I did not expect at all at the very beginning. Finally, I'd like to express my sincere thanks to the original author(s) of GGML/whisper.cpp, because I'm a real AI beginner and learned a lot of interesting things from these two great open-source C/C++ AI projects.

All the source code of this PoC (nothing is kept back locally) can be found in the branch "kantv-poc-with-qnn". I hope it is a little useful for programmers like me (who know nothing about real AI technology), because I'm an older programmer who does not belong to this great era, the 2020s. The code of this PoC will be merged into the master branch in the next few days because I think it is stable enough.

Implementation of PoC-S2 & S3: "play with / say hello to" Qualcomm's QNN SDK and study the internal details of GGML

Header file of the implementation of Qualcomm's QNN backend for GGML
Source file of the implementation of Qualcomm's QNN backend for GGML

caofx0418 commented 2 months ago

Expert, how is llama2-7B inference performance on the Qualcomm 8 Gen 3? How many tokens/s?

zhouwg commented 2 months ago

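Thank you.

Performance is good, about 20 tokens/s, and that is without using Qualcomm's hardware AI acceleration engine.

Anyway, I sincerely hope Qualcomm will step in and build a QNN backend itself, investing people, money, and resources the way Intel did for the SYCL backend. That would save everyone the trouble, since we could simply use it directly.

Thanks to the genius programmer Georgi Gerganov for bringing GGML to programmers and to humanity (thanks also to the author of the elegantly designed ggml backend subsystem; although I have reservations about some of his seemingly stubborn decisions, that elegantly designed backend subsystem gave someone like me, who does not understand hardcore AI technology, a little room to contribute). Georgi Gerganov is indeed another idealistic genius programmer from Europe, born in the 1990s (my guess), following Fabrice Bellard, the original author of FFmpeg, who was born in the 1970s (per public information).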

caofx0418 commented 2 months ago

Llama 7B at 20 tokens/s is great!

Using llama.cpp with 4 threads on the 8 Gen 3 CPU, llama2-7B inference reaches at most 5 tokens/s.

zhouwg commented 2 months ago

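Your numbers are right; I misremembered earlier. My apologies.

Google's Gemma can reach 20 tokens/s on the Xiaomi 14 using the CPU only, without the Qualcomm backend. Here is a screenshot from April 24 for reference.

[Screenshot]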

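The current QNN backend is only a basic prototype (the data path works); on the Xiaomi 14, whisper.cpp benchmark performance with it does not match pure-CPU inference. My personal understanding: the QNN (aka AI Engine Direct) SDK, which Qualcomm put great effort into building, is the closed-source FFmpeg of the AI era on Qualcomm platforms, and programmers need to invest effort in learning how to use it correctly and efficiently in order to fully exploit the hardware acceleration of each compute unit (heterogeneous multi-core) on the platform. I had originally planned to keep going and steadily improve ggml's QNN backend so that, with the community's joint effort, it would eventually approach product quality. Anyway, I don't care about that at the moment (that is, whether the PR is approved by the upstream community, although I hope it will be).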

nihui commented 2 months ago

Admirable!

zhouwg commented 2 months ago

Thank you. Your company's open-source projects, and several of your own, are very well done.

zhouwg commented 1 month ago

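Only after I started studying ncnn seriously in May did I realize that you are a senior researcher at Tencent and a renowned AI expert in industry.

I really was ill-informed: ncnn was open-sourced back in 2017, and I only stumbled upon it at the end of April 2024.

This AI learning project uses, reuses, and references a lot of your ncnn example code. Thank you very much!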

zhouwg / kantv