Closed zhouwg closed 1 month ago
Task breakdown (refined gradually in an Agile manner; any help from AI experts, the upstream GGML community, Qualcomm, and others is greatly welcomed. Guidance from domain experts is greatly appreciated if there is a problem with the path or direction of this breakdown):
background study:
Qualcomm provides two types of AI SDK; the QNN (aka Qualcomm® AI Engine Direct) SDK will be used in this PoC (guidance from domain experts is greatly appreciated if there is a problem with this path/direction).
https://developer.qualcomm.com/sites/default/files/attachments/qnn_software_stack.png
https://qpm.qualcomm.com/#/main/tools/details/qualcomm_ai_engine_direct
https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools
https://developer.qualcomm.com/forum/qdn-forums/software/hexagon-dsp-sdk/toolsinstallation/70818
https://www.qualcomm.com/agreements
https://developer.qualcomm.com/forum/qdn-forums/software/qualcomm-neural-processing-sdk/71578 (the difference between SNPE and QNN)
https://docs.qualcomm.com/bundle/publicresource/topics/80-64748-1
QNN SDK online SDM(software developer's manual): https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/introduction.html
Inference at the edge: https://github.com/ggerganov/llama.cpp/discussions/205
backend : offload large batches to GPU https://github.com/ggerganov/llama.cpp/pull/6083
explanation of k-quants https://github.com/ggerganov/llama.cpp/pull/1684
updated on 03-30-2024, 19:14: PoC-S21 & PoC-S22 done
This commit did NOT touch any core code; it is only integration/troubleshooting work. Challenges may come in the next step.
updated on 03-31-2024, 20:21: PoC-S23 & PoC-S25 done
This commit did NOT touch any core code; it is only integration work. Challenges may come in the next step.
However, after studying the SDM (software developer's manual) of the QNN SDK, my streaming-media background suddenly gave me an idea: is the scenario here similar to hardware video decoding via OpenMAX IL directly or via Android MediaCodec indirectly? Or similar to hardware video decryption in Widevine L1?
updated on 04-03-2024,20:38
Blocked on PoC-S26: offload a simple f32 2x2 matrix addition operation to the QNN CPU backend.
External help from domain experts is greatly welcomed and appreciated.
updated on 04-02-2024, domain technical experts from Qualcomm:
@quic, @chunit-quic, @haowhsu-quic, @chiwwang, @shewu-quic, @quic-zhanweiw,
updated on 04-03-2024, domain technical expert from Qualcomm:
@quic-ppant, @mynameistechno,
Sorry to bother you; could you help take a look? This is a turn-key Android project for Qualcomm SoC based Android phones (a Xiaomi 14 with Snapdragon 8 Gen 3 is preferred), and the issue is easy to reproduce. Thanks so much.
My output/progress from 04/01/2024 to 04/03/2024 can be found here, FYI:
https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/ggml-qnn.cpp https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp
In general I suggest asking questions on the Qualcomm forum. Nonetheless, it's very possible that some fields in the OpConfig structure are wrong, so it's rejected by the QNN CPU backend.
Instead of using QNN APIs directly, I suggest considering SNPE, QNN TFLiteDelegate, onnxruntime-qnn-ep or ExecuTorch. They provide higher-level APIs and should be friendlier.
I really appreciate your help and guidance.
I'm an individual freelance Android system software programmer (not a company employee) and want to add a Qualcomm backend for GGML out of personal interest and for study purposes. The background of this PoC can be found at the beginning of this page.
GGML is a compact (without much encapsulation) and powerful machine learning library from the open source community, well suited to machine learning beginners and C/C++ programmers. There is currently no official Qualcomm backend for GGML, so the QNN SDK is preferred for this PoC because it is a very low-level userspace API without much encapsulation and its API is stable.
This PoC already borrows a lot from ExecuTorch and the QNN samples:
https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/ggml-qnn.cpp#L8
I see. Thank you for your interest in QNN.
I assume you have the QNN SDK and documents. One way to debug this kind of error is to compare what we filled into OpConfig with what we know works. For example, you can create a single-layer matmul and use ExecuTorch to delegate it. After making sure it works, we can turn on the QNN Saver by setting this option: https://github.com/pytorch/executorch/blob/399482c333dea26500bd79956f0e1299f803a056/backends/qualcomm/utils/utils.py#L196
Instead of compiling a binary, you will get a saver_output.c, which shows how the QNN C API is used to create the working binary.
Then we can check how OpConfig is filled.
However, I don't really recommend referencing the ExecuTorch code here... ExecuTorch is in Python and has quite a few interesting properties, which bury QNN quite deeply. (At the least, they can distract you from the real QNN part.)
Instead of ExecuTorch, I recommend learning the QNN APIs through the QNN Converter plus the Saver backend mentioned above.
You can find how to use them in the QNN docs. I recommend qnn-onnx-converter since ONNX is also an easy-to-read format.
Then we can compare the source model, the converted .cpp, and the saver_output.c, which should help a lot in understanding the QNN APIs.
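The comparison idea above (a known-good op description versus the one that gets rejected) can be sketched in plain C++. This is only an illustration: `OpConfigView` and `diff_op_config` are hypothetical names, the struct is a hand-filled stand-in rather than the QNN SDK's own OpConfig structure, and the field set is an assumption about what is worth comparing first.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified description of one op. This is NOT the QNN SDK's
// OpConfig structure; it only holds values copied out by hand (for example
// from a saver_output.c and from your own code) so they can be compared.
struct OpConfigView {
    std::string package_name;
    std::string type_name;
    std::vector<std::vector<unsigned>> input_dims;   // one shape per input tensor
    std::vector<std::vector<unsigned>> output_dims;  // one shape per output tensor
};

// Returns one human-readable line per field that differs from the known-good config.
std::vector<std::string> diff_op_config(const OpConfigView& known_good,
                                        const OpConfigView& candidate) {
    // The reference itself must at least name an op type.
    assert(!known_good.type_name.empty());

    std::vector<std::string> diffs;
    if (known_good.package_name != candidate.package_name) {
        diffs.push_back("package_name: expected '" + known_good.package_name +
                        "', got '" + candidate.package_name + "'");
    }
    if (known_good.type_name != candidate.type_name) {
        diffs.push_back("type_name: expected '" + known_good.type_name +
                        "', got '" + candidate.type_name + "'");
    }
    if (known_good.input_dims != candidate.input_dims) {
        diffs.push_back("input_dims: tensor count or shapes differ");
    }
    if (known_good.output_dims != candidate.output_dims) {
        diffs.push_back("output_dims: tensor count or shapes differ");
    }
    return diffs;
}
```

An empty result means the fields captured here agree; any returned line points at the first place to look in the real structure.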
@chiwwang, thanks for your time; I really appreciate your guidance, which is very helpful. I got the point, and now the QNN pipeline works.
Thanks so much. Words cannot express my sincere thanks for your great help.
There is another issue: I don't know why the compute result is incorrect.
https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/ggml-qnn.cpp#L2650
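When a backend returns an incorrect result for something as small as an f32 2x2 addition, one way to narrow it down is to compute the expected values on the CPU and report the first element that disagrees. The sketch below is an assumption about how such a check could look, not code from ggml-qnn.cpp; `add_f32_reference` and `first_mismatch` are hypothetical helper names, and the buffer read back from the backend is assumed to be available as a `std::vector<float>`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference: element-wise f32 addition of two equally sized buffers.
std::vector<float> add_f32_reference(const std::vector<float>& a,
                                     const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Returns the index of the first element where |expected - actual| exceeds tol,
// the common length if only the sizes differ, or -1 if the buffers agree.
std::ptrdiff_t first_mismatch(const std::vector<float>& expected,
                              const std::vector<float>& actual,
                              float tol = 1e-5f) {
    const std::size_t n = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < n; ++i) {
        // Written as !(diff <= tol) so that a NaN in either buffer counts as a mismatch.
        if (!(std::fabs(expected[i] - actual[i]) <= tol)) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    if (expected.size() != actual.size()) {
        return static_cast<std::ptrdiff_t>(n);
    }
    return -1;
}
```

A mismatch at index 0 usually suggests the output buffer was never written, while a mismatch only at later indices hints at a shape or stride problem; both are guesses to verify, not conclusions.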
@chiwwang, thanks again for your great help and guidance. I think I now understand a little more about the QNN SDK, what a tensor is, and what a computation graph is. Your guidance was very important for this progress.
PoC-S26 is now finished: offload a simple f32 2x2 matrix addition to the QNN CPU backend. It's a real milestone in this PoC.
This is a workaround and not perfect. The method is very straightforward (just like the great GGML, without much encapsulation compared to other famous/huge machine learning frameworks/libraries) and will be used in PoC-S42 to PoC-S44, but there are some unsolved problems in the function qnn_matrix, so I consider this commit NOT perfect.
updated on 04-08-2024(April-08-2024) with this commit:
PoC-S27 finished: offload a simple f32 2x2 matrix addition operation to the QNN GPU backend (thanks to the well-designed QNN SDK from Qualcomm)
PoC-S29, S30, S32 & S33 finished: mapping ggml_tensor with add/mul/mul_mat operations to QNN tensors with the QNN CPU/GPU backend
updated on 04-09-2024, a minor refinement:
updated on 04-10-2024, a minor refinement: the APK now runs well on any mainstream Qualcomm mobile SoC based Android phone, NOT only on the Xiaomi 14 or other Snapdragon 8 Gen 3 based phones.
(the test phone below is a Xiaomi 14, which contains Qualcomm's high-end Snapdragon 8 Gen 3 mobile SoC)
(the test phone below costs about RMB 800-1200 (USD 120 - 166; it was RMB 1200 when I purchased it) and contains a low-end Qualcomm mobile SoC. There is a minor UI adjustment because the screen width and height are not sufficient on this phone)
I am interested in your project and I am based in Shanghai. How can I reach you?
Thanks so much for your interest in this AI learning project.
This AI learning project is powered by GGML (please refer to the background introduction of this PoC for the reason) and focuses on three heavyweight AI applications on Android devices: whisper.cpp, llama.cpp and stable-diffusion.cpp. It tries to unify them into ONE AI application (this project).
There are currently some tasks in the to-do list:
A Xiaomi 14 or another Qualcomm Snapdragon 8 Gen 3 based Android phone is preferred for this PoC and this AI learning project.
Thanks again.
The core implementation (data path works as expected with whisper.cpp using the QNN CPU backend) was completed on 04-13-2024; the main part of this PoC (Proof of Concept, not a commercial product) is complete.
The core implementation (data path works as expected with whisper.cpp using the QNN CPU/GPU backend) was completed on 04-17-2024.
The core implementation (data path works as expected with whisper.cpp using the QNN HTP (aka DSP) backend) was completed on 04-19-2024.
Validated with llama.cpp using the QNN CPU/GPU/HTP (aka DSP) backend on Xiaomi 14; works as expected as of 04-24-2024.
The implementation of GGML_OP_ADD / GGML_OP_MUL / GGML_OP_MUL_MAT using the QNN API can be found in ggml-qnn.cpp.
(updated on 04-17-2024) Code in the UI layer and JNI layer was refined; the UI now makes more sense and is clearer to the user (I hope it causes no more confusion).
A simple UT framework was added for "PoC-S49: implementation of GGML OPs using QNN API".
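For unit-testing the three ops named above, a plain CPU reference is a convenient ground truth. The following is a minimal sketch, not the UT framework from this PoC: `RefOp`, `elementwise_reference` and `mul_mat_reference` are hypothetical names. The mul_mat layout follows my understanding of the ggml convention (each output element is the dot product of a row of `a` with a row of `b`, so `b` is effectively transposed); please verify that assumption against ggml.h before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical op selector for the element-wise reference below.
enum class RefOp { Add, Mul };

// Element-wise reference for ADD and MUL on equally sized f32 buffers.
std::vector<float> elementwise_reference(RefOp op,
                                         const std::vector<float>& a,
                                         const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = (op == RefOp::Add) ? a[i] + b[i] : a[i] * b[i];
    }
    return out;
}

// Reference for MUL_MAT under the assumed ggml convention:
//   a has n rows of k values, b has m rows of k values (both row-major),
//   result has m rows of n values, with c[j][i] = dot(a row i, b row j).
std::vector<float> mul_mat_reference(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t n, std::size_t m, std::size_t k) {
    assert(a.size() == n * k && b.size() == m * k);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t j = 0; j < m; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[j * k + p];
            }
            c[j * n + i] = sum;
        }
    }
    return c;
}
```

Small integer-valued inputs keep the expected outputs exact in f32, which makes failures easier to read than tolerance-based comparisons.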
Thanks to the great, breakthrough help from @chiwwang (a technical expert from Qualcomm), to Qualcomm's well-designed QNN SDK (I personally think it is a really good match for GGML), and to the excellent implementation of the Intel SYCL backend of GGML (I learned a lot from this implementation, which comes from a great open-minded company), this PoC could be done from 03-29-2024 to 04-22-2024, something I did not expect at all at the very beginning. To close this section, I'd like to express my sincere thanks to the original author(s) of the great GGML/whisper.cpp, because I'm a real AI beginner and learned a lot of interesting things from these two great open-source C/C++ AI projects.
All the source code of this PoC (nothing is held back locally) can be found in the branch "kantv-poc-with-qnn". I hope it is of some use to programmers like me (who know nothing about real AI tech), because I'm an older programmer who does not belong to this great era of the 2020s. The code of this PoC will be merged to the master branch in the next few days because I think it is stable enough.
header file of the implementation of Qualcomm's QNN backend of GGML
source file of the implementation of Qualcomm's QNN backend of GGML
Expert, how is llama2-7B inference performance on the Qualcomm 8 Gen 3? How many tokens/s?
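Thanks.
Performance is good, about 20 tokens/s, and that is without using Qualcomm's hardware AI acceleration engine.
Anyway, I sincerely hope Qualcomm steps in itself and builds a QNN backend the way Intel invested people, money and resources in the SYCL backend: that would save everyone the trouble, and we could just use it directly.
Thanks to the genius programmer Georgi Gerganov for bringing GGML to programmers and to humanity (thanks also to the author of the ingeniously designed ggml backend subsystem; although I have reservations about some of their seemingly stubborn decisions, that backend subsystem gave someone like me, who does not understand hardcore AI technology, a little room to contribute). Georgi Gerganov is indeed another idealistic genius programmer from Europe, born in the 1990s (my guess), following Fabrice Bellard, the original author of FFmpeg, born in the 1970s (per public information).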
20 tokens/s on Llama 7B is great!
Running llama2-7B inference with llama.cpp on the 8 Gen 3 CPU with 4 threads gives at most 5 tokens/s.
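Your data is correct; I misremembered earlier. My apologies.
Google's gemma can reach 20 tokens/s on the Xiaomi 14, pure CPU, without the Qualcomm backend. Here is a screenshot from April 24 for reference.
The current QNN backend is only a basic prototype (the data path works), and whisper.cpp's measured performance with it on the Xiaomi 14 does not match pure CPU inference. My personal understanding: the QNN (aka AI Direct) SDK, which Qualcomm put great effort into building, is the closed-source FFmpeg of the AI era on Qualcomm platforms; programmers need to spend effort studying how to use it correctly and efficiently to fully exploit the hardware acceleration of the platform's compute units (heterogeneous multi-core). I had originally planned to keep going and, together with the community, gradually improve ggml's QNN backend until it approached product quality. Anyway, I don't care about it at the moment (whether the PR gets approved by the upstream community, although I hope it does).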
Impressive, much respect!
Thanks. Your company's open source projects, and several of your own, are very well done.
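Only after I started seriously studying ncnn in May did I realize that you are a senior researcher at Tencent and a renowned AI expert in industry.
I really was ill-informed: ncnn was open-sourced in 2017, and I only came across it by chance at the end of April 2024.
This AI learning project uses/reuses/references a lot of your ncnn example code. Thank you very much!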
Background of this PoC:
1. GGML is a very compact, highly optimized pure C/C++ machine learning library. GGML is also the solid cornerstone of the amazing whisper.cpp and the magic llama.cpp. Compared to some well-known machine learning frameworks/libraries (e.g. Google TensorFlow, Microsoft ONNX, Meta PyTorch, Baidu PaddlePaddle...), GGML does not have much complicated or redundant encapsulation, so it is very useful and educational for AI beginners (such as me). <!--By studying the internals of GGML, you will know what real AI is and what is behind LLM models and how these AI models work in less than 1 month(less than 2 weeks if your IQ is above 130 ------ the baseline IQ to enter China's top university(China has a population of 1.4 billion and there are only about 50 top domestic universities), less than 1 week if your IQ is above 150 ------ original author of FFmpeg, original author of TensorFlow, original author of TVM, original author of Caffe, original author of GGML). Another thing, tracking code and coding with the GGML API in real task is a good way to study the internals of GGML.--> In general, GGML has the following features:
Written in C
16-bit float support
Integer quantization support (4-bit, 5-bit, 8-bit, etc.)
Automatic differentiation
ADAM and L-BFGS optimizers
Optimized for Apple Silicon
On x86 architectures utilizes AVX / AVX2 intrinsics
On ppc64 architectures utilizes VSX intrinsics
No third-party dependencies
Zero memory allocations during runtime
All in one source file, similar to imgui (this is just my personal opinion and I really like it; it is a NEW coding style that demands relatively more of the programmer and is very helpful for experienced programmers, but it may not be acceptable in large commercial IT companies because it violates some principles of modern software engineering)
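To make the "Integer quantization support" item in the feature list above more concrete, here is a simplified sketch of symmetric 4-bit block quantization. It is an illustration under stated assumptions, not ggml's actual format: as far as I know, ggml's real 4-bit formats pack two codes per byte and store the scale in half precision, while this sketch keeps one code per byte and a float scale for readability. `Block4`, `quantize_block`, `dequantize_block` and `kBlockSize` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Number of values that share one scale (an assumption chosen for illustration).
constexpr std::size_t kBlockSize = 32;

// One quantized block: a shared scale plus one 4-bit code per value.
// Codes are stored unpacked (one per byte) to keep the sketch readable.
struct Block4 {
    float scale;
    std::uint8_t q[kBlockSize];  // each entry is in [0, 15]
};

// Quantizes kBlockSize floats: signed levels in [-8, 7], stored with a +8 offset.
Block4 quantize_block(const float* x) {
    float amax = 0.0f;
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    Block4 b{};
    b.scale = amax / 7.0f;  // the largest magnitude maps to level +/-7
    const float inv = (b.scale > 0.0f) ? 1.0f / b.scale : 0.0f;
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        long level = std::lround(x[i] * inv);
        level = std::max(-8L, std::min(7L, level));
        b.q[i] = static_cast<std::uint8_t>(level + 8);
    }
    return b;
}

// Reconstructs approximate floats from a quantized block.
void dequantize_block(const Block4& b, float* out) {
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        out[i] = static_cast<float>(static_cast<int>(b.q[i]) - 8) * b.scale;
    }
}
```

With this scheme the reconstruction error per value is bounded by about half the block's scale, which is why one outlier in a block costs precision for its 31 neighbours.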
There are four "killer/heavyweight" AI applications based on GGML:
Audio2Text(aka ASR) whisper.cpp
Running LLM locally llama.cpp
Text2Image stable-diffusion.cpp
Text2Speech(aka TTS) bark.cpp
There are also some open source C/C++ AI projects/examples based on GGML:
[X] Example of GPT-2 inference examples/gpt-2
[X] Example of GPT-J inference examples/gpt-j
[X] Example of LLaMA training ggerganov/llama.cpp/examples/baby-llama
[X] Example of Falcon inference cmp-nct/ggllm.cpp
[X] Example of BLOOM inference NouamaneTazi/bloomz.cpp
[X] Example of RWKV inference saharNooby/rwkv.cpp
[X] Example of SAM inference examples/sam
[X] Example of BERT inference skeskinen/bert.cpp
[X] Example of BioGPT inference PABannier/biogpt.cpp
[X] Example of Encodec inference PABannier/encodec.cpp
[X] Example of CLIP inference monatis/clip.cpp
[X] Example of MiniGPT4 inference Maknee/minigpt4.cpp
[X] Example of ChatGLM inference li-plus/chatglm.cpp
[X] Example of Qwen inference QwenLM/qwen.cpp
[X] Example of YOLO inference examples/yolo
[X] Example of ViT inference staghado/vit.cpp
[X] Example of multiple LLMs inference foldl/chatllm.cpp
Xiaomi 14 was released in China on 10-26-2023 by Xiaomi, one of China's largest mobile phone makers, and has been available in Europe since 02-25-2024. Xiaomi 14 contains a very powerful mobile SoC: the Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm).
Qualcomm is currently the No.1 mobile SoC semiconductor company on the planet (MediaTek's market share was No.1 in Q1 2024, but I personally think Qualcomm is the real No.1 mobile SoC vendor). The QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:
After spending 3 days (from 03-26-2024 to 03-28-2024) on llama.cpp, I want to add a Qualcomm mobile SoC native backend for GGML, out of personal interest and to study AI/machine learning (and the internal mechanisms of GGML).
This PoC is a much more difficult task for me because it's my first time using the Qualcomm QNN SDK and I don't know anything about real/hardcore AI/machine learning tech.
I'm not sure I can do it this time, but I want to try (and practice my C/C++ programming and troubleshooting skills), so there is NO timeline for this PoC (it might stop at any point in the future).
The SYCL integration work provides a significant reference for this PoC: we can learn from what the Intel R&D team has done with ggml-sycl (ggml-sycl.h, ggml-sycl.cpp).
PRs or any help (from AI experts, the upstream GGML community, Qualcomm, Xiaomi (Qualcomm's most important customer in China), ...) are greatly welcomed. Guidance from domain experts is greatly appreciated if there is a problem with the path or direction of this PoC.
All code in this PoC will be open-sourced in this project, and I would like to submit it to the upstream GGML community as ggml-qnn.h & ggml-qnn.cpp if it is considered or accepted.