zhouwg / kantv

Workbench for learning and practising AI tech in real scenarios on Android devices, powered by GGML (Georgi Gerganov Machine Learning), NCNN (Tencent NCNN), and FFmpeg
Apache License 2.0

PoC: Add Qualcomm mobile SoC native backend for GGML #121

Closed: zhouwg closed this issue 1 month ago

zhouwg commented 3 months ago

Background of this PoC:

1. GGML is a very compact, highly optimized, pure C/C++ machine learning library. GGML is also the solid cornerstone of the amazing whisper.cpp and the magic llama.cpp. Compared to some well-known machine learning frameworks/libraries (e.g. Google TensorFlow, Microsoft ONNX, Meta PyTorch, Baidu PaddlePaddle...), GGML does not have much complicated or redundant encapsulation, so it is very useful and educational for an AI beginner (such as me). In general, GGML has the following features:

2. The Xiaomi 14 was released in China on 10-26-2023 by Xiaomi, one of China's largest mobile phone giants, and has been available in Europe since 02-25-2024. The Xiaomi 14 contains a very powerful mobile SoC: the Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm).

3. Qualcomm is currently the No.1 mobile SoC semiconductor company on our planet (MediaTek's market share was No.1 in Q1 2024, but I personally think Qualcomm is the real No.1 mobile SoC vendor). The QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:

4. As is well known, Apple's dedicated machine learning acceleration library (ANE) is very important for the performance of ggml/llama.cpp on iOS/Mac.

5. This PoC is similar to an open issue in upstream GGML: https://github.com/ggerganov/ggml/issues/771:
  • Adding Native Support of SYCL for Intel GPUs https://github.com/ggerganov/llama.cpp/issues/4749
  • SYCL backend support Multi-card https://github.com/ggerganov/llama.cpp/issues/5282
  • so the SYCL integration work provides a significant reference for this PoC: we can learn from what the Intel R&D team has done with ggml-sycl (ggml-sycl.h, ggml-sycl.cpp).

6. PRs and any help (from AI experts, the upstream GGML community, Qualcomm, Xiaomi (the most important customer of Qualcomm in China)...) are greatly welcomed. Guidance from domain experts is greatly appreciated if there is a problem with the direction of this PoC.

All code in this PoC will be open-sourced in this project and will be submitted to the upstream GGML community as ggml-qnn.h and ggml-qnn.cpp if it is considered or accepted.

    zhouwg commented 3 months ago

    Task breakdown (refined gradually in an Agile manner; any help from AI experts, the upstream GGML community, Qualcomm, and others is greatly welcomed, and guidance from domain experts is greatly appreciated if there is a problem with the direction of the task breakdown):

    zhouwg commented 3 months ago

    background study:

    1. QNN SDK:

    Qualcomm provides two types of AI SDK; the QNN (aka Qualcomm® AI Engine Direct) SDK will be used in this PoC (guidance from domain experts is greatly appreciated if there is a problem with the direction here).

    QNN software stack diagram: https://developer.qualcomm.com/sites/default/files/attachments/qnn_software_stack.png

    https://qpm.qualcomm.com/#/main/tools/details/qualcomm_ai_engine_direct

    https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools

    https://developer.qualcomm.com/forum/qdn-forums/software/hexagon-dsp-sdk/toolsinstallation/70818

    https://www.qualcomm.com/agreements

    https://developer.qualcomm.com/forum/qdn-forums/software/qualcomm-neural-processing-sdk/71578 (the difference between SNPE and QNN)

    https://docs.qualcomm.com/bundle/publicresource/topics/80-64748-1

    QNN SDK online SDM(software developer's manual): https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/introduction.html

    Inference at the edge: https://github.com/ggerganov/llama.cpp/discussions/205

    backend : offload large batches to GPU https://github.com/ggerganov/llama.cpp/pull/6083

    explanation of k-quants https://github.com/ggerganov/llama.cpp/pull/1684

    zhouwg commented 3 months ago

    updated on 03-30-2024,19:14, PoC-S21 & PoC-S22 done

    This commit did NOT touch any core code; it is just integration and troubleshooting work. The challenges may come in the next step.

    zhouwg commented 3 months ago

    updated on 03-31-2024,20:21, PoC-S23 & PoC-S25 done

    This commit did NOT touch any core code; it is just integration work. The challenges may come in the next step.
    But after studying the QNN SDK's SDM (software developer's manual), my streaming-media background suddenly gave me an idea: is the scenario here similar to hardware video decoding via OpenMAX IL directly or via Android MediaCodec indirectly? Or similar to hardware video decryption in Widevine L1?

    zhouwg commented 3 months ago

    updated on 04-03-2024,20:38

    Blocked on PoC-S26: offloading a simple f32 2x2 matrix addition operation to the QNN CPU backend.

    External help from domain experts is greatly welcomed and appreciated.


    updated on 04-02-2024, domain technical experts from Qualcomm:

    @quic, @chunit-quic, @haowhsu-quic, @chiwwang, @shewu-quic, @quic-zhanweiw

    updated on 04-03-2024, domain technical experts from Qualcomm:

    @quic-ppant, @mynameistechno

    I'm sorry to interrupt you. Could you help take a look? This is a turn-key Android project for Qualcomm SoC based Android phones (a Xiaomi 14 with Snapdragon 8 Gen 3 is preferred), and the issue is easy to reproduce. Thanks so much.

    My output/progress from 04-01-2024 to 04-03-2024 can be found here, FYI:

    https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/ggml-qnn.cpp
    https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp

    chiwwang commented 3 months ago

    In general, I suggest asking questions on the Qualcomm forum. Nonetheless, it's very possible that some fields in the OpConfig structure are wrong, so it's rejected by QNN CPU.

    Instead of using the QNN APIs directly, I suggest considering SNPE, QNN TFLiteDelegate, onnxruntime-qnn-ep, or ExecuTorch. They provide higher-level APIs and should be friendlier.

    zhouwg commented 3 months ago

    Thank you very much for your help and guidance.

    I'm an individual freelance Android system software programmer (not a company employee) and want to add a Qualcomm backend for GGML out of personal interest and for study. The background of this PoC can be found at the beginning of this page.

    GGML is a compact (without much encapsulation) and powerful machine learning library from the open source community, well suited to machine learning beginners and C/C++ programmers. There is currently no official Qualcomm backend for GGML, so the QNN SDK is preferred for this PoC because it is a very low-level userspace API without much encapsulation and its API is stable.

    This PoC already borrows a lot from ExecuTorch and the QNN samples:

    https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/ggml-qnn.cpp#L8

    chiwwang commented 3 months ago

    I see. Thank you for your interest in QNN.

    I assume you have the QNN SDK and documents. One way to debug this kind of error is to compare what we filled into OpConfig with what we know works. For example, you can create a single-layer matmul and use ExecuTorch to delegate it. After making sure it works, we can turn on the QNN Saver by setting this option: https://github.com/pytorch/executorch/blob/399482c333dea26500bd79956f0e1299f803a056/backends/qualcomm/utils/utils.py#L196

    Instead of compiling a binary, you will get a saver_output.c, which shows how the QNN C API is used to create the working binary. Then we can check how OpConfig is filled.

    However, I don't really recommend referencing the ExecuTorch code here... ExecuTorch is in Python and has quite a few interesting properties that bury QNN quite deeply. (At least they can distract you from the real QNN part.)

    Instead of ExecuTorch, I recommend learning the QNN APIs through the QNN Converter plus the Saver backend mentioned above. You can find how to use them in the QNN docs. I recommend qnn-onnx-converter, since ONNX is also an easy-to-read format. Then we can compare the source model, the converted .cpp, and the saver_output.c, which should help a lot in understanding the QNN APIs.
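The comparison described above (what was filled into OpConfig versus a configuration known to work) could be sketched as a field-by-field diff. This is only an illustration: `OpConfigSummary` and `diff_op_config` are hypothetical names, and the struct is a simplified stand-in, not the real QNN OpConfig layout. The field values would be copied by hand from your own code and from saver_output.c.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified summary of one op configuration. This is NOT the
// real QNN OpConfig layout; it only keeps fields that are easy to compare.
struct OpConfigSummary {
    std::string package_name;
    std::string type_name;
    std::size_t num_params  = 0;
    std::size_t num_inputs  = 0;
    std::size_t num_outputs = 0;
};

// Returns the names of the fields where `mine` differs from `known_good`.
std::vector<std::string> diff_op_config(const OpConfigSummary &mine,
                                        const OpConfigSummary &known_good) {
    std::vector<std::string> diffs;
    if (mine.package_name != known_good.package_name) diffs.push_back("package_name");
    if (mine.type_name    != known_good.type_name)    diffs.push_back("type_name");
    if (mine.num_params   != known_good.num_params)   diffs.push_back("num_params");
    if (mine.num_inputs   != known_good.num_inputs)   diffs.push_back("num_inputs");
    if (mine.num_outputs  != known_good.num_outputs)  diffs.push_back("num_outputs");
    return diffs;
}
```

An empty result means the summarized fields agree; any returned name points at a field worth checking first.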

    zhouwg commented 3 months ago

    @chiwwang, thank you for your time and your guidance; it was really helpful. I got the point, and now the QNN pipeline works.

    Thanks so much. Words cannot express my sincere thanks for your great help.

    There is another issue: I don't know why the compute result is incorrect.

    https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/ggml-qnn.cpp#L2650

    zhouwg commented 3 months ago

    @chiwwang, thanks so much for your great help and guidance again. I think I now understand a little more about the QNN SDK, and about what a tensor and a computation graph are. Your guidance was very important for this progress.

    PoC-S26 is now finished: offloading a simple f32 2x2 matrix addition to the QNN CPU backend. It's a real milestone in this PoC.

    This is a workaround, not a perfect solution. The method is very straightforward (just like the great GGML, without much encapsulation compared to other famous, huge machine learning frameworks/libraries) and will be used in PoC-S42 to PoC-S44, but there are some unsolved problems in the function qnn_matrix, so I think this commit is NOT perfect.
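One way to check an offloaded f32 2x2 addition like the one in PoC-S26 is to compare the backend's output against a plain host-side reference. The sketch below assumes the backend result has already been copied into a flat row-major array; `Mat2x2`, `add_reference`, and `nearly_equal` are illustrative names, not functions from ggml-qnn.cpp.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// A 2x2 f32 matrix stored row-major in a flat array.
using Mat2x2 = std::array<float, 4>;

// Reference element-wise addition computed on the host CPU.
Mat2x2 add_reference(const Mat2x2 &a, const Mat2x2 &b) {
    Mat2x2 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Returns true when every element of `got` is within `tol` of `expected`.
bool nearly_equal(const Mat2x2 &got, const Mat2x2 &expected, float tol = 1e-5f) {
    assert(tol >= 0.0f);
    for (std::size_t i = 0; i < got.size(); ++i) {
        if (std::fabs(got[i] - expected[i]) > tol) {
            return false;
        }
    }
    return true;
}
```

A tolerance is used instead of exact equality because different compute units may round f32 results slightly differently.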

    zhouwg commented 3 months ago

    updated on 04-08-2024 with this commit:

    PoC-S27 finished: offload a simple f32 2x2 matrix addition operation to the QNN GPU backend (thanks to the well-designed QNN SDK from Qualcomm)

    PoC-S29, S30, S32 & S33 finished: mapping ggml_tensor with add/mul/mulmat operations to QNN tensors with the QNN CPU/GPU backend

    zhouwg commented 2 months ago

    updated on 04-09-2024, a minor refinement:

    zhouwg commented 2 months ago

    updated on 04-10-2024, a minor refinement: the APK now runs well on any mainstream Qualcomm mobile SoC based Android phone, NOT only on the Xiaomi 14 or other Qualcomm Snapdragon 8 Gen 3 SoC based Android phones.

    (One test phone is a Xiaomi 14, which contains Qualcomm's high-end Snapdragon 8 Gen 3 mobile SoC.)

    (The other test phone costs about RMB 800-1200 (or USD 120-166; the price was RMB 1200 when I purchased it) and contains a low-end Qualcomm mobile SoC. There is a minor UI adjustment because this phone's screen width and height are not sufficient.)

    tongji1907 commented 2 months ago

    I am interested in your project, and I'm based in Shanghai. How can I reach you?

    zhouwg commented 2 months ago

    Thanks so much for your interest in this AI learning project.

    This AI learning project is powered by GGML (please refer to the background introduction of this PoC for the reason) and focuses on three heavyweight AI applications on Android devices: whisper.cpp, llama.cpp, and stablediffusion.cpp. It tries to unify them into ONE AI application (this project).

    There are some tasks on the to-do list currently:

    A Xiaomi 14 or another Qualcomm Snapdragon 8 Gen 3 SoC based Android phone is preferred for this PoC and this AI learning project.

    thanks again.

    zhouwg commented 2 months ago

    Summary of this PoC

    Todo(improve the quality of Qualcomm QNN backend for GGML)

    Highlights in 2nd milestone

    a complicated computation graph using QNN API(a reverse engineering implementation by Project KanTV)
      ![Screenshot_2024_0414_113706](https://github.com/zhouwg/kantv/assets/6889919/9dc327f1-f4ea-4d1d-ad18-61e891c00cb6)
    The data path of GGML's QNN backend (only GGML_OP_ADD, GGML_OP_MUL, and GGML_OP_MUL_MAT are supported) works as expected in three setups:
    (1) with whisper.cpp, using the QNN CPU and GPU backends, on Qualcomm SoC based Android phones from low-end to high-end (from RMB 800 to RMB 8000, or from USD 110 to USD 1100);
    (2) with whisper.cpp, using the QNN CPU, GPU, and HTP (aka DSP) backends, on a Qualcomm SoC based high-end phone (Xiaomi 14);
    (3) with llama.cpp, using the QNN CPU, GPU, and HTP (aka DSP) backends, on a Qualcomm SoC based high-end phone (Xiaomi 14).
    In other words, it is a real Qualcomm QNN backend for GGML, although it is not perfect.
      ![Screenshot_2024_0417_133635](https://github.com/zhouwg/kantv/assets/6889919/38ae8121-7425-49c1-94dd-a912032682b8) ![Screenshot_2024_0417_133531](https://github.com/zhouwg/kantv/assets/6889919/5cb594a4-038e-4c15-8b11-b19c1f54be16) ![Screenshot_2024_0417_133240](https://github.com/zhouwg/kantv/assets/6889919/b9fe8775-04da-44d4-8e8a-342301fb119c) ![Screenshot from 2024-04-17 19-52-58](https://github.com/zhouwg/kantv/assets/6889919/80be637b-c1b9-45dc-a1c8-477fe864aea7) ![426090730](https://github.com/zhouwg/kantv/assets/6889919/dcffc309-a4c3-4606-8536-d8ad167da76f) ![504893116](https://github.com/zhouwg/kantv/assets/6889919/51f0b277-eca4-4938-86f5-415dbf5897e7)
    a simple UT framework(so-called) was added for PoC-S49:implementation of other GGML OPs using QNN API
      ![Screenshot_2024_0418_131905](https://github.com/zhouwg/kantv/assets/6889919/b53d5309-8c25-437e-9e7f-71c12d61f013) ![Screenshot_2024_0418_131835](https://github.com/zhouwg/kantv/assets/6889919/b8f4bee6-01ba-4ea4-8734-396b18017425) ![Screenshot_2024_0418_131803](https://github.com/zhouwg/kantv/assets/6889919/e998f61a-6e33-412a-bde9-432bf183dcb4) ![Screenshot_2024_0418_131723](https://github.com/zhouwg/kantv/assets/6889919/e1dde47d-cdae-4ece-88ec-653383712670) ![Screenshot_2024_0418_131215](https://github.com/zhouwg/kantv/assets/6889919/8e6d532e-7c7a-4265-b41a-1a7e7f51fc8a) ![Screenshot_2024_0418_131018](https://github.com/zhouwg/kantv/assets/6889919/eeb9e72b-71b4-4932-809b-8b49f21ec29f) ![940276724](https://github.com/zhouwg/kantv/assets/6889919/ad097129-35c5-4118-b8c2-c420f4910888)
    4x performance gains for GGML_OP_MUL_MAT using QNN CPU backend with 1 thread on a Qualcomm mobile SoC based high-end Android phone(Xiaomi 14)
      ![1922265373](https://github.com/zhouwg/kantv/assets/6889919/d83ed630-d105-4818-9870-8cf539446c75) ![250505401](https://github.com/zhouwg/kantv/assets/6889919/cc3f450a-baac-4bce-be8c-0b5d1ea76efa)
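A speedup figure such as the 4x reported for GGML_OP_MUL_MAT is a ratio of two timings. The sketch below shows one generic way to take such a measurement: a naive single-threaded matmul as the baseline workload, an averaging timer, and the ratio itself. It is an assumption about how one might measure, not the method used for the screenshots above, and `matmul_naive`, `average_ms`, and `speedup` are illustrative names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Naive single-threaded matmul on row-major data: c (m x n) = a (m x k) * b (k x n).
void matmul_naive(const std::vector<float> &a, const std::vector<float> &b,
                  std::vector<float> &c, std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    c.assign(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
}

// Average wall-clock milliseconds per call of `fn` over `iters` calls.
double average_ms(const std::function<void()> &fn, int iters) {
    assert(iters > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count() / iters;
}

// Speedup of a candidate over a baseline: 8 ms against 2 ms is a 4x gain.
double speedup(double baseline_ms, double candidate_ms) {
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

In practice, the same matrix sizes and thread count should be used for both timings, and the candidate's output should be checked against the baseline's before the ratio is trusted.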

    Acknowledgements

    Thanks to the great, breakthrough help from @chiwwang (a technical expert from Qualcomm), thanks to Qualcomm's well-designed QNN SDK (I personally think it matches GGML really well), and thanks to the excellent implementation of the Intel SYCL backend of GGML (I learned a lot from this implementation, which comes from a great, open-minded company), this PoC could be done from 03-29-2024 to 04-22-2024, something I did not expect at all at the very beginning. Finally, I'd like to express my sincere thanks to the original author(s) of the great GGML/whisper.cpp, because I'm a real AI beginner and learned a lot of interesting things from these two great open-source C/C++ AI projects.

    All the source code of this PoC (nothing is held back locally) can be found in the branch "kantv-poc-with-qnn". I hope it is a little useful for programmers like me (who know nothing about real AI tech), because I'm an older programmer who does not belong to this great era, the 2020s. The code of this PoC will be merged into the master branch in the next few days because I think it is stable enough.

    Implementation of PoC-S2 & S3: "play with / say hello to" Qualcomm's QNN SDK and study the internal details of GGML

    header file of implementation of Qualcomm's QNN backend of GGML
    source file of implementation of Qualcomm's QNN backend of GGML

    caofx0418 commented 2 months ago

    Expert, how is llama2-7B inference performance on the Qualcomm 8 Gen 3? How many tokens/s?

    zhouwg commented 2 months ago

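    Thanks.

    Performance is good, roughly 20 tokens/s, and that is without using Qualcomm's hardware AI acceleration engine.

    Anyway, I sincerely hope Qualcomm steps in itself and invests people, money, and resources in a QNN backend, the way Intel did for the SYCL backend: that would save everyone the trouble, and we could simply use it.

    Thanks to the genius programmer Georgi Gerganov for bringing GGML to programmers and to humanity (thanks also to the author of the elegantly designed ggml backend subsystem; although I have reservations about some of his seemingly stubborn decisions, that elegantly designed backend subsystem gave someone like me, who does not understand hardcore AI technology, a little room to contribute). Georgi Gerganov is indeed another idealistic genius programmer from Europe, born in the 1990s (my guess), following Fabrice Bellard, the original author of FFmpeg, born in the 1970s (per public information).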

    caofx0418 commented 2 months ago

    20 tokens/s for Llama 7B is great!

    Running llama2-7B inference with llama.cpp on the 8 Gen 3 CPU with 4 threads gives at most 5 tokens/s.

    zhouwg commented 2 months ago

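    Your numbers are right; I misremembered earlier. My apologies.

    Google's gemma can reach 20 tokens/s on the Xiaomi 14, CPU only, without the Qualcomm backend. Here is a screenshot from April 24 for reference.

    The current QNN backend is only a basic prototype (the data path works); on the Xiaomi 14, whisper.cpp performance with it does not match pure CPU inference. My personal understanding: the QNN (aka AI Direct) SDK, which Qualcomm has put great effort into building, is the closed-source FFmpeg of the AI era on Qualcomm platforms, and programmers need to invest effort in learning how to use it correctly and efficiently to fully exploit the hardware acceleration of the platform's compute units (heterogeneous multi-core). I had originally planned to keep improving the ggml QNN backend and, with the community's joint effort, eventually bring it close to product quality. Anyway, I don't care about it (whether the PR is approved by the upstream community, although I hope it will be) at the moment.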

    nihui commented 2 months ago

    Admirable!

    zhouwg commented 2 months ago

    Thank you. Your company's open-source projects, and several of your own, are very well done.

    zhouwg commented 1 month ago

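    Only after I started studying ncnn seriously in May did I realize that you are a senior researcher at Tencent and a renowned AI expert in industry.

    I really was poorly informed: ncnn was open-sourced in 2017, and I only came across it by chance at the end of April 2024.

    This AI learning project uses, reuses, and references much of your ncnn example code. Thank you very much!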