Why is chatglm2-6b so slow in the fastllm performance benchmark on a P40 with CUDA 12.1 (only 8 tokens / s)?

heavenkiller2018 commented 1 year ago

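Results: with int4 quantization at batch 1 the speed is 8 tokens / s, only 1/20 of the 4090? 🤡🤡🤡 Also, fp16 at batch 1 is actually faster than int4 at batch 1; shouldn't it be slower? And batch 16 is far slower than batch 1. I can't make sense of these results: (1) why is batch 16 slower than batch 1, (2) why is fp16 faster than int4, and (3) why does the P40 reach only 1/20 of the 4090's speed? The two cards do differ in performance, but surely not by this much. @ztxz16, which part has gone wrong here: the GPU, the model, or fastllm?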

Reference figures:

Model | Data precision | Platform | Batch | Max inference speed (tokens / s)
-- | -- | -- | -- | --
ChatGLM-6b-int4 | float32 | RTX 4090 | 1 | 176
ChatGLM-6b-int8 | float32 | RTX 4090 | 1 | 121
ChatGLM-6b-fp16 | float32 | RTX 4090 | 64 | 2919
ChatGLM-6b-fp16 | float32 | RTX 4090 | 256 | 7871
ChatGLM-6b-fp16 | float32 | RTX 4090 | 512 | 10209
ChatGLM-6b-int4 | float32 | Xiaomi 10 Pro - 4 Threads | 1 | 4 ~ 5

Test environment:

GPU: Tesla P40
NVIDIA-SMI 530.30.02              
Driver Version: 530.30.02    
CUDA Version: 12.1
Python 3.9.17

Test data:

int4_1

./benchmark -p /home/john/tmp/chatglm2-6b-int4.flm -f ../example/benchmark/prompts/beijing.txt -b 1

Load (200 / 200)
Warmup...
finish.
AVX: ON
AVX2: ON
AARCH64: OFF
Neon FP16: OFF
Neon DOT: OFF
[ user: "[Round 0]
问：北京有什么景点？
答：", model: " 北京是一个历史悠久、文化底蕴深厚的城市,有许多著名的景点和历史遗迹。以下是一些著名的北京景点:

1. 故宫博物院:故宫是中国明清两朝的皇宫,也是现在的博物馆,收藏着大量的历史文物和艺 术品。

2. 天安门广场:这是中国的国家象征之一,也是世界上最大的城市广场之一,每天都会有许多 人在这里集会、游行和庆祝。

3. 颐和园:这是一个古老的皇家园林,被誉为“万园之园”,是一个非常适合漫步和休闲的地方 。

4. 长城:这是中国最著名的古迹之一,也是世界七大奇迹之一,可以在不同的地点欣赏到美丽 的风景和山脉。

5. 北海公园:这是一个美丽的园林,包括北海、静安寺、中轴线和白塔等景点,是一个放松和 休闲的好地方。

6. 北京鸟巢和水立方:这是2008年北京奥运会的主场馆和游泳馆,是世界上最先进的体育场馆之一。

7. 天坛公园:这是一个古老的皇家祭祀园林,是中国最著名的祭祀建筑之一,也是世界文化遗 产。

8. 北京动物园:这是一个大型的动物园,拥有大量的动物品种,包括熊猫、大象、老虎和长颈 鹿等。

9. 王府井大街:这是北京最著名的商业街之一,有很多商店、餐厅和旅游景点,是一个非常适 合购物和探索的地方。

10. 景山公园:这是一个美丽的公园,包括景山、北京动物园和北京植物园等景点,适合欣赏城市美景和自然景观。"]
batch: 1
output 325 tokens
use 38.638680 s
speed = 8.411261 tokens / s

int4_16

./benchmark -p /home/john/tmp/chatglm2-6b-int4.flm -f ../example/benchmark/prompts/hello.txt -b 16 -l 18

batch: 16
output 144 tokens
use 22.783129 s
speed = 6.320467 tokens / s

int4_512

$ ./benchmark -p /home/john/tmp/chatglm2-6b-int4.flm -f ../example/benchmark/prompts/hello.txt -b 512 -l 18

[ user: "[Round 0]
问：Hello！
答：", model: " Hello! How can I assist you today?"]

batch: 512
output 4608 tokens
use 1299.657227 s
speed = 3.545550 tokens / s

fp16_1

./benchmark -p /home/john/tmp/chatglm2-6b-fp16.flm -f ../example/benchmark/prompts/beijing.txt -b 1

batch: 1
output 265 tokens
use 16.795168 s
speed = 15.778348 tokens / s

fp16_16

./benchmark -p /home/john/tmp/chatglm2-6b-fp16.flm -f ../example/benchmark/prompts/hello.txt -b 16 -l 18

batch: 16
output 144 tokens
use 33.888264 s
speed = 4.249259 tokens / s
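
The reported speeds can be cross-checked from the "output N tokens" and "use T s" lines alone. Below is a minimal sketch that does this for the P40 runs above; the regular expression assumes the log format seen in this thread, and the names `RUN_PATTERN`, `tokens_per_second`, and `runs` are made up for illustration.

```python
import re

# Matches the "output N tokens" / "use T s" pair printed by the runs above.
# The format is inferred from this thread only; other versions may print differently.
RUN_PATTERN = re.compile(r"output\s+(\d+)\s+tokens\s+use\s+([\d.]+)\s+s")


def tokens_per_second(log_text: str) -> float:
    """Recompute throughput from one benchmark log excerpt."""
    match = RUN_PATTERN.search(log_text)
    if match is None:
        raise ValueError("no 'output ... tokens' / 'use ... s' pair found")
    tokens = int(match.group(1))
    seconds = float(match.group(2))
    return tokens / seconds


# Figures copied from the P40 runs in this comment.
runs = {
    "int4, batch 1": "output 325 tokens\nuse 38.638680 s",
    "int4, batch 16": "output 144 tokens\nuse 22.783129 s",
    "int4, batch 512": "output 4608 tokens\nuse 1299.657227 s",
    "fp16, batch 1": "output 265 tokens\nuse 16.795168 s",
    "fp16, batch 16": "output 144 tokens\nuse 33.888264 s",
}

for name, log in runs.items():
    print(f"{name:>16}: {tokens_per_second(log):8.3f} tokens / s")
```

The recomputed values agree with the "speed =" lines in the logs, so the odd ordering (batch 16 slower than batch 1, int4 slower than fp16) is in the measurements themselves rather than in the reporting.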

Justin18Chan commented 1 year ago

[root@localhost build]# ./benchmark -p /root/ChatGLM2-6B/deploy/flm/chatglm26b_lora/chatglm26b_fp16.flm -f ../example/benchmark/prompts/hello.txt -b 16 -l 18
Load (200 / 200)
Warmup...
finish.
AVX: ON
AVX2: ON
AARCH64: OFF
Neon FP16: OFF
Neon DOT: OFF

batch: 16
output 288 tokens
use 0.520529 s
speed = 553.283325 tokens / s

This is what I get on my A40; the speed seems fine.

sun1092469590 commented 1 year ago

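A quick question: with the same batch size and the same A40 setup, what speed do you get without the acceleration? Is the speedup noticeable?
Yours is the batch=16 speed. Converted to batch=1, would that be 34 tokens / s?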

HongyuJiang commented 1 year ago

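Don't convert it like that. batch=1 means feeding the model one input at a time, and batch=16 means feeding it 16 inputs at once. At batch=1 the model is not fully loaded, so the speed cannot be converted directly. @sun1092469590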
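
To see why dividing a batched throughput by the batch size does not give the batch=1 speed, here is a toy cost model. It is only an illustration: it assumes each decoding step pays a fixed cost that does not depend on the batch size plus a small per-sequence cost, and both constants are invented, not measured on any GPU.

```python
def step_time_ms(batch: int, fixed_ms: float, per_seq_ms: float) -> float:
    """Toy cost of one decoding step.

    fixed_ms   -- cost paid once per step regardless of batch size (made up)
    per_seq_ms -- extra cost for each sequence in the batch (made up)
    """
    return fixed_ms + batch * per_seq_ms


def throughput(batch: int, fixed_ms: float = 20.0, per_seq_ms: float = 0.5) -> float:
    """Aggregate tokens/s: every step emits one token per sequence."""
    return batch * 1000.0 / step_time_ms(batch, fixed_ms, per_seq_ms)


for batch in (1, 16):
    total = throughput(batch)
    print(f"batch={batch:>2}: total {total:7.1f} tokens/s, "
          f"divided by batch {total / batch:6.1f} tokens/s")
```

In this model the batch=16 total divided by 16 comes out lower than the real batch=1 figure, because at batch=1 the fixed per-step cost is not shared with any other sequence. Real hardware behaves less neatly, which is one more reason not to convert between batch sizes.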

ztxz16 commented 1 year ago

I don't have a P40. My guess is that it has no int4 compute units.

HongyuJiang commented 1 year ago

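@heavenkiller2018 I ran into the same problem. With the model-loading method recommended in the README at the top level of the repository, the speed went down instead of up: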

from fastllm_pytools import llm
model = llm.model("model.flm")

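With the method recommended in the README under the pyfastllm subdirectory, however, the speed improved dramatically (up 100% at batch 1 and up 200% at batch 200; on a 3090 the peak generation speed at batch 200 was 2500 tokens / s):

import sys
sys.path.append('./build-py')
import pyfastllm # or fastllm

I suggest the author tidy up the project layout and explain the difference between the two methods.

Test environment:
torch: 2.1.0.dev20230718+cu121
cuda: 12.2
GPU: 3090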
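
A comparison like this is easier to reproduce with one timing harness applied to both loading methods. The sketch below is a generic harness and does not call fastllm: `generate` is a hypothetical callable that returns the generated tokens for a prompt, and `fake_generate` is a stand-in so the example runs on its own. Wrapping either binding in such a callable is left to the reader, since the exact API is not shown in this thread.

```python
import time
from typing import Callable, List


def measure_tokens_per_second(
    generate: Callable[[str], List[str]],
    prompts: List[str],
    warmup: int = 1,
) -> float:
    """Time `generate` over all prompts and return generated tokens per second."""
    for _ in range(warmup):
        generate(prompts[0])  # warm-up call, excluded from the measurement

    start = time.perf_counter()
    total_tokens = sum(len(generate(prompt)) for prompt in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate(prompt: str) -> List[str]:
    """Hypothetical stand-in for a real model call: sleeps, then echoes words."""
    time.sleep(0.01)
    return prompt.split()


if __name__ == "__main__":
    speed = measure_tokens_per_second(fake_generate, ["hello there", "what is fastllm"])
    print(f"{speed:.1f} tokens / s")
```

Using the same prompts, the same warm-up, and the same token counting for both methods keeps the comparison about the binding rather than about the measurement.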

sun1092469590 commented 1 year ago

I see, got it. Thanks!

TylunasLi commented 1 year ago

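This is my understanding and may not be right. The first method is only meant for the case where pyfastllm has not been compiled: Python loads the shared library libfastllm_tools.so to convert the model. That loading path involves memory copies, so performance takes a hit. The second method uses pybind11 to build an API for Python, so the calling method does not cost any performance.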

HL0718 commented 1 year ago

A question: in which directory should python cli.py -m chatglm -p chatglm-6b-int8.bin be run? I first created a build-py directory under fastllm, then ran cd build-py, cmake .. -DUSE_CUDA=ON -DPY_API=ON, and make -j4. After that I went to pyfastllm/demo and ran python cli.py -m chatglm -p chatglm-6b-int8.bin, which failed with No module named 'pyfastllm'. Could you give the detailed steps?

heavenkiller2018 commented 1 year ago

Install the fastllm_pytools package with the following commands:

cd fastllm
mkdir build
cd build
cmake .. -DUSE_CUDA=ON # to build without GPU support, use cmake .. -DUSE_CUDA=OFF
make -j
cd tools && python setup.py install

@HL0718 You probably didn't run the last line, did you?

weiyuhan commented 1 year ago

@heavenkiller2018 I ran into the same problem:

./benchmark -p chatglm2-6b-int4.flm -f ../example/benchmark/prompts/hello.txt -b 16 -l 18

Load (200 / 200)
Warmup...
finish.
AVX: ON
AVX2: ON
AARCH64: OFF
Neon FP16: OFF
Neon DOT: OFF

batch: 16
prompt token number = 256
prompt use 25.356812 s
prompt speed = 10.095906 tokens / s
output 128 tokens
use 16.395212 s
speed = 7.807157 tokens / s

I'm also on a P40, with CUDA 12.2. Could it be that the P40 has relatively few CUDA cores?

Also, after some preliminary research: it may be that the P40 has no Tensor Cores, so matrix multiplication is weak, and that is why batching is even slower? I'm a GPU novice, so comments are welcome.

References:

https://blog.csdn.net/pearl8899/article/details/112875396
https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/ : Tensor Cores are tiny cores that perform very efficient matrix multiplication. Since the most expensive part of any deep neural network is matrix multiplication Tensor Cores are very useful. In fast, they are so powerful, that I do not recommend any GPUs that do not have Tensor Cores.
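
Whether a card is likely to have Tensor Cores can be guessed from its CUDA compute capability. The sketch below is a rough heuristic, not an authoritative check: Tensor Cores first appeared with compute capability 7.0 (Volta), and the Tesla P40 is a Pascal card at 6.1. Some 7.5 parts, such as the GTX 16 series, ship without them, so a `True` result means "probably". The function name and the `GPUS` table are made up for illustration.

```python
from typing import Tuple


def likely_has_tensor_cores(compute_capability: Tuple[int, int]) -> bool:
    """Heuristic: Tensor Cores were introduced with compute capability 7.0.

    A True result is only "probably": a few 7.5 cards (GTX 16 series) lack them.
    """
    major, _minor = compute_capability
    return major >= 7


# Compute capabilities of the cards mentioned in this thread.
GPUS = {
    "Tesla P40": (6, 1),
    "A40": (8, 6),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
}

for name, capability in GPUS.items():
    verdict = "likely" if likely_has_tensor_cores(capability) else "no"
    print(f"{name:>9} (sm_{capability[0]}{capability[1]}): Tensor Cores: {verdict}")
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns the same kind of `(major, minor)` tuple and can be passed straight to the function.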

yuanphoenix commented 1 year ago

@HL0718 Hi, did you get this resolved?

HongyuJiang commented 1 year ago

It should be run from the fastllm/pyfastllm directory. There is a README in there you can refer to. @yuanphoenix @HL0718

yuanphoenix commented 1 year ago

Thanks, solved. Putting the .so file next to the Python file was enough. First time I've seen this kind of loading 😂
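
The "No module named 'pyfastllm'" error and this fix both come down to where Python looks for modules. Here is a minimal sketch of a check that can be run before importing; `ensure_importable` is a hypothetical helper, and `./build-py` is only the directory name used earlier in this thread.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def ensure_importable(module_name: str, search_dir: str) -> bool:
    """Add search_dir to sys.path (once) and report whether module_name can be found.

    This only locates the module; it does not import it.
    """
    path = str(Path(search_dir).resolve())
    if path not in sys.path:
        sys.path.append(path)
    importlib.invalidate_caches()  # make sure newly built files are noticed
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    found = ensure_importable("pyfastllm", "./build-py")
    print("pyfastllm importable:", found)
```

If this prints False, the compiled extension is not in the directory that was added, which matches the situation described above. Copying the built .so next to the script works for the same reason: the script's own directory is on `sys.path` by default.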

wildkid1024 commented 1 year ago

The pyfastllm installation process will be improved later.

Cloopen-ReLiNK commented 1 year ago

@yuanphoenix Could you be more specific? When running python web_api.py -p under fastllm/pyfastllm/examples, which .so files need to be copied in? There is only one, libfastllm_tools.so, right?

yuanphoenix commented 1 year ago

I forget the exact name. It is the .so file produced by compiling pyfastllm.

wildkid1024 commented 1 year ago

It can now be installed directly from the pyfastllm directory.

TylunasLi commented 1 year ago

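On a P40 you can try this branch. It adds the build option CUDA_NO_TENSOR_CORE, which changes how matrix multiplication is computed, and the speed reaches 22-25 tokens / s. However, changing the matrix multiplication method means the model's output no longer matches the original inference results exactly.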

crosys commented 11 months ago

Would int8 run faster on the P40? According to the official spec sheet, the P40 supports int8 acceleration, at up to 47 TOPS.
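
Spec-sheet numbers also allow a rough upper bound on batch-1 decoding speed, on the simplifying assumption that generating one token requires reading every weight from GPU memory once. Everything in this sketch is an assumption rather than a measurement: the bandwidth figures are approximate spec-sheet values, `N_PARAMS` is a round number for a "6B" model, and the bytes-per-parameter values ignore quantization scales and any layers left unquantized.

```python
def decode_upper_bound(bandwidth_gb_s: float, n_params: float, bytes_per_param: float) -> float:
    """Upper bound on batch-1 tokens/s if each token reads all weights once."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes


# Approximate spec-sheet memory bandwidths in GB/s (assumptions, not measured here).
BANDWIDTH_GB_S = {"Tesla P40": 346.0, "RTX 4090": 1008.0}

N_PARAMS = 6.2e9  # rough parameter count for a "6B" model (assumption)
BYTES_PER_PARAM = {"int4": 0.5, "int8": 1.0, "fp16": 2.0}  # ignores scales and overhead

for gpu, bandwidth in BANDWIDTH_GB_S.items():
    for dtype, nbytes in BYTES_PER_PARAM.items():
        bound = decode_upper_bound(bandwidth, N_PARAMS, nbytes)
        print(f"{gpu:>9} {dtype}: at most about {bound:6.1f} tokens / s")
```

Under these assumptions memory bandwidth alone would put the two cards roughly 3x apart, and the int4 bound for the P40 sits well above the 8 tokens / s measured at the top of the thread. If the assumptions hold, that points toward the compute path on the P40 rather than memory traffic, which is in line with the Tensor Core and int4 guesses made earlier in the thread.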

ztxz16 / fastllm