ztxz16 / fastllm

A pure C++ LLM acceleration library for all platforms, callable from Python. ChatGLM-6B-class models can reach 10000+ tokens/s on a single GPU. Supports GLM, LLaMA, and MOSS base models, and runs smoothly on mobile devices.
Apache License 2.0

Error when converting model format (.bin -> .flm) #413

Open ColorfulDick opened 7 months ago

ColorfulDick commented 7 months ago

When converting the SUS-Chat-34B model (which is fully compatible with the LLaMA architecture) to the flm format, I got this error:

root@5ce5bafeea81:/app# python glm_trans_flm.py 
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 7/7 [01:09<00:00,  9.90s/it]
convert ( 543 / 543 )
Warmup...
FastLLM Error: Reshape error.

terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
Aborted (core dumped)

The conversion script is below; it follows the official fastllm example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "./SUS-Chat-34B"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.float16, trust_remote_code=True
).eval()

# The two lines below convert the Hugging Face model into a fastllm model.
# Currently from_hf only accepts original (unquantized) models, or ChatGLM int4/int8
# quantized models; other quantized models cannot be converted yet.
from fastllm_pytools import llm
llm.set_device_map(["cuda:0", "cuda:1", "cuda:2", "cuda:3", "cuda:4"])
model = llm.from_hf(model, tokenizer, dtype="float16")  # dtype supports "float16", "int8", "int4"
model.save("./SUS-Chat-34B.flm")

How can this be resolved? The CUDA version is 12.2, and the same code converts chatglm3-6b and baichuan2 without any problems.
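
For reference, a minimal diagnostic sketch for comparing the attention settings of the models involved. The field names are the standard Hugging Face config attributes for LLaMA-style models (num_attention_heads, num_key_value_heads); other architectures may name them differently, so getattr() defaults are used, and the local paths for chatglm3-6b and Baichuan2 are hypothetical placeholders:

# Sketch: print attention-head settings for each model to see what differs
# between the models that convert cleanly and the one that fails.
from transformers import AutoConfig

for path in ["./SUS-Chat-34B", "./chatglm3-6b", "./Baichuan2-13B-Chat"]:  # hypothetical paths
    cfg = AutoConfig.from_pretrained(path, trust_remote_code=True)
    print(path,
          "attention_heads =", getattr(cfg, "num_attention_heads", None),
          "kv_heads =", getattr(cfg, "num_key_value_heads", None))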

TylunasLi commented 7 months ago

I reproduced the problem with SUS-Chat-34B as well. My test environment isn't set up properly yet, so I haven't found a solution so far...

TylunasLi commented 7 months ago

Tested with Yi-6B and found that the cause is that fastllm does not yet support Grouped Query Attention. A fix is in progress.
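
For context, a minimal sketch (not fastllm's actual code) of why a plain multi-head-attention reshape breaks on a GQA model. The shapes assume the published Yi-34B / SUS-Chat-34B config (hidden_size 7168, 56 attention heads, 8 key/value heads):

import torch

hidden_size = 7168
num_attention_heads = 56
num_key_value_heads = 8                          # GQA: fewer KV heads than query heads
head_dim = hidden_size // num_attention_heads    # 128

batch, seq_len = 1, 16
# In a GQA model, k_proj/v_proj output num_key_value_heads * head_dim = 1024 features,
# not hidden_size = 7168.
k = torch.randn(batch, seq_len, num_key_value_heads * head_dim)

# A loader that assumes plain multi-head attention tries something like this
# and fails, because 16 * 1024 elements cannot fill a (1, 16, 56, 128) tensor.
try:
    k.view(batch, seq_len, num_attention_heads, head_dim)
except RuntimeError as e:
    print("MHA-style reshape fails:", e)

# The GQA-aware path: reshape with num_key_value_heads, then repeat each KV head
# num_attention_heads // num_key_value_heads times so attention shapes line up.
k = k.view(batch, seq_len, num_key_value_heads, head_dim)
k = k.repeat_interleave(num_attention_heads // num_key_value_heads, dim=2)
print(k.shape)  # torch.Size([1, 16, 56, 128])

Once the loader reshapes K/V with num_key_value_heads and repeats (or broadcasts) them per query-head group, the per-head attention shapes match again, which is presumably what the fix needs to add.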