Abnormal GPU resource usage - Githubissues

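When running inference with the script given on the project home page, I found that the GPU is only occupied while the original model is being loaded with AutoModel.from_pretrained; every later step runs on the CPU. The local environment is CentOS 7 with 4 x V100 32G. Note: the build command cmake .. -DUSE_CUDA=ON && make -j failed, and I worked around it with cmake .. -DUSE_CUDA=ON -DCMAKE_CXX_STANDARD=17 && make -j. I am not sure whether this workaround is the reason the GPU cannot be used.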
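
One way to check whether the workaround build actually had CUDA switched on is to inspect the CMakeCache.txt file that CMake writes into the build directory. The following is a minimal sketch and not part of fastllm: the helper names `parse_cmake_cache` and `read_cmake_cache` and the `build/CMakeCache.txt` path are assumptions made for illustration, so adjust the path to the directory where cmake was actually run.

```python
import re
from pathlib import Path

# Cache entries have the form NAME:TYPE=VALUE.
# Lines starting with '#' or '//' are comments and are skipped.
_ENTRY = re.compile(r"^([^#/:=][^:=]*):([^=]*)=(.*)$")


def parse_cmake_cache(text):
    """Return a {name: value} dict for the entries in CMakeCache.txt content."""
    entries = {}
    for line in text.splitlines():
        match = _ENTRY.match(line.strip())
        if match:
            name, _entry_type, value = match.groups()
            entries[name] = value
    return entries


def read_cmake_cache(path="build/CMakeCache.txt"):
    """Read and parse a cache file; the default path is a hypothetical example."""
    return parse_cmake_cache(Path(path).read_text())


if __name__ == "__main__":
    cache_path = Path("build/CMakeCache.txt")  # assumed location of the build directory
    if cache_path.exists():
        cache = read_cmake_cache(cache_path)
        print("USE_CUDA =", cache.get("USE_CUDA", "<not set>"))
        print("CMAKE_CXX_STANDARD =", cache.get("CMAKE_CXX_STANDARD", "<not set>"))
    else:
        print(f"{cache_path} not found; point cache_path at your build directory")
```

If USE_CUDA is reported as ON here, the C++ standard flag alone probably did not disable CUDA at configure time, and the cause is more likely to lie elsewhere, for example in which compiled library the Python package ends up loading.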

# Original program: create the model through the Hugging Face interface
import time  # needed for the timing in the loop below
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code = True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code = True)

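# Add the following two lines to convert the Hugging Face model into a fastllm model
# Currently from_hf only accepts the original model or ChatGLM's int4/int8 quantized models; other quantized models cannot be converted yet
from fastllm_pytools import llm
model = llm.from_hf(model, tokenizer, dtype = "float16") # dtype supports "float16", "int8", "int4"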

# Comment out this line: model.eval()
#model = model.eval()

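while True:
    input_text = input("Enter your input: ")
    # Start timing after the prompt is read so typing time is not counted
    t1 = time.time()
    # Generate the reply
    print(model.response(input_text))
    t2 = time.time()
    print(f"Inference time: {t2 - t1:.2f} s")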
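
To confirm whether inference really stays on the CPU, GPU utilization and memory can be sampled from a second terminal while model.response is running. The following is a minimal sketch, assuming nvidia-smi is installed and on the PATH; the helper names `parse_gpu_stats` and `query_gpu_stats` are made up for illustration and are not part of fastllm.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU: index, utilization %, memory used (MiB)
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """Convert a CSV field to int; return None for values such as '[N/A]'."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_stats(output):
    """Parse the CSV text printed by the query above into a list of dicts."""
    stats = []
    for line in output.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not an index/util/memory row
        index, util, mem = fields
        stats.append({
            "index": _to_int(index),
            "util_percent": _to_int(util),
            "memory_used_mib": _to_int(mem),
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed per-GPU statistics."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in query_gpu_stats():
            print(f"GPU {gpu['index']}: {gpu['util_percent']}% util, "
                  f"{gpu['memory_used_mib']} MiB used")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"Could not query nvidia-smi: {exc}")
```

If utilization stays near 0% and memory use does not grow on any of the four cards while a reply is being generated, that is consistent with the behaviour described above, where only the initial model load touches the GPU.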

ztxz16 / fastllm

Abnormal GPU resource usage #348