Running a quantized model - Githubissues

zjunlp / DeepKE

[EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction

http://deepke.zjukg.cn/

MIT License

3.57k stars 686 forks

Running a quantized model #458

Closed HuiGe88 closed 7 months ago

HuiGe88 commented 7 months ago

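Hello. Regarding deepke-llm: my GPU does not have enough memory, so I would like to run the project with a quantized LLaMA model. I am not concerned about the loss in quality for now; I only want to see it run. Which quantized LLaMA model should I use so that the project runs successfully? Could you recommend one? Also, what changes do I need to make to the original project in order to use a quantized LLaMA model? Thank you very much for your guidance. Your work is very valuable and has helped me a lot!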

guihonghao commented 7 months ago

You can use https://huggingface.co/zjunlp/baichuan2-13b-iepile-lora with 4-bit quantization.

HuiGe88 commented 7 months ago

Hello, thank you for the guidance. Do I also need to download the Baichuan base model?

HuiGe88 commented 7 months ago

Is Baichuan2-13B-Chat-4bit the model I should download?

guihonghao commented 7 months ago

You need to download both of these models:
model_path = 'baichuan-inc/Baichuan2-13B-Chat'
lora_path = 'zjunlp/baichuan2-13b-iepile-lora'

guihonghao commented 7 months ago

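import torch
from peft import PeftModel
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

# Base model and LoRA adapter named earlier in this thread
model_path = 'baichuan-inc/Baichuan2-13B-Chat'
lora_path = 'zjunlp/baichuan2-13b-iepile-lora'

# `config` was not defined in the snippet as posted; loading it with
# AutoConfig is an assumption about what was intended.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

# 4-bit NF4 quantization with double quantization, computing in bfloat16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Load the quantized base model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Attach the LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(
    model,
    lora_path,
)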

The README explains how to do 4-bit quantization, but GPU memory usage is still around 16 GB.
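For a rough sense of scale, here is a minimal sketch that estimates the memory taken by the model weights alone at different precisions. It assumes a nominal 13 billion parameters (read off the "13B" in the model name, not an exact count), and the helper `weight_memory_gb` is a hypothetical name used only for illustration. The estimate ignores activations, the KV cache, quantization constants, the LoRA adapter, and CUDA overhead, so it is only a lower bound; actual usage is higher, as the roughly 16 GB figure mentioned in this thread shows.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the weights-only footprint in GB (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal parameter count taken from the "13B" model name.
NUM_PARAMS = 13e9

# Compare weights-only footprints at common precisions.
for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Measuring on the target GPU remains the only reliable way to know whether a given setup fits.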

HuiGe88 commented 7 months ago

My GPU only has 12 GB of memory.

HuiGe88 commented 7 months ago

Could I try a quantized Baichuan model instead?

guihonghao commented 7 months ago

You can give it a try.

zxlzr commented 7 months ago

Do you have any other questions?