limcheekin opened this issue 1 year ago
This is something I am interested in too.
It seems to work: loading the 3B version without QLoRA requires about 14 GB of GPU RAM, while with QLoRA it needs only about 3 GB of VRAM. You can try it yourself:
# Install the latest bitsandbytes, plus transformers, peft and accelerate from source
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
# Other requirements for the demo
!pip install gradio
!pip install sentencepiece
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer, BitsAndBytesConfig
# model_name = 'openlm-research/open_llama_3b_600bt_preview'
models = {
    "open_Alpaca": "openllmplayground/openalpaca_3b_600bt_preview"
}
model_name = models["open_Alpaca"]
print(f"Starting to load the model {model_name} into memory")
# 4-bit NF4 quantization with double quantization, the setup used by QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
# see https://github.com/openlm-research/open_llama#preview-weights-release-and-usage
tokenizer.bos_token_id, tokenizer.eos_token_id = 1,2
# same prompt as provided in https://crfm.stanford.edu/2023/03/13/alpaca.html
instruction = r'What is an alpaca? How is it different from a llama?'
# Other example prompts to try:
# instruction = r'Write an e-mail to congratulate new Stanford admits and mention that you are excited about meeting all of them in person.'
# instruction = r'What is the capital of Tanzania?'
# instruction = r'Write a well-thought out abstract for a machine learning paper that proves that 42 is the optimal seed for training neural networks.'
prompt_no_input = f'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:'
tokens = tokenizer.encode(prompt_no_input)
device = "cuda:0"
# note you have to add .to(device) here
tokens = torch.LongTensor(tokens).unsqueeze(0).to(device)
instance = {
    'input_ids': tokens,
    'top_k': 50,
    'top_p': 0.9,
    'generate_len': 128
}
length = len(tokens[0])
with torch.no_grad():
    rest = model.generate(
        input_ids=tokens,
        max_length=length + instance['generate_len'],
        use_cache=True,
        do_sample=True,
        top_p=instance['top_p'],
        top_k=instance['top_k']
    )
output = rest[0][length:]
string = tokenizer.decode(output, skip_special_tokens=True)
print(f'[!] Generation results: {string}')
Outcome:
Generation results: Alpacas are closely related to llamas. They are even part of the same family. Alpacas have soft fur and are generally smaller in size.
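For reference, the ~14 GB vs ~3 GB figures quoted above can be sanity-checked with torch's CUDA memory counters. A minimal sketch (run right after the model finishes loading; exact numbers will vary with GPU, driver, and library versions):
# Peak VRAM allocated on GPU 0 since process start; gives a rough idea of the
# loading footprint with and without 4-bit quantization.
peak_gib = torch.cuda.max_memory_allocated(0) / 1024**3
print(f"Peak GPU memory allocated: {peak_gib:.1f} GiB")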
I just came to know about this project. I attempted something similar with https://github.com/vihangd/alpaca-qlora, which ports alpaca-lora to QLoRA; you can use it to fine-tune OpenLLaMA models. I am trying to make it work with other models too.
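For anyone who wants to try the fine-tuning side, the usual peft recipe looks roughly like the sketch below. This is only an illustration under assumptions: it reuses the 4-bit bnb_config and model_name from the snippet above, picks q_proj/v_proj as LoRA target modules (a common choice for LLaMA-style attention, not necessarily what alpaca-qlora uses), and leaves out the dataset and Trainer setup.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM

# Load the base model in 4-bit (same bnb_config as above) and prepare it for k-bit training.
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
base_model = prepare_model_for_kbit_training(base_model)

# LoRA adapter config; the hyperparameters here are placeholders, not tuned values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the small adapter matrices are trainable
# From here, pass peft_model to a transformers Trainer (or your own loop) with an instruction dataset.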
Very nice attempt, keep me/us updated
Hi there,
Thanks for sharing.
Are there any plans to support QLoRA? Please see the following paper for more information: https://arxiv.org/abs/2305.14314
Thanks.