Open WanBenLe opened 8 months ago
Which script are you running?
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --multi_gpu --num_processes=4 --debug './~.py'
train_gsm8k.py will raise the same error.
Could you provide the full training command? Unfortunately, multi-GPU training for quantized models is not supported yet, because we use bitsandbytes quantization, which doesn't support it. So only a full precision model can be trained on multiple GPUs. To do so, it is important to enable --full_precision. (I have changed the explanation about this argument. It was wrong.)
We provide example training scripts here. For your case,
# train 4-bit 64-rank llama-2-7b with LoftQ on GSM8K using 8 A100s
accelerate launch train_gsm8k.py \
--full_precision \
--model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
--learning_rate 3e-4 \
--seed 11 \
--expt_name gsm8k_llama2_7b_4bit_64rank_loftq_fake \
--output_dir exp_results/ \
--num_train_epochs 6 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--weight_decay 0.1 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 10 \
--do_train \
--report_to tensorboard
Well, thanks for your help. With my best wishes.
Now that QLoRA can be used with FSDP/DeepSpeed ZeRO, I was wondering whether LoftQ can be combined with them.
I set the BnB config as recommended by https://huggingface.co/docs/peft/main/en/accelerate/deepspeed#use-peft-qlora-and-deepspeed-with-zero3-for-finetuning-large-models-on-multiple-gpus, but the program hangs.
    import torch
    from peft import LoftQConfig, LoraConfig, get_peft_model
    from transformers import BitsAndBytesConfig, LlamaForCausalLM

    # cfg, config, and attn_implementation are not defined in this snippet.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        # Notice that torch_dtype for AutoModelForCausalLM is same as the bnb_4bit_quant_storage data type.
        # For FSDP/ Deepspeed ZeRO
        bnb_4bit_quant_storage=torch.bfloat16,
    )
    model = LlamaForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,
        config=config,
        attn_implementation=attn_implementation,
    )
    config = LoraConfig(
        r=cfg.training.lora_config.lora_r,
        lora_alpha=cfg.training.lora_config.lora_alpha,
        target_modules=cfg.training.lora_config.lora_target_modules,
        lora_dropout=cfg.training.lora_config.lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        init_lora_weights="loftq",
        loftq_config=LoftQConfig(
            loftq_bits=4,
            loftq_iter=1,
        ),
    )
    model = get_peft_model(model, config)  # hang here
Log:
Weight: (4194304, 1) | Rank: 64 | Number Iter: 1 | Num Bits: 4
....
(Then stuck at initializing peft model...)
The config above sets the bnb_4bit_quant_storage=torch.bfloat16 and bnb_4bit_use_double_quant=True args, but even with both turned off I still cannot make it work. If you have any feedback, please let me know :(
Could you share the value of cfg.base_model?
If it is a model from the LoftQ HuggingFace repo, the problem could be how they implement QLoRA with FSDP. Chances are they shard the weight and then quantize the sharded weight. However, the checkpoints in the LoftQ HuggingFace repo are already quantized, so sharding the quantized weight may fail.
If it is a model you obtained with quantized_save.py in this repo, it should follow the same logic as QLoRA and there shouldn't be any problem.
Please let me know which case you are in.
Thank you for your prompt reply.
Sorry, I am in neither of these two cases. I was trying to initialize the LoRA weights with LoftQ for the original Llama 2 base model, and I would like to do it on the fly if possible. I have updated my previous post with the full code snippet showing where I get stuck.
(I know that this is not the recommended flow, but I don't understand why, other than the latency problem.)
LoftQ obtains the quantized weight $Q$ and LoRA adapters $A, B$ by minimizing $||W - Q - AB^{\top}||$, where $W$ is the full precision weight. When you call model = get_peft_model(model, config), we require the model to be full precision, but the model in your code is actually already quantized. The algorithm treats the quantized weight as the full precision weight $W$ and therefore fails.
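To make the objective concrete, here is a minimal NumPy sketch of the alternating scheme: quantize $W - AB^{\top}$, then take a rank-r SVD of $W - Q$. It is an illustration under assumptions, not the code in this repo or in PEFT: fake_quantize is a hypothetical uniform absmax quantizer standing in for NF4, and loftq_init is a hypothetical helper name.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Uniform absmax quantize-then-dequantize (an assumed stand-in for NF4)."""
    levels = 2 ** (bits - 1) - 1              # 7 levels on each side for 4 bits
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

def loftq_init(w, rank=8, bits=4, num_iter=1):
    """Alternate between Q = quant(W - A B^T) and a rank-r SVD of W - Q."""
    a = np.zeros((w.shape[0], rank))
    b = np.zeros((w.shape[1], rank))
    for _ in range(num_iter):
        q = fake_quantize(w - a @ b.T, bits)
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        a = u[:, :rank] * np.sqrt(s[:rank])   # split singular values between A and B
        b = vt[:rank].T * np.sqrt(s[:rank])
    return q, a, b

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))             # plays the role of the full precision W
q, a, b = loftq_init(w, rank=8)

err_quant_only = np.linalg.norm(w - fake_quantize(w))
err_loftq = np.linalg.norm(w - q - a @ b.T)
print(err_quant_only, err_loftq)              # the second value should be smaller
```

The sketch only makes sense when w is the full precision matrix. For comparison, the log above reports a weight of shape (4194304, 1), which is consistent with the packed 4-bit storage of a 4096 x 4096 matrix in bfloat16 containers rather than with a full precision $W$.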
It is also worth noting that even if you change the model to full precision, unfortunately, you still can't do it on the fly, because get_peft_model(model, config) returns a quantization-equivalent full precision model (aka a fake quantized model). That's why we recommend applying LoftQ first and then loading the fake quantized model with bnb to turn it into a real quantized model.
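As a rough illustration of the difference between a fake quantized and a real quantized weight, here is a small pure-Python sketch. It assumes a simple uniform absmax quantizer instead of the NF4 scheme used by bitsandbytes, and quantize and dequantize are hypothetical helper names, not library functions.

```python
def quantize(weights, bits=4):
    """Real quantization: small integer codes plus one float scale."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to floats on the quantization grid."""
    return [c * scale for c in codes]

weights = [0.91, -0.37, 0.12, -0.88, 0.55]

# Fake quantized: still stored as floats, but every value sits on the grid.
codes, scale = quantize(weights)
fake_quantized = dequantize(codes, scale)

# Quantizing the fake quantized weights again recovers the same integer codes,
# so a later real quantization step does not add new rounding error.
codes_again, scale_again = quantize(fake_quantized)
print(codes, codes_again)
```

In this toy setting the fake quantized floats take as much memory as the original weights; the saving only appears once they are stored as integer codes, which is the role the bnb loading step plays in the flow described above.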
Thanks! I will change my current flow and give it a try. (Sorry for hijacking the multi-GPU thread... anyways)
When I set:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
it raises this error:
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [42,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
return (element == self).any().item() # type: ignore[union-attr]
RuntimeError: CUDA error: device-side assert triggered
How can I do this?