loftQ can not use multi gpu to train

WanBenLe commented 8 months ago

When I set

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

training raises this error:

    ../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [42,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.

    return (element == self).any().item() # type: ignore[union-attr]
    RuntimeError: CUDA error: device-side assert triggered

How can I train on multiple GPUs?

yxli2123 commented 8 months ago

Which script are you running?

WanBenLe commented 8 months ago

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --multi_gpu --num_processes=4 --debug './~.py'

train_gsm8k.py will raise the same error.

yxli2123 commented 8 months ago

Could you provide the full training command? Unfortunately, multi-GPU training for quantized models is not supported yet, because we use bitsandbytes quantization, which doesn't support it. So one can only train a full-precision model on multiple GPUs. To do so, it is important to enable --full_precision. (I have changed the explanation of this argument. It was wrong.)

We provide example training scripts here. For your case,

# train 4-bit 64-rank llama-2-7b with LoftQ on GSM8K using 8 A100s
accelerate launch train_gsm8k.py \
  --full_precision \
  --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
  --learning_rate 3e-4 \
  --seed 11 \
  --expt_name gsm8k_llama2_7b_4bit_64rank_loftq_fake \
  --output_dir exp_results/ \
  --num_train_epochs 6 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 1 \
  --evaluation_strategy "no" \
  --save_strategy "epoch" \
  --weight_decay 0.1 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 10 \
  --do_train \
  --report_to tensorboard

WanBenLe commented 8 months ago

Well, thanks for your help. Best wishes.

skyshine102 commented 4 months ago

Now that QLoRA can be used with FSDP/DeepSpeed ZeRO, I was wondering whether LoftQ can be combined with them.

I set the BnB config as recommended by https://huggingface.co/docs/peft/main/en/accelerate/deepspeed#use-peft-qlora-and-deepspeed-with-zero3-for-finetuning-large-models-on-multiple-gpus, and the program hangs.

    import torch
    from peft import LoftQConfig, LoraConfig, get_peft_model
    from transformers import BitsAndBytesConfig, LlamaForCausalLM

    # cfg, config (the model config) and attn_implementation are defined elsewhere in the script.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        # For FSDP / DeepSpeed ZeRO: torch_dtype for the model must match
        # the bnb_4bit_quant_storage data type.
        bnb_4bit_quant_storage=torch.bfloat16,
    )
    # Load the base model already quantized to 4 bits.
    model = LlamaForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,
        config=config,
        attn_implementation=attn_implementation,
    )
    # LoRA config that requests LoftQ initialization of the adapters.
    config = LoraConfig(
        r=cfg.training.lora_config.lora_r,
        lora_alpha=cfg.training.lora_config.lora_alpha,
        target_modules=cfg.training.lora_config.lora_target_modules,
        lora_dropout=cfg.training.lora_config.lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        init_lora_weights="loftq",
        loftq_config=LoftQConfig(
            loftq_bits=4,
            loftq_iter=1,
        ),
    )
    model = get_peft_model(model, config)  # hangs here

Log:

Weight: (4194304, 1)  | Rank: 64 | Number Iter: 1 |  Num Bits: 4
....
(Then stuck at initializing peft model...)

I'm using peft==0.11.1, bnb==0.43.1.
I'm not sure whether the weight shape is expected.
I wondered whether this is caused by the bnb_4bit_quant_storage=torch.bfloat16 and bnb_4bit_use_double_quant=True arguments, but even with both turned off I still cannot make it work.

If you have any feedback please let me know :(

yxli2123 commented 4 months ago

Could you provide what the value of cfg.base_model is?

If it is a model from the LoftQ HuggingFace repo, the problem could be the way QLoRA with FSDP is implemented. Chances are that it shards the weight first and then quantizes the sharded weight. However, the checkpoints in the LoftQ HuggingFace repo are already quantized, so sharding the quantized weight may fail.

If it is a model you obtained with quantized_save.py in this repo, it should follow the same logic as QLoRA and shouldn't cause any problem.

Please let me know which case you are in.

skyshine102 commented 4 months ago

Thank you for your prompt reply.
Sorry, I am in neither of these two cases. I was trying to initialize the LoRA weights with LoftQ for the original Llama 2 base model, and I would like to do it on the fly if possible. I have updated my previous post with the full code snippet showing where I am stuck. (I know this is not the recommended flow, but I don't understand why, other than the latency problem.)

yxli2123 commented 4 months ago

LoftQ obtains the quantized weight $Q$ and LoRA adapters $A, B$ by minimizing $||W - Q - AB^{\top}||$, where $W$ is the full-precision weight. When you call model = get_peft_model(model, config), we require the model to be in full precision, but the model in your code is already quantized. The algorithm treats the quantized weight as the full-precision weight $W$ and therefore fails.
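
To make the objective above concrete, here is a minimal NumPy sketch of the alternating scheme: quantize $W - AB^{\top}$, then fit $A, B$ with a rank-$r$ SVD of the residual $W - Q$. It is only an illustration under stated assumptions, not the PEFT or LoftQ repo implementation: fake_quantize and loftq_init are hypothetical names, and the uniform round-to-nearest quantizer is a stand-in for NF4.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Uniform round-to-nearest quantizer (assumed stand-in for NF4).

    Returns dequantized floats, so the result has the same shape and dtype as w.
    """
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((w - lo) / scale)  # integer codes in [0, levels]
    return codes * scale + lo

def loftq_init(w, rank, bits=4, num_iter=1):
    """Alternate Q = quant(W - A B^T) and (A, B) = rank-r SVD of (W - Q)."""
    a = np.zeros((w.shape[0], rank))
    b = np.zeros((w.shape[1], rank))
    for _ in range(num_iter):
        q = fake_quantize(w - a @ b.T, bits)
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        # Split the singular values evenly between the two adapters.
        a = u[:, :rank] * np.sqrt(s[:rank])
        b = vt[:rank].T * np.sqrt(s[:rank])
    return q, a, b

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 48))  # plays the role of a full-precision weight
q, a, b = loftq_init(w, rank=8)

err_quant_only = np.linalg.norm(w - fake_quantize(w))
err_loftq = np.linalg.norm(w - q - a @ b.T)
print(err_loftq < err_quant_only)
```

With a single iteration, the adapters absorb the top singular directions of the quantization residual, so the approximation error cannot exceed that of quantization alone. The scheme only makes sense when w holds real full-precision values, which is why a weight that bitsandbytes has already packed is not a valid input.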

It is also worth noting that even if you change the model to full precision, you unfortunately still can't do it on the fly, because get_peft_model(model, config) returns a quantization-equivalent full-precision model (a.k.a. a fake quantized model). That's why we recommend applying LoftQ first and then loading the fake quantized model with bnb to turn it into a real quantized model.
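
The difference between a fake quantized and a real quantized weight can be sketched with NumPy. This is a toy illustration rather than the bitsandbytes storage format: quantize_codes is a hypothetical helper, the uniform quantizer stands in for NF4, and packing two 4-bit codes per byte into a single column is an assumption about the layout.

```python
import numpy as np

def quantize_codes(w, bits=4):
    """Return integer codes plus the (offset, scale) needed to dequantize."""
    levels = 2 ** bits - 1
    lo = float(w.min())
    scale = (float(w.max()) - lo) / levels or 1.0
    codes = np.round((w - lo) / scale).astype(np.uint8)
    return codes, lo, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6)).astype(np.float32)
codes, lo, scale = quantize_codes(w)

# Fake quantized: still a float32 matrix of the original shape,
# but holding at most 16 distinct values.
fake = (codes * scale + lo).astype(np.float32)

# Real quantized (assumed layout): two 4-bit codes packed per byte,
# stored as a single column.
flat = codes.reshape(-1)
packed = ((flat[0::2] << 4) | flat[1::2]).reshape(-1, 1)

print(fake.shape, fake.dtype, len(np.unique(fake)))
print(packed.shape, packed.dtype)

# Same arithmetic for a 4096 x 4096 layer: two codes per byte, with the
# bytes viewed as 2-byte elements (assumption: bfloat16 quant storage).
print(4096 * 4096 // 2 // 2)
```

Under these assumptions, the last line gives 4194304, which would be consistent with the (4194304, 1) shape in the log above if that weight was a 4096 x 4096 projection of Llama-2-7b. In that reading, the logged tensor is packed storage and not a weight matrix that LoftQ could decompose.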

skyshine102 commented 4 months ago

Thanks! I will change my current flow and give it a try. (Sorry for hijacking the multi-GPU thread... anyways)

yxli2123 / LoftQ

loftQ can not use multi gpu to train #17