Closed Abolfazl-kr closed 8 months ago
The combination of 4-bit quantization, fp16, and LoRA may be the reason for this phenomenon. Nothing is more effective than switching to bf16. Alternatively, you could try increasing total_batch_size to alleviate the problem.
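As a rough illustration of the batch-size suggestion, the sketch below assumes that total_batch_size means per-device batch size times number of GPUs times gradient accumulation steps. That is the usual convention, not something stated in this thread, and the function name is hypothetical. The numbers come from the launch command in the issue.

```python
def total_batch_size(per_device: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Sequences contributing to one optimizer step (assumed definition)."""
    return per_device * num_gpus * grad_accum_steps


# Values from the issue's command: 8 per device, 4 GPUs, accumulation steps 1.
current = total_batch_size(8, 4, 1)

# Raising only the accumulation steps grows the total batch while the
# per-device micro-batch stays at 8.
larger = total_batch_size(8, 4, 8)

block_size = 128  # --block_size from the issue's command
print(current, larger)                            # 32 256
print(current * block_size, larger * block_size)  # tokens per step: 4096 32768
```

When a larger --per_device_train_batch_size runs out of memory, raising --gradient_accumulation_steps is usually the cheaper way to increase the total batch, because only one micro-batch is processed at a time.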
thank you so much @iMountTai ! I will try it right now.
I couldn't increase batch_size because of OOM. Now I'm trying mixed precision or fp32; maybe that will work.
@Abolfazl-kr may I ask how you solved this issue?
@hank0316 unfortunately, I just changed my GPU cards. With A5000 cards (24 GB), the training time decreased and I could control GPU usage.
Check before submitting issues
Type of Issue
Model training and fine-tuning
Base Model
Chinese-LLaMA-2 (7B/13B)
Operating System
Linux
Describe your issue in detail
I am experiencing a problem with the training loss of my model and would like help resolving it. I'm training Llama 2 on another language and ran into loss scale overflow. I use four T4 GPUs (16 GB VRAM each) in a distributed setup with fp16 (I couldn't use bf16 because of the T4).
Specifically, when I set loss_scale=0 (as in the following DeepSpeed config), the loss scale overflowed in DeepSpeed and I got this error: "FloatingPointError: Minimum loss scale reached"
The training loss came down to about 4 within roughly 4500 steps (the maximum was 42127 steps; I was training on 1 GB of text with run_clm_pt_with_peft). After 4500 steps it increased very significantly, up to about 400, and the model broke down.
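The failure mode described above can be sketched with a toy dynamic loss scaler. This is a simplified illustration, not DeepSpeed's implementation: the class name, its defaults, and the exact rules (halve on overflow, double after a window of clean steps) are assumptions chosen for clarity, and the error message simply mirrors the one quoted in the issue.

```python
class ToyDynamicLossScaler:
    """Simplified dynamic loss scaling (illustrative, not DeepSpeed's code)."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, window=1000, min_scale=1.0):
        self.scale = init_scale
        self.factor = factor
        self.window = window        # clean steps required before growing the scale
        self.min_scale = min_scale
        self.good_steps = 0

    def update(self, overflow: bool) -> None:
        if overflow:
            # Gradients contained inf/NaN: the step is skipped and the scale shrinks.
            if self.scale <= self.min_scale:
                raise FloatingPointError("Minimum loss scale reached")
            self.scale = max(self.scale / self.factor, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.window == 0:
                self.scale *= self.factor


def overflows_survived(scaler: ToyDynamicLossScaler) -> int:
    """Count consecutive overflowing steps the scaler absorbs before it gives up."""
    count = 0
    try:
        while True:
            scaler.update(overflow=True)
            count += 1
    except FloatingPointError:
        return count


# Starting at 2**16 and halving each time, 16 overflows bring the scale down
# to 1.0; the next overflow raises the error.
print(overflows_survived(ToyDynamicLossScaler()))  # 16
```

Under these assumptions, the error means that every recent step overflowed even at the smallest allowed scale, which points to the loss or gradients themselves diverging rather than to a badly chosen scale.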
I tried to solve this issue but could not. When I set loss_scale to 1 in the DeepSpeed config, the model started training, but the loss went to 0. I would appreciate any guidance on how to resolve this.
First, could you explain the effect of loss_scale? From what I found, setting it to 0 gives dynamic loss scaling, and setting it to a number gives a fixed scale, but I didn't understand what that means.
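To make the question about loss_scale concrete, here is a minimal sketch of why fp16 training multiplies the loss by a scale factor at all. It uses only Python's struct module to round numbers to half precision; the helper names and the example gradient values are made up for illustration and are not taken from the training run.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_overflows(x: float) -> bool:
    """True if x is too large to be stored as a finite fp16 value."""
    try:
        struct.pack("e", x)
        return False
    except OverflowError:
        return True


tiny_grad = 1e-8        # hypothetical small gradient value
scale = 2.0 ** 16       # a typical starting loss scale

# Without scaling, the gradient is below fp16's smallest subnormal and is lost.
print(to_fp16(tiny_grad))            # 0.0

# Scaling the loss scales every gradient, so the value survives in fp16;
# dividing by the scale in higher precision recovers it approximately.
recovered = to_fp16(tiny_grad * scale) / scale
print(abs(recovered - tiny_grad) / tiny_grad < 1e-3)   # True

# The same scale pushes a large gradient past fp16's maximum (65504).
print(fp16_overflows(2.0 * scale))   # True
```

A fixed scale has to avoid both failures for the whole run. A dynamic scale (loss_scale set to 0 in DeepSpeed) starts high and is lowered whenever an overflow is detected. With a fixed scale of 1 there is effectively no scaling, so small gradients can vanish as in the first print; whether that explains the loss of 0 reported here is not established in this thread.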
Name: peft Version: 0.5.0
Name: torch Version: 2.0.1
cuda 11.8
python version: 3.10.13
Name: transformers Version: 4.35.0.dev0
torchrun --nnodes 1 --nproc_per_node 4 run_clm_pt_with_peft.py \
    --deepspeed ds_zero2_no_offload.json \
    --model_name_or_path /home/hadoop/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/ \
    --tokenizer_name_or_path /home/hadoop/abolfazl/Chinese-LLaMA-Alpaca-2/scripts/tokenizer/merged_tokenizer_hf \
    --dataset_dir /home/hadoop/abolfazl/parvin2 \
    --data_cache_dir /home/hadoop/abolfazl/Chinese-LLaMA-Alpaca-2/scripts/training/cache \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size 8 \
    --do_train \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate 2e-4 \
    --warmup_ratio 0.001 \
    --weight_decay 0.001 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 1000 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 8 \
    --block_size 128 \
    --output_dir /home/hadoop/abolfazl/Chinese-LLaMA-Alpaca-2/out_pt_secondtry \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank 64 \
    --lora_alpha 16 \
    --trainable "q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj" \
    --lora_dropout 0.05 \
    --modules_to_save "embed_tokens,lm_head" \
    --torch_dtype float16 \
    --load_in_kbits 4 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
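The contents of ds_zero2_no_offload.json are not shown in the issue. For orientation only, the sketch below builds the kind of fp16 section that DeepSpeed's configuration format accepts. The keys are DeepSpeed's documented fp16 options, but the values are illustrative and should not be read as the settings used in this run.

```python
import json

# Illustrative fp16 section of a DeepSpeed config (values are assumptions).
ds_fp16_section = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,             # 0 selects dynamic loss scaling; a nonzero value is a fixed scale
        "initial_scale_power": 16,   # dynamic scaling starts at 2 ** 16
        "loss_scale_window": 1000,   # overflow-free steps before the scale is raised
        "hysteresis": 2,             # delay before the scale is lowered after overflows
        "min_loss_scale": 1,         # lower bound for the dynamic scale
    }
}

initial_scale = 2 ** ds_fp16_section["fp16"]["initial_scale_power"]
print(initial_scale)                          # 65536
print(json.dumps(ds_fp16_section, indent=2))  # JSON text to merge into the config file
```

Setting loss_scale to 1, as tried in the issue, would replace the dynamic behaviour with a fixed scale of 1.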