ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 LLMs (phase 2 project) with 64K long-context models
Apache License 2.0

loss comes to 0 #443

Closed Abolfazl-kr closed 8 months ago

Abolfazl-kr commented 9 months ago

Check before submitting issues

Type of Issue

Model training and fine-tuning

Base Model

Chinese-LLaMA-2 (7B/13B)

Operating System

Linux

Describe your issue in detail

I am experiencing an issue with the training loss in my deep learning model and would like to ask for help in resolving it. I am training Llama-2 on another language and ran into a loss overflow problem. I use four distributed T4 GPUs with 16 GB of VRAM each and train in fp16 (I could not use bf16 because of the T4s).

Specifically, with loss_scale=0 (as in the following DeepSpeed config), the loss scale overflowed in DeepSpeed and the run failed with: "FloatingPointError: Minimum loss scale reached".

    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 100,
        "initial_scale_power": 16,
        "hysteresis": 1,
        "min_loss_scale": 1e-10
    },

The training loss came down to about 4 in roughly 4500 steps (the max step was 42127, and I was training on 1 GB of text with run_clm_pt_with_peft). After step 4500 it increased very significantly, causing the model to break down:

    {
      "epoch": 0.11,
      "learning_rate": 0.0001945353187718296,
      "loss": 4.0682,
      "step": 4540
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019451583018681562,
      "loss": 4.0441,
      "step": 4550
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019449386523582728,
      "loss": 4.6709,
      "step": 4560
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019446940970948925,
      "loss": 4.9938,
      "step": 4570
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019444490153817877,
      "loss": 5.1778,
      "step": 4580
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019442034073555343,
      "loss": 5.5932,
      "step": 4590
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019439819102472819,
      "loss": 6.5669,
      "step": 4600
    },
.
.
.

(the loss eventually rose to around 400)
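
For reference, a check along these lines shows which parameters first produce non-finite gradients once fp16 starts to overflow (a minimal sketch for a plain PyTorch training loop; under DeepSpeed's dynamic scaling, occasional overflow of the scaled gradients is expected and is handled by lowering the scale):

    import torch

    def add_overflow_hooks(model, logger=print):
        """Log which parameters produce non-finite (inf/NaN) gradients."""
        def make_hook(name):
            def hook(grad):
                if not torch.isfinite(grad).all():
                    logger(f"non-finite gradient in {name}")
                return grad
            return hook
        for name, param in model.named_parameters():
            if param.requires_grad:
                param.register_hook(make_hook(name))

    # usage: call add_overflow_hooks(model) once, before the training loop starts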

I tried to solve this issue but could not. However, when I set loss_scale to 1 in the DeepSpeed config, the model started training but the loss went to 0. I would appreciate any guidance on how to resolve this.

    "fp16": {
        "enabled": "auto",
        "loss_scale": 1,
        "loss_scale_window": 100,
        "initial_scale_power": 16,
        "hysteresis": 1,
        "min_loss_scale": 1e-10
    },

First, could you explain the effect of loss_scale? From what I found, setting it to 0 gives dynamic loss scaling and setting it to a fixed number does not, but I did not fully understand what that means in practice (my rough understanding is sketched after the log below).

    {
      "epoch": 0.07,
      "learning_rate": 0.0001987099956602297,
      "loss": 1.9179,
      "step": 2810
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.0001987099956602297,
      "loss": 1.9276,
      "step": 2820
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 2.8882,
      "step": 2830
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 0.0,
      "step": 2840
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 0.0,
      "step": 2850
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 0.0,
      "step": 2860
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 0.1851,
      "step": 2870
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 0.0,
      "step": 2880
    },
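
My rough understanding so far, which I would like to confirm, is this: the loss is multiplied by the scale before backward so that small fp16 gradients do not underflow to zero, and the gradients are divided by the same scale before the optimizer step. With loss_scale=0 the scale adapts (halved on overflow, doubled after loss_scale_window clean steps, aborting once it cannot back off below min_loss_scale), while a fixed value such as 1 never changes, so the gradients get no protection against underflow. A simplified sketch of that idea (not DeepSpeed's actual implementation):

    class SimpleLossScaler:
        """Simplified illustration of fp16 loss scaling (not DeepSpeed's actual code)."""

        def __init__(self, loss_scale=0, initial_scale_power=16,
                     loss_scale_window=100, min_loss_scale=1.0):
            self.dynamic = (loss_scale == 0)   # 0 means "dynamic" in the DeepSpeed config
            self.scale = 2.0 ** initial_scale_power if self.dynamic else float(loss_scale)
            self.window = loss_scale_window
            self.min_scale = min_loss_scale
            self.good_steps = 0

        def update(self, overflow: bool):
            if not self.dynamic:
                return                          # fixed scale: never changes
            if overflow:
                self.good_steps = 0
                if self.scale / 2 < self.min_scale:
                    # roughly where DeepSpeed gives up and raises the error I saw
                    raise FloatingPointError("Minimum loss scale reached")
                self.scale /= 2                 # back off after an overflow
            else:
                self.good_steps += 1
                if self.good_steps >= self.window:
                    self.scale *= 2             # grow again after a clean window
                    self.good_steps = 0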

### Dependencies (must be provided for code-related issues)

- peft 0.5.0
- torch 2.0.1
- CUDA 11.8
- Python 3.10.13
- transformers 4.35.0.dev0


### Execution logs or screenshots

    torchrun --nnodes 1 --nproc_per_node 4 run_clm_pt_with_peft.py \
        --deepspeed ds_zero2_no_offload.json \
        --model_name_or_path /home/hadoop/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/ \
        --tokenizer_name_or_path /home/hadoop/abolfazl/Chinese-LLaMA-Alpaca-2/scripts/tokenizer/merged_tokenizer_hf \
        --dataset_dir /home/hadoop/abolfazl/parvin2 \
        --data_cache_dir /home/hadoop/abolfazl/Chinese-LLaMA-Alpaca-2/scripts/training/cache \
        --validation_split_percentage 0.001 \
        --per_device_train_batch_size 8 \
        --do_train \
        --seed $RANDOM \
        --fp16 \
        --num_train_epochs 1 \
        --lr_scheduler_type cosine \
        --learning_rate 2e-4 \
        --warmup_ratio 0.001 \
        --weight_decay 0.001 \
        --logging_strategy steps \
        --logging_steps 10 \
        --save_strategy steps \
        --save_total_limit 3 \
        --save_steps 1000 \
        --gradient_accumulation_steps 1 \
        --preprocessing_num_workers 8 \
        --block_size 128 \
        --output_dir /home/hadoop/abolfazl/Chinese-LLaMA-Alpaca-2/out_pt_secondtry \
        --overwrite_output_dir \
        --ddp_timeout 30000 \
        --logging_first_step True \
        --lora_rank 64 \
        --lora_alpha 16 \
        --trainable "q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj" \
        --lora_dropout 0.05 \
        --modules_to_save "embed_tokens,lm_head" \
        --torch_dtype float16 \
        --load_in_kbits 4 \
        --gradient_checkpointing \
        --ddp_find_unused_parameters False

iMountTai commented 9 months ago

The combination of 4-bit quantization, fp16, and LoRA may be the reason for this phenomenon. There is nothing more effective than bf16. Alternatively, you could try increasing total_batch_size to alleviate the problem.
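
For example (a rough sketch using the standard transformers.TrainingArguments flags rather than this repo's script; the exact numbers are placeholders), gradient accumulation raises the effective batch size without needing more per-GPU memory, and bf16 can be enabled only where the hardware supports it:

    import torch
    from transformers import TrainingArguments

    # T4 (Turing) reports False here; Ampere and newer GPUs report True.
    use_bf16 = torch.cuda.is_bf16_supported()

    args = TrainingArguments(
        output_dir="out_pt",                  # placeholder path
        per_device_train_batch_size=4,        # smaller per-device batch to avoid OOM
        gradient_accumulation_steps=8,        # effective batch = 4 GPUs * 4 * 8 = 128
        bf16=use_bf16,                        # prefer bf16 when the hardware supports it
        fp16=not use_bf16,                    # otherwise fall back to fp16 with loss scaling
    )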

Abolfazl-kr commented 9 months ago

Thank you so much, @iMountTai! I will try it right now.

Abolfazl-kr commented 9 months ago

I couldn't increase the batch size because of OOM. Now I'm trying mixed precision or fp32; maybe that will work.

hank0316 commented 4 months ago

> I couldn't increase the batch size because of OOM. Now I'm trying mixed precision or fp32; maybe that will work.

@Abolfazl-kr May I ask how you solved this issue?

Abolfazl-kr commented 4 months ago

@hank0316 Unfortunately, I changed my GPU cards. With an A5000 (24 GB), training time decreased and I could keep GPU usage under control.