zhengzangw / Sequence-Scheduling

PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".

Gradient explosion or vanishing during training #2

Closed · qxpBlog closed this issue 5 months ago

qxpBlog commented 5 months ago

@zhengzangw When I use the dataset you provided to train my model, I run into gradient explosion or vanishing during training. What could be the reason for this?

HF_ENDPOINT=$HF_ENDPOINT CUDA_VISIBLE_DEVICES=1 python -m src.train \
   --model_name_or_path pytorch_model.bin \
   --data_path ./data/alpaca-train-10k-instruct.json \
   --output_dir ./ckpts/llama_pruned \
   --bf16 True \
   --tf32 True \
   --evaluation_strategy "no" \
   --lazy_preprocess True \
   --save_strategy "steps" \
   --save_steps 100 \
   --save_total_limit 2 \
   --logging_steps 1 \
   --num_train_epochs 3 \
   --per_device_train_batch_size 2 \
   --gradient_accumulation_steps 16 \
   --learning_rate 2e-5 \
   --weight_decay 0. \
   --warmup_ratio 0.03 \
   --lr_scheduler_type "cosine"

The problem is shown below: [screenshot of the training log] You can see that at the very beginning of training the progress has already reached 98%, and the loss and gradient are 0.
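
A quick first check, before touching the training loop: scan the pruned checkpoint itself for non-finite values, since a single NaN/Inf weight is enough to break the loss and grad norm from step one. This is a minimal sketch that assumes pytorch_model.bin was saved with torch.save, either as a bare state dict or as a full model object:

import torch

ckpt = torch.load("pytorch_model.bin", map_location="cpu")
# Some pruning tools wrap the model in a dict; unwrap if needed.
if isinstance(ckpt, dict) and "model" in ckpt:
    ckpt = ckpt["model"]
state_dict = ckpt.state_dict() if hasattr(ckpt, "state_dict") else ckpt

for name, tensor in state_dict.items():
    if torch.is_tensor(tensor) and not torch.isfinite(tensor).all():
        print(f"non-finite values in parameter: {name}")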

zhengzangw commented 5 months ago

This is very strange; we never encountered this situation. Do the loss becoming zero and the grad norm becoming NaN happen at the beginning of the experiments, or when do they happen?

qxpBlog commented 5 months ago

> This is very strange; we never encountered this situation. Do the loss becoming zero and the grad norm becoming NaN happen at the beginning of the experiments, or when do they happen?

At the beginning of the training: the progress already starts at 98%. The loss becoming zero and the grad norm becoming NaN happen right from the start of the experiment.

qxpBlog commented 5 months ago

> This is very strange; we never encountered this situation. Do the loss becoming zero and the grad norm becoming NaN happen at the beginning of the experiments, or when do they happen?

Now I know why my fine-tuning progress started at 98%: the run resumed from my previous fine-tuning checkpoints. But I still haven't solved the gradient vanishing and gradient explosion problems.
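
For reference, FastChat-style training scripts (which this repo's src/train.py resembles; that is an assumption on my part) auto-resume whenever checkpoint-* folders already exist in output_dir, which is exactly why the progress bar starts near 98%. A fresh run needs an empty output directory; a small check:

import pathlib

output_dir = pathlib.Path("./ckpts/llama_pruned")  # path from the command above
ckpts = sorted(output_dir.glob("checkpoint-*"))
if ckpts:
    print("existing checkpoints found, training would resume:", ckpts)
else:
    print("no checkpoints found, training starts from scratch")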

qxpBlog commented 5 months ago

> This is very strange; we never encountered this situation. Do the loss becoming zero and the grad norm becoming NaN happen at the beginning of the experiments, or when do they happen?

Could it be that a pruned model simply cannot be fine-tuned? Because what I want to fine-tune is my pruned model.

qxpBlog commented 5 months ago

> This is very strange; we never encountered this situation. Do the loss becoming zero and the grad norm becoming NaN happen at the beginning of the experiments, or when do they happen?

[screenshot of the training log]

zhengzangw commented 5 months ago

Pruned model? Do you mean you pruned the model for inference? In that case, I think maybe a pruned model cannot be fine-tuned.

qxpBlog commented 5 months ago

> Pruned model? Do you mean you pruned the model for inference? In that case, I think maybe a pruned model cannot be fine-tuned.

Yes, I used LLM-Pruner to prune Llama-7B, and then I would like to use your code to fine-tune it so that it can better predict the number of generated tokens.
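
One thing worth double-checking here: LLM-Pruner changes layer shapes, so the pruned weights no longer match the stock LLaMA config, and passing a bare pytorch_model.bin to a from_pretrained-style loader can silently mis-load parameters. If the checkpoint was saved the way LLM-Pruner's examples do (a full model object serialized with torch.save; again an assumption on my part), loading the module directly is safer:

import torch

ckpt = torch.load("pytorch_model.bin", map_location="cpu")
# Unwrap the serialized module; LLM-Pruner examples save a dict with a 'model' key.
model = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
print(type(model))  # expect the pruned LlamaForCausalLM with its reduced shapes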

qxpBlog commented 5 months ago

> Pruned model? Do you mean you pruned the model for inference? In that case, I think maybe a pruned model cannot be fine-tuned.

But LLM-Pruner itself can fine-tune the pruned model; I have tried that before, and it is feasible.

qxpBlog commented 5 months ago

> Pruned model? Do you mean you pruned the model for inference? In that case, I think maybe a pruned model cannot be fine-tuned.

I tried to use the pruned model directly in the length-prediction stage, without fine-tuning it for length perception. Although the prompt asked it to generate only the predicted number of tokens, the model generated actual response content rather than a token count. [screenshot of the model output]
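
For what it's worth, that matches the paper's motivation: without instruction tuning for length perception, the base model just answers the instruction. Below is an illustrative perception-in-advance style prompt; the wording is my paraphrase, not necessarily the exact template used in this repo:

instruction = "Give three tips for staying healthy."  # example instruction
prompt = (
    f"{instruction}\n\n"
    "Don't output the response for the above instruction. "
    "Instead, predict the number of tokens in your response. "
    "Output one number only."
)
print(prompt)  # an un-tuned 7B model will typically answer the instruction anyway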

zhengzangw commented 5 months ago

The 7B model does not have a strong ability to follow instructions, which is why we need to fine-tune the model for length prediction.

qxpBlog commented 5 months ago

> The 7B model does not have a strong ability to follow instructions, which is why we need to fine-tune the model for length prediction.

Yes, I noticed that if the 7B model is not fine-tuned, it cannot predict the number of tokens. But when I was fine-tuning the 7B pruned model, the loss was 0 and grad_norm was NaN. Could this be because the pruned model cannot learn to predict the number of tokens during fine-tuning?
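
Whatever the root cause turns out to be (corrupted pruned weights, data, or precision), a per-parameter gradient hook can localize where the non-finite values first appear. This is a generic PyTorch diagnostic sketch, independent of this repo's training code:

import torch

def add_nan_grad_hooks(model: torch.nn.Module) -> None:
    """Print the name of every parameter that receives a non-finite gradient."""
    for name, param in model.named_parameters():
        if param.requires_grad:
            param.register_hook(
                lambda grad, name=name: print(f"non-finite grad in {name}")
                if not torch.isfinite(grad).all()
                else None
            )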