Closed: qxpBlog closed this issue 5 months ago
This is very strange, as we have never run into this situation. Do the vanishing loss and NaN grad norm appear at the beginning of the experiments, or when do they happen?
At the beginning of training. The progress bar already shows 98%; that is, training starts at 98% progress. The vanishing loss and NaN grad norm happen right at the beginning of the experiments.
Now I know why my fine-tuning progress started at 98%: I resumed from my previous fine-tuning run. But I still haven't solved the vanishing/exploding gradient problem.
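The 98% starting point is explained by checkpoint resumption: a trainer's progress bar is typically `global_step / max_steps`, and resuming restores `global_step` from the checkpoint, so training appears to start nearly finished. A minimal pure-Python sketch of that arithmetic (the function name is hypothetical, not from either codebase):

```python
# Why a resumed run shows ~98% progress from the first step:
# the progress bar is global_step / max_steps, and resuming
# restores global_step from the saved checkpoint.

def progress_after_resume(checkpoint_step: int, max_steps: int) -> float:
    """Fraction of training already 'done' when resuming from checkpoint_step."""
    return checkpoint_step / max_steps

# Resuming a run that had previously reached step 980 of 1000:
print(f"{progress_after_resume(980, 1000):.0%}")  # 98%
```

With a Hugging Face-style trainer, starting a genuinely fresh run usually means pointing `output_dir` at an empty directory (or deleting the old checkpoints) rather than resuming from the last checkpoint.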
Could it be that it is impossible to fine-tune a pruned model? Because I want to fine-tune my pruned model.
Pruned model? Do you mean you pruned the model for inference? In that case, a pruned model may not be fine-tunable.
Yes, I used the LLM-Pruner method to prune Llama-7B, and then I would like to use your code to fine-tune it so that it can better predict the number of generated tokens.
But LLM-Pruner itself can fine-tune the pruned model; I have tried that before, and it works.
I tried using the pruned model directly, without fine-tuning it for length perception, and then ran the length-prediction stage. Although the prompt instructs it to output only the predicted number of tokens, the model generates actual content instead of predicting the token count.
A 7B model does not have a strong ability to follow instructions, which is why we need to fine-tune the model for length prediction.
Yes, I noticed that without using the 7B model, it cannot predict the number of tokens. But when I fine-tuned the pruned 7B model, the loss was 0 and `grad_norm` was NaN. Could this be because the pruned model cannot learn to predict the number of tokens during fine-tuning?
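A zero loss together with a NaN grad norm usually means some gradients have gone to NaN/Inf (one NaN poisons the whole norm) or have vanished entirely. A minimal pure-Python sketch of a per-parameter check (the helper name and the plain-list gradient format are hypothetical; with PyTorch you would iterate `model.named_parameters()` and inspect `p.grad` after `loss.backward()` instead):

```python
import math

def diagnose_grads(named_grads):
    """Flag parameters whose gradients contain NaN/Inf or are all zero.

    named_grads: iterable of (name, list of float gradient values).
    Returns a dict mapping the parameter name to the problem found.
    """
    report = {}
    for name, grad in named_grads:
        if any(math.isnan(g) or math.isinf(g) for g in grad):
            report[name] = "nan/inf"        # exploded / poisoned gradient
        elif all(g == 0.0 for g in grad):
            report[name] = "all-zero"       # vanished gradient
    return report

print(diagnose_grads([
    ("layer1.weight", [0.0, 0.0]),           # vanished
    ("layer2.weight", [float("nan"), 1.0]),  # NaN
]))  # {'layer1.weight': 'all-zero', 'layer2.weight': 'nan/inf'}
```

Running such a check on the first training step would show whether the pruned weights themselves already produce NaNs in the forward/backward pass, or whether the problem only develops later in training.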
@zhengzangw When using the dataset you provide to train my model, gradients exploded or vanished during training. What could be the reason for this?
The problem is the following: at the very start of training, the progress bar already shows 98%, and the loss and gradient are 0.