Hello there,
First of all, thank you for the great model! I noticed something strange while fine-tuning: resuming a fine-tuning run seems to restart one epoch before the one specified.
Replicate
I fine-tuned the model using:
accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path <path/to/config.yml>
So far, everything went well, until the fine-tuning crashed (this was to be expected with the parameters I chose). The last epoch displayed before the crash was [11/100]:
At this point, since the run was in the middle of the 11th epoch, the last completed one was the 10th, saved as
epoch_2nd_00009.pth
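If I read the naming correctly (an assumption on my part; `checkpoint_name` below is a hypothetical helper, not the project's code), checkpoints use a zero-based index, so finishing the Nth displayed epoch writes epoch_2nd_{N-1}.pth:

```python
def checkpoint_name(completed_epochs: int) -> str:
    """Map a 1-based count of completed epochs to the assumed zero-based
    checkpoint filename, e.g. the 10th completed epoch -> epoch_2nd_00009.pth."""
    zero_based = completed_epochs - 1  # loop index of the epoch that just finished
    return f"epoch_2nd_{zero_based:05d}.pth"

print(checkpoint_name(10))  # epoch_2nd_00009.pth
```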
Then I went to modify the config.yml and set the following parameters:
Back in the terminal, I reran the command:
accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path <path/to/config.yml>
What is now displayed is:
I waited a while and saw that no new epoch checkpoint appeared; I assume it stored
epoch_2nd_00009.pth
again.
Conclusion
This suggests that resuming fine-tuning probably loads the right checkpoint (the one I pointed to in config.yml), but resumes under the wrong epoch number (i.e. 10 instead of 11). It might therefore also use the wrong schedule parameters (diff_epoch = 10 would then apply during [11/100], if I understand correctly).
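To make the suspected off-by-one concrete, here is a minimal sketch (with a hypothetical `parse_epoch` helper; this is not the actual training code, just my reading of the behavior):

```python
import re

def parse_epoch(ckpt: str) -> int:
    """Extract the zero-based epoch index from a checkpoint filename."""
    return int(re.search(r"epoch_2nd_(\d+)\.pth", ckpt).group(1))

ckpt = "epoch_2nd_00009.pth"   # 10th displayed epoch ([10/100]) completed
saved = parse_epoch(ckpt)      # 9, zero-based

# Expected: continue AFTER the saved epoch, i.e. display [11/100].
expected_display = saved + 2   # 11
# Suspected actual behavior: re-run the saved epoch, i.e. display [10/100].
suspected_display = saved + 1  # 10
```

If that reading is right, epoch-gated settings such as diff_epoch would be evaluated against the stale display number, which would explain the symptoms above.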