yjh0410 / PyTorch_YOLO-Family

Apache License 2.0

Resuming the network training #10

Closed kulkarnikeerti closed 2 years ago

kulkarnikeerti commented 2 years ago

@yjh0410 Sorry for these many clarifications. I was wondering what needs to be the input to the --resume to resume the model training. Is it the weights we save at the end of eval_epoch? Because, I don't see any checkpoints being saved during training.

yjh0410 commented 2 years ago

@kulkarnikeerti That's all right. If you have any problems with this project, do not hesitate to ask.

--resume is used to continue training when the training stage is interrupted unexpectedly.

After the evaluation stage, the model weights are saved whenever the current mAP is higher than the best mAP so far (default is -1).

In train.py, path_to_save is the path where the model weights are saved.
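The saving behavior described above can be sketched roughly as follows. This is an illustrative snippet, not the project's exact code; the function name and the weight-file naming scheme are assumptions:

```python
import os
import torch

def maybe_save_checkpoint(model, cur_map, best_map, path_to_save, epoch):
    """Save the model weights only when the current mAP beats the best so far.

    Returns the (possibly updated) best mAP. best_map starts at -1,
    so the first evaluation always triggers a save.
    """
    if cur_map > best_map:
        best_map = cur_map
        os.makedirs(path_to_save, exist_ok=True)
        # Hypothetical file naming; train.py may name the file differently.
        weight_file = os.path.join(path_to_save, f"epoch_{epoch}_{cur_map:.2f}.pth")
        torch.save(model.state_dict(), weight_file)
    return best_map
```

Note that only `model.state_dict()` is written out, which matters for the resume discussion below.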

kulkarnikeerti commented 2 years ago

@yjh0410 I understand that completely. What I don't understand is the --resume value. By default it's set to None. If I want to resume the model from where I left off, what should this value be? Is it the path to the model weights saved after the evaluation stage?

If yes, would it also restore the epoch, optimizer state, and other parameters from the previous training? From the code this is quite confusing to me, since it doesn't seem to save those other details.

yjh0410 commented 2 years ago

@kulkarnikeerti You can give a path to a .pth file to --resume, for example, --resume weight/coco/yolo_nano/yolo_nano.pth.

For now, it does not restore the epoch, optimizer state, or other training parameters, so it is not a complete resume. It only saves and loads the model weights.
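Since only the model weights are saved, one common way to extend this (as the questioner intends to do) is to bundle the optimizer state and epoch into a single checkpoint dict. A minimal sketch, with hypothetical function names not taken from train.py:

```python
import torch

def save_full_checkpoint(path, model, optimizer, epoch, best_map):
    # Bundle everything needed to resume training exactly where it stopped.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
        "best_map": best_map,
    }, path)

def load_full_checkpoint(path, model, optimizer):
    # map_location="cpu" keeps loading device-agnostic.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    # Resume from the epoch after the last completed one.
    return ckpt["epoch"] + 1, ckpt["best_map"]
```

The training loop would then start at the returned epoch instead of 0, and the learning-rate schedule would need to be stepped forward accordingly.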

kulkarnikeerti commented 2 years ago

@yjh0410 Okay, thanks. I will modify that according to my requirements. Thanks a lot for clarifying! :)

yjh0410 commented 2 years ago

@kulkarnikeerti You are welcome.

To be honest, when I built this project, some of my knowledge of YOLO was incomplete, so there may be errors in some details, which may lead to ambiguity. However, I don't have the spare energy to refactor this project right now, so while some models perform well, they still need to be refined.