Closed pingu2k4 closed 6 years ago
--resume path/to/xxx.pth
Sorry, not sure what that means? I added the path to the .model checkpoint file created on the first run. Is there a .pth file somewhere I should be referring to instead? If so, where would this be? The directory I gave for --save-model-dir only contains a .model file: Epoch_0iters_8000_Sat_Dec__9_18:10:43_2017_1.0_5.0.model
After resuming, it starts at Epoch 1.
Its progress is at 0 iterations into the first epoch, however, whereas when the checkpoint was saved it was 8000 iterations into the total of 123287. So when it resumes, it doesn't pick up from the number of iterations it was previously at?
That's correct. It doesn't care about epochs and iterations; it only loads the weights.
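For anyone who wants to know how far a checkpoint had got, given that resuming only restores the weights: the epoch and iteration count appear to be encoded in the checkpoint filename. Below is a minimal sketch that reads them back out. The pattern is inferred from the single filename quoted in this thread, and `parse_checkpoint_name` is a hypothetical helper, not something main.py provides.

```python
import os
import re

# Pattern inferred from the one filename quoted in this thread, e.g.
# Epoch_0iters_8000_Sat_Dec__9_18:10:43_2017_1.0_5.0.model
# This is an assumption; other checkpoints may be named differently.
_CHECKPOINT_RE = re.compile(r"^Epoch_(\d+)iters_(\d+)_")


def parse_checkpoint_name(path):
    """Return (epoch, iters) parsed from a checkpoint filename, or None."""
    name = os.path.basename(path)
    match = _CHECKPOINT_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    example = "trained_models/Epoch_0iters_8000_Sat_Dec__9_18:10:43_2017_1.0_5.0.model"
    print(parse_checkpoint_name(example))
```

This only recovers the counters for bookkeeping; it does not make training continue from that iteration.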
Ohh OK, so I could just keep coming back day after day, training a bit here and there, and it would only really ever get better? Will start doing that once I have a bunch of styles set in stone that I want to use.
Thanks for this BTW! Not really sure if it's of use, but I'm writing a shell script that allows for running multiple of these. It's kinda in progress still, but here's what I'm using: https://pastebin.com/Gi7cr25C
It does eval or optim, and has 3 modes: 121, which acts the same as normal; 12Many, which applies many styles to 1 original image; and Many2Many, which applies many styles to many original images.
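To make the three modes concrete, here is a rough Python sketch of how the (content, style) pairs could be enumerated for each one. It is only an illustration of the description above, not a port of the pastebin script; `plan_jobs` and the file names are hypothetical.

```python
from itertools import product


def plan_jobs(content_images, style_images, mode):
    """Return the list of (content, style) pairs to process for a mode."""
    if mode == "121":
        # One content image, one style image: the normal behaviour.
        if len(content_images) != 1 or len(style_images) != 1:
            raise ValueError("121 expects exactly one content and one style image")
    elif mode == "12Many":
        # Many styles applied to a single content image.
        if len(content_images) != 1:
            raise ValueError("12Many expects exactly one content image")
    elif mode != "Many2Many":
        raise ValueError("unknown mode: %r" % (mode,))
    # In every valid mode the jobs are the cross product of both lists.
    return list(product(content_images, style_images))


if __name__ == "__main__":
    for job in plan_jobs(["photo.jpg"], ["s1.jpg", "s2.jpg"], "12Many"):
        print(job)
```

Each resulting pair would then correspond to one eval or optim invocation.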
It also attempts a content size of 2048 first, and keeps trying until it finds a content size we have enough memory for.
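The retry idea can be sketched in Python roughly as follows. The step of 128 and the floor of 256 are assumptions made for illustration (the thread does not say how the size is reduced), and `run` is a hypothetical callable that raises `MemoryError` when the size is too large. In a real PyTorch run a CUDA out-of-memory failure typically shows up as a `RuntimeError`, or as a non-zero exit code when driven from a shell script.

```python
def find_workable_size(run, start=2048, step=128, minimum=256):
    """Call run(size) with decreasing sizes until one fits in memory.

    Returns the first size that succeeded, or None if none did.
    The step and minimum values are illustrative assumptions.
    """
    size = start
    while size >= minimum:
        try:
            run(size)
            return size
        except MemoryError:
            # Not enough memory at this size; try a smaller one.
            size -= step
    return None


if __name__ == "__main__":
    def fake_run(size):
        # Stand-in for the real stylization call: pretend anything
        # above 1500 pixels exhausts memory.
        if size > 1500:
            raise MemoryError("size %d too large" % size)

    print(find_workable_size(fake_run))
```

Starting large and stepping down means the first success is also the largest size tried that fits.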
Hey,
So I started training a model, but seeing how long it was going to take I wanted to double check I could successfully resume training.
I ran:
python3 main.py train --epochs 4 --style-folder images/xmas-styles/ --save-model-dir trained_models/
until it generated the first checkpoint, then I ran
python3 main.py train --epochs 4 --style-folder images/xmas-styles/ --save-model-dir trained_models/ --resume trained_models/Epoch_0iters_8000_Sat_Dec__9_18\:10\:43_2017_1.0_5.0.model
and waited for the first feedback report, which was
Sat Dec 9 18:17:09 2017 Epoch 1: [2000/123287] content: 254020.831359 style: 1666218.549250 total: 1920239.380609
so it appeared to not have resumed at all.
Also, a slight side question: say I train with --epochs 4 until I get the final model. If I were to resume from the last checkpoint before the final one, but set --epochs 5 or higher, would that work correctly and just keep going through to 5 epochs before generating another final model, without any issues?