Unable to resume training

pingu2k4 commented 6 years ago

Hey,

So I started training a model, but seeing how long it was going to take I wanted to double check I could successfully resume training.

I ran:

python3 main.py train --epochs 4 --style-folder images/xmas-styles/ --save-model-dir trained_models/

until it generated the first checkpoint, then I ran:

python3 main.py train --epochs 4 --style-folder images/xmas-styles/ --save-model-dir trained_models/ --resume trained_models/Epoch_0iters_8000_Sat_Dec__9_18\:10\:43_2017_1.0_5.0.model

and waited for the first feedback report, which was:

Sat Dec 9 18:17:09 2017 Epoch 1: [2000/123287] content: 254020.831359 style: 1666218.549250 total: 1920239.380609

So it appeared not to have resumed at all.

Also, a slight side question: say I train with --epochs 4 until I get the final model. If I were to resume from the last checkpoint before the final one, but set --epochs 5 or higher, would that work correctly and just keep going through to 5 epochs before generating another final model, without any issues?

zhanghang1989 commented 6 years ago

--resume path/to/xxx.pth

pingu2k4 commented 6 years ago

Sorry, not sure what that means? I passed the path to the .model checkpoint file created on the first run. Is there a .pth file somewhere I should be referring to instead? If so, where would it be? The directory I gave for --save-model-dir only contains a .model file: Epoch_0iters_8000_Sat_Dec__9_18:10:43_2017_1.0_5.0.model

zhanghang1989 commented 6 years ago

After resuming, it starts at Epoch 1.

pingu2k4 commented 6 years ago

Its progress is at 0 iterations into the first epoch, however, whereas when the checkpoint was saved it was 8000 iterations into the total of 123287. So when it resumes, it doesn't continue from the number of iterations it was previously at?

zhanghang1989 commented 6 years ago

That's correct. It doesn't care about epochs and iters; it only loads the weights.
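Since resuming restores only the weights, the epoch and iteration at which a checkpoint was saved survive only in its file name. Below is a minimal sketch for reading them back, assuming the naming pattern seen in this thread (Epoch_<n>iters_<m>_...). The helper parse_checkpoint_name is hypothetical and not part of the repository, and the pattern is inferred from a single file name, so it may not match other versions.

```python
import re

# Pattern inferred from the one checkpoint name quoted in this thread;
# other versions of the training script may name files differently.
_CKPT_RE = re.compile(r"Epoch_(\d+)iters_(\d+)_")


def parse_checkpoint_name(filename):
    """Return (epoch, iterations) encoded in a checkpoint name, or None."""
    match = _CKPT_RE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


name = "Epoch_0iters_8000_Sat_Dec__9_18:10:43_2017_1.0_5.0.model"
print(parse_checkpoint_name(name))
```

This only recovers the counters for your own bookkeeping; passing the file to --resume still restarts the progress display from the beginning, as described above.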

pingu2k4 commented 6 years ago

Ohh OK, so I could just keep coming back day after day, training a bit here and there, and it would only really ever get better? I'll start doing that once I have a bunch of styles set in stone that I want to use.

Thanks for this BTW! Not really sure if it's of use, but I'm writing a shell script that allows for running multiple of these. It's kinda in progress still, but here's what I'm using: https://pastebin.com/Gi7cr25C

It does eval or optim, and has 3 modes: 121, which acts the same as normal; 12Many, which applies many styles to 1 original image; and Many2Many, which applies many styles to many original images.
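The three modes can be read as three ways of pairing style images with content images. Here is a rough Python sketch of that pairing, not the linked shell script itself. The function plan_jobs and the file names are hypothetical, and treating Many2Many as "every style applied to every image" is an assumption about what the script does.

```python
from itertools import product


def plan_jobs(mode, styles, contents):
    """Return the (style, content) pairs a given mode would process."""
    if mode == "121":
        # One style applied to one content image.
        return [(styles[0], contents[0])]
    if mode == "12Many":
        # Many styles applied to a single content image.
        return [(style, contents[0]) for style in styles]
    if mode == "Many2Many":
        # Assumed here: every style applied to every content image.
        return list(product(styles, contents))
    raise ValueError("unknown mode: " + mode)


styles = ["candy.jpg", "mosaic.jpg"]
contents = ["tree.jpg", "house.jpg", "snowman.jpg"]
for style, content in plan_jobs("Many2Many", styles, contents):
    print(style, "->", content)
```

Each pair would then become one eval or optim run.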

It also attempts a content size of 2048 first, and keeps trying until it finds a content size we have enough memory for.
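The retry idea can be sketched in Python as a loop that starts at 2048 and steps down until a run succeeds. This is only an illustration of the approach: the function first_size_that_fits, the step of 128, and the lower bound of 256 are all hypothetical, and it assumes an out-of-memory failure surfaces as MemoryError. A real run on a GPU would more likely fail with a RuntimeError or a non-zero exit code, so the except clause would need adapting.

```python
def first_size_that_fits(run, start=2048, step=128, minimum=256):
    """Try run(size) from `start` downwards; return the first size that works.

    `run` is any callable that raises MemoryError when the size is too large.
    Returns None if no size down to `minimum` succeeds.
    """
    size = start
    while size >= minimum:
        try:
            run(size)
            return size
        except MemoryError:
            # Not enough memory at this size: shrink and try again.
            size -= step
    return None


def fake_run(size):
    # Stand-in for the real stylization call; pretends 1500 is the limit.
    if size > 1500:
        raise MemoryError("content size %d does not fit" % size)


print(first_size_that_fits(fake_run))
```

A fixed step keeps the example simple; halving the size on each failure would find a working size in fewer attempts, at the cost of possibly settling on a smaller size than necessary.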

zhanghang1989 / PyTorch-Multi-Style-Transfer

Unable to resume training #5