Also, one of the suggestions seems to be using a lower learning rate for the PCD module -- can you please share the code needed to do that? And what is an appropriate learning rate for the PCD module?
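For context, a per-module learning rate in PyTorch is normally set with optimizer parameter groups. Below is a minimal sketch, assuming the PCD submodule's attribute name contains `pcd_align`; the `ToyEDVR` stand-in and the `1e-4` value are illustrative assumptions, not settings from the authors:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the EDVR generator; in practice `model` is the
# real network, and 'pcd_align' should match the PCD submodule's attribute name.
class ToyEDVR(nn.Module):
    def __init__(self):
        super().__init__()
        self.pcd_align = nn.Conv2d(64, 64, 3, padding=1)
        self.reconstruction = nn.Conv2d(64, 64, 3, padding=1)

model = ToyEDVR()

# Split parameters into the PCD group vs. everything else, by name.
pcd_params, other_params = [], []
for name, param in model.named_parameters():
    (pcd_params if 'pcd_align' in name else other_params).append(param)

# Give the PCD group a smaller learning rate than the rest of the network.
optimizer = torch.optim.Adam(
    [
        {'params': other_params, 'lr': 4e-4},
        {'params': pcd_params, 'lr': 1e-4},  # assumed value; tune as needed
    ],
    betas=(0.9, 0.99),
)
```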
First of all, we find that 1) we can train 'M' models from scratch without such offset warnings; that is, the training is stable. 2) However, if we train 'L' models, training is very fragile and the offset warnings appear occasionally. If the offset is larger than 100, the offsets in DCN are wrongly predicted (too-large offsets are meaningless), and the performance of these models is also poor. We think the reason is that the offsets in DCN become more fragile when we train a large model with DCN.
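As a rough illustration of that check, here is a minimal sketch of a magnitude guard on the predicted DCN offsets; the function name and the logging are assumptions, not the repo's exact code:

```python
import torch

def warn_on_large_offsets(offset: torch.Tensor, threshold: float = 100.0) -> float:
    """Flag DCN offsets whose mean magnitude has blown up.

    Offsets far beyond the receptive field carry no useful alignment
    information, so a large mean usually means the PCD module has diverged.
    """
    offset_mean = offset.abs().mean().item()
    if offset_mean > threshold:
        print(f'Warning: offset mean is {offset_mean:.1f} (> {threshold}); '
              'DCN offsets are likely wrongly predicted.')
    return offset_mean
```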
In the competition, we trained such large models from smaller models (from C=64 models to C=128 models and then to B=40 models). Even with such a training scheme, we still encountered wrong/too-large offsets. We do not have a good solution; we just stop training and resume from the nearest normal model, where 'normal' means a model whose offsets are not abnormally large.
The training procedures in the competition were complex, and frankly we do not remember the concrete steps. We now provide the training schemes for the 'M' models, and we want to provide a simple and effective way to reproduce the 'L' models; I think I need another two or three weeks to explore such ways.
We are developing more stable and efficient models, but the work is still in progress.
Thank you! This clears my doubts.
For me it is becoming more and more frequent as training progresses -- it seems to be very unstable.
For large models, the unstable-offset phenomenon is indeed very frequent. 1) Restart from the most recent normal model. 2) Try a smaller learning rate. (Sometimes a too-large restart learning rate can also cause this problem; you can use a smaller learning rate for the restarts by setting restart_weights, as in the config fragment below.)
We do not have an effective way to prevent it.
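For illustration, a hedged sketch of that change in the training config; the specific weight values here are assumptions for illustration, not tuned recommendations:

```yaml
train:
  lr_G: !!float 4e-4
  lr_scheme: CosineAnnealingLR_Restart
  T_period: [150000, 150000, 150000, 150000]
  restarts: [150000, 300000, 450000]
  restart_weights: [0.5, 0.5, 0.5]  # assumed example: restart at half of lr_G instead of [1, 1, 1]
  eta_min: !!float 1e-7
```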
Hello, I am trying to train this chain: C64B10woTSA -> C128B10woTSA -> C128B40woTSA -> C128B40wTSA. I encounter this kind of error at the second stage:
File "train.py", line 311, in <module>
main()
File "train.py", line 130, in main
model = create_model(opt)
File "/data/denis/EDVR-master/codes/models/__init__.py", line 17, in create_model
m = M(opt)
File "/data/denis/EDVR-master/codes/models/Video_base_model.py", line 33, in __init__
self.load()
File "/data/denis/EDVR-master/codes/models/Video_base_model.py", line 163, in load
self.load_network(load_path_G, self.netG, self.opt['path']['strict_load'])
File "/data/denis/EDVR-master/codes/models/base_model.py", line 94, in load_network
network.load_state_dict(load_net_clean, strict=strict)
File "/data/anaconda3/envs/dwx815999/lib/python3.7/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
Traceback (most recent call last):
File "train.py", line 311, in <module>
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for EDVR:
size mismatch for upconv1.bias: copying a param with shape torch.Size([256])
from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for upconv2.weight: copying a param with shape torch.Size([256, 64, 3, 3]) from
checkpoint, the shape in current model is torch.Size([256, 128, 3, 3]).
I am trying to change nf from 64 to 128, and strict_load is false. Am I doing everything right? From what I found online, it seems load_state_dict() cannot be used to load an nf=64 model into an nf=128 model.
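Indeed, strict=False only tolerates missing or unexpected keys; a size mismatch still raises. One common workaround is to filter the checkpoint by shape before calling load_state_dict(). A minimal sketch, where load_matching_weights is a hypothetical helper and not part of the repo:

```python
import torch

def load_matching_weights(model, checkpoint_path):
    """Load only the checkpoint tensors whose shapes match the current model.

    strict=False tolerates missing/unexpected keys, but a shape mismatch
    still raises, so mismatched tensors must be filtered out first.
    """
    state = torch.load(checkpoint_path, map_location='cpu')
    model_state = model.state_dict()
    matched = {k: v for k, v in state.items()
               if k in model_state and v.shape == model_state[k].shape}
    skipped = sorted(set(state) - set(matched))
    print(f'Loaded {len(matched)} tensors; skipped {len(skipped)}: {skipped}')
    model.load_state_dict(matched, strict=False)
```

Layers whose shapes changed (upconv1, upconv2, etc. in the trace above) then keep their fresh initialization, so only the shape-compatible weights are transferred.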
Here is the config file that produces the error:
#### general settings
name: C128RB10woTSA_001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S
use_tb_logger: false
model: video_base
distortion: sr
scale: 4
gpu_ids: [0,1,2,3,4,5]

#### datasets
datasets:
  train:
    name: REDS
    mode: REDS
    interval_list: [1]
    random_reverse: false
    border_mode: false
    dataroot_GT: ../datasets/REDS/old_gt #../datasets/REDS/train_sharp_wval.lmdb
    dataroot_LQ: ../datasets/REDS/old_sblur #../datasets/REDS/train_sharp_bicubic_wval.lmdb
    cache_keys: ~

    N_frames: 5
    use_shuffle: true
    n_workers: 3  # per GPU
    batch_size: 120
    GT_size: 128
    LQ_size: 128
    use_flip: true
    use_rot: true
    color: RGB

#### network structures
network_G:
  which_model_G: EDVR
  nf: 128
  nframes: 5
  groups: 8
  front_RBs: 5
  back_RBs: 10
  predeblur: false
  HR_in: true
  w_TSA: false

#### path
path:
  pretrain_model_G: /data/denis/EDVR-master/experiments/001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S_archived_200606-082253/models/150000_G.pth
  strict_load: false
  resume_state: ~
  #/data/denis/EDVR-master/experiments/001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S/training_state/150000.state

#### training settings: learning rate scheme, loss
train:
  lr_G: !!float 4e-4
  lr_scheme: CosineAnnealingLR_Restart
  beta1: 0.9
  beta2: 0.99
  niter: 600000
  warmup_iter: -1  # -1: no warm up
  T_period: [150000, 150000, 150000, 150000]
  restarts: [150000, 300000, 450000]
  restart_weights: [1, 1, 1]
  eta_min: !!float 1e-7

  pixel_criterion: cb
  pixel_weight: 1.0
  val_freq: !!float 5e3

  manual_seed: 0

#### logger
logger:
  print_freq: 100
  save_checkpoint_freq: !!float 5e3
The pretrained model G is the same architecture, but with nf=64.
Hi Xinntao,
Can you please comment on how to deal specifically with the offset warning in the PCD alignment module? I am trying to train an 'L' model and still have ambiguities after piecing together the details from issues #16 and #22. I have put together the workflow required to train an 'L' model below; can you please comment on its correctness?