Loss doesn't decrease when training optical flow model based on BNInception

un-knight commented 6 years ago

Thanks for your great work! When I train the TSN flow model on my own dataset (about 25,000 training examples), the training loss and test loss both drop to about 1.8 and then stop decreasing. After that they stay at about 1.8, even though I have tried lowering the learning rate and training for more epochs.

My training settings are the same as those given in "readme.md":

python main.py ucf101 Flow <ucf101_flow_train_list> <ucf101_flow_val_list> \
   --arch BNInception --num_segments 3 \
   --gd 20 --lr 0.001 --lr_steps 190 300 --epochs 340 \
   -b 128 -j 8 --dropout 0.7 \
   --snapshot_pref ucf101_bninception_ --flow_pref flow_
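
For reference, here is a rough sketch of the learning rate this command would give at each epoch. It assumes that `--lr_steps` multiplies the rate by 0.1 at each listed epoch; that decay factor is an assumption about the training script, not something confirmed in this thread, and `lr_at_epoch` is a hypothetical helper name.

```python
def lr_at_epoch(epoch, base_lr=0.001, lr_steps=(190, 300), factor=0.1):
    """Step-decay schedule: scale base_lr by `factor` once per step already reached.

    The 0.1 factor is an assumed default, not taken from the thread.
    """
    passed = sum(epoch >= step for step in lr_steps)  # number of decay steps reached
    return base_lr * factor ** passed


if __name__ == "__main__":
    # Print the rate just before and after each decay point.
    for epoch in (0, 189, 190, 299, 300, 339):
        print(epoch, lr_at_epoch(epoch))
```

Under that assumption the rate stays at the base value for the first 190 epochs, so a plateau that appears well before epoch 190 would not be caused by the schedule having already decayed.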

I don't know why the training loss gets stuck at 1.8, with top-1 accuracy on the training set at only about 60%.

Are there any other methods I can try to fix the problem? Would Adam be more effective than SGD?

RyanCV commented 6 years ago

@un-knight @yjxiong When I run the training code, I get the following error:

"/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 721, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BNInception:
        While copying the parameter named "conv1_7x7_s2_bn.running_var", whose dimensions in the model are torch.Size([64]) and whose dimensions in the checkpoint are torch.Size([1, 64]).

My PyTorch environment is:

>>> torch.__version__
'0.4.0'
>>> torchvision.__version__
'0.2.1'

Any suggestions on solving this? Thanks.

danielyou0230 commented 6 years ago

@RyanCV Downgrading PyTorch from 0.4.0 to 0.3.1 solves the issue; it worked for me!

Reference: https://github.com/Cadene/tensorflow-model-zoo.torch/issues/8
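
If downgrading is not an option, one possible workaround is to reshape the checkpoint tensors before loading them. The sketch below assumes the only mismatch is an extra singleton dimension, as in the traceback above (`[1, 64]` in the checkpoint versus `[64]` in the model). `match_checkpoint_shapes` is a hypothetical helper, not part of tsn-pytorch or PyTorch, and it has not been tested against this repository.

```python
def match_checkpoint_shapes(checkpoint_state, model_state):
    """Return a copy of checkpoint_state with tensors reshaped to the model's shapes.

    A tensor is reshaped only when the two shapes differ solely by
    size-1 dimensions (e.g. [1, 64] vs [64]); anything else is left
    untouched so that real mismatches still raise an error on load.
    """
    fixed = {}
    for name, value in checkpoint_state.items():
        target = model_state.get(name)
        if target is not None and tuple(value.shape) != tuple(target.shape):
            src_dims = [d for d in value.shape if d != 1]
            dst_dims = [d for d in target.shape if d != 1]
            if src_dims == dst_dims:
                value = value.reshape(tuple(target.shape))
        fixed[name] = value
    return fixed


# Hypothetical usage with PyTorch >= 0.4 (the checkpoint path is a placeholder):
#   import torch
#   state = torch.load("path/to/checkpoint.pth")
#   model.load_state_dict(match_checkpoint_shapes(state, model.state_dict()))
```

The helper relies only on `.shape` and `.reshape`, so it leaves parameters with matching shapes, and keys the model does not know about, exactly as they were.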

yjxiong / tsn-pytorch

Loss doesn't decrease when training optical flow model based on BNInception #51