yjxiong / temporal-segment-networks

Code & Models for Temporal Segment Networks (TSN) in ECCV 2016
BSD 2-Clause "Simplified" License

About the Inception V3 model pretrained on Kinetics #132

Closed: bityangke closed this issue 6 years ago

bityangke commented 7 years ago

Hi Yuanjun, we have checked the Caffe model and prototxt file (Inception V3) you released a few days ago. I found that the dimensions of all the decomposed convolution filters (7x7 -> 1x7, 7x1) are the reverse of those in the Inception V3 models I have used (both Caffe and TF): where your kernel size is 1x7, theirs is 7x1, and vice versa. This confuses me, because I want to convert your Inception V3 weights to TensorFlow and Keras so that more people can enjoy this excellent work. I have "converted" the weights to Keras (TensorFlow backend), but the video test results are not correct.

yjxiong commented 7 years ago

Thanks for the issue. Let me have a look.

Actually, the model architecture is extracted from the released tensorflow protobuf file in Dec. 2016. We retrained the model weights on ImageNet and Kinetics using Caffe.

bityangke commented 7 years ago

@yjxiong Hi Yuanjun, could you please tell me what mean values you used for your models (RGB and flow)? Thanks!

yjxiong commented 7 years ago

Hi @bityangke

I think your finding is correct. It's possible that I switched the order of the separable filter sizes during extraction. But this will not affect performance as long as we keep to the structure in the Caffe proto. The model achieved 94.15% single-crop top-5 accuracy on ILSVRC12. If you would like to use the weights in TF, you can modify the original Inception V3 code to match this change.
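
For anyone attempting such a conversion, here is a minimal numpy sketch (function name and shapes are illustrative, not from the repo). Caffe stores convolution weights as (out, in, kH, kW) while TensorFlow expects (kH, kW, in, out); the swapped 1x7/7x1 sizes then have to be handled by editing the TF layer definitions to match the Caffe proto, not by transposing the spatial axes:

```python
import numpy as np

def caffe_conv_to_tf(blob):
    # Caffe conv weights: (out_c, in_c, kH, kW)
    # TF conv kernels:    (kH, kW, in_c, out_c)
    return np.transpose(blob, (2, 3, 1, 0))

# Hypothetical separable kernel that is 1x7 in this release where stock
# Inception V3 declares 7x1: keep the Caffe kernel shape and change the
# kernel size in the TF model definition. Do NOT swap the spatial axes,
# since a 1x7 and a 7x1 convolution act on different image dimensions.
w_caffe = np.random.randn(192, 128, 1, 7)
w_tf = caffe_conv_to_tf(w_caffe)
print(w_tf.shape)  # (1, 7, 128, 192)
```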

The mean values are the same as for the other released TSN models:

- [104, 117, 123] for RGB
- 128 for flow

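
A minimal numpy sketch of the subtraction (the function names are illustrative, and BGR channel order is assumed per the Caffe convention):

```python
import numpy as np

RGB_MEAN = np.array([104.0, 117.0, 123.0])  # per-channel mean, BGR order
FLOW_MEAN = 128.0                            # scalar mean for flow images

def preprocess_rgb(frame):
    # frame: HxWx3 uint8 image in BGR order (Caffe convention)
    return frame.astype(np.float32) - RGB_MEAN

def preprocess_flow(flow):
    # flow: HxW uint8 image decoded from an x- or y-flow JPEG
    return flow.astype(np.float32) - FLOW_MEAN
```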
bityangke commented 7 years ago

Thanks very much!

bityangke commented 7 years ago

One more question: what input image size do you use to crop the 299 x 299 patches? I think it was 341 x 452, but I am not sure about this. Thanks in advance!

yjxiong commented 7 years ago

Yes, you are right. For 10-crop testing, the images are first scaled to a width of 452 and a height of 341.
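
As a concrete sketch: scale each frame to 452x341, then take the four corner crops and the center crop at 299x299 plus their horizontal flips, giving 10 crops in total. (The crop enumeration here is the common convention; the repo's own transform may order them differently.)

```python
import numpy as np

def ten_crops(frame, crop=299):
    # frame: resized image of shape (341, 452, C) -- height 341, width 452
    h, w = frame.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0),
               (h - crop, w - crop), ((h - crop) // 2, (w - crop) // 2)]
    crops = [frame[y:y + crop, x:x + crop] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]  # horizontal flips of the 5 crops
    return crops

crops = ten_crops(np.zeros((341, 452, 3), np.uint8))
print(len(crops), crops[0].shape)  # 10 (299, 299, 3)
```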

bityangke commented 7 years ago

Do you have any results for these two models fine-tuned on UCF101? Thanks.

yjxiong commented 7 years ago

@bityangke We have updated the website with the fine-tuning performance on UCF101.

You can find it on http://yjxiong.me/others/kinetics_action/#transfer

bityangke commented 7 years ago

Thanks very much! The performance is amazing! How did you schedule the learning rate for both nets?

Tonyfy commented 6 years ago

@bityangke Maybe just keep the learning rate settings the same as in the original TSN training (ImageNet-pretrained). I am launching the fine-tuning with only the weights file changed.

bityangke commented 6 years ago

Thanks for sharing your experience, @Tonyfy. I will try it!

yjxiong commented 6 years ago

For UCF101, I simply used a 0.001 initial learning rate, divided by 10 every 10 epochs, for a total of 30 epochs. All BN layers are frozen. No other changes are needed.
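
That schedule can be written as a one-liner (a sketch with illustrative names, not code from the repo):

```python
def finetune_lr(epoch, base_lr=1e-3, step=10, gamma=0.1):
    # 0.001 initial lr, divided by 10 every 10 epochs, 30 epochs total
    return base_lr * gamma ** (epoch // step)

for e in (0, 10, 20):
    # lr drops by a factor of 10 at epochs 10 and 20
    print(e, finetune_lr(e))
```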

Tonyfy commented 6 years ago

Hi @yjxiong, I see that your reported pretraining performance on UCF101 is as below:

Model | Pretraining | RGB | Flow | RGB+Flow
-- | -- | -- | -- | --
BNInception | ImageNet only | 85.4% | 89.4% | 94.9%
BNInception | ImageNet + Kinetics | 91.1% | 95.2% | 97.0%

How do you fine-tune with ImageNet + Kinetics? Do you first fine-tune the ImageNet-pretrained model on Kinetics, then fine-tune the resulting model on UCF101?

I used the Kinetics-pretrained model you released to fine-tune on UCF101 split 1 with the RGB modality, but I only obtained 34% accuracy. Lots of thanks!

yjxiong commented 6 years ago

@Tonyfy

They are achieved by fine-tuning the released models on UCF101. Please check your settings.

Tonyfy commented 6 years ago

@yjxiong I am fine-tuning the TSN BN-Inception RGB model on ucf101_split1 from the Kinetics-pretrained bn_inception_kinetics_rgb_pretrained.caffemodel. Here are the settings I modified:

1. In tsn_bn_inception_rgb_train_val.prototxt, change bn_param frozen from "false" to "true" in conv1 (the other BN layers already have frozen set to true).
2. In tsn_bn_inception_rgb_solver.prototxt, change stepsize from 1500 to 5000 (10 epochs) and max_iter from 3500 to 16000 (30 epochs plus 1000 extra iterations).

All other settings stay unchanged: 4 GPUs, train_batch_size = 32, test_interval = 500 (one training epoch: 500 x 32 x 4 x 3 (seg_num) ~= 9537 (videos) x 25 (snippets)), gamma = 0.1. The fine-tuning result is below:

```
I0930 12:20:45.154777 14451 solver.cpp:240] Iteration 15960, loss = 0.786447
I0930 12:20:45.154932 14451 solver.cpp:255]     Train net output #0: loss = 0.757018 (* 1 = 0.757018 loss)
I0930 12:20:45.154943 14451 solver.cpp:640] Iteration 15960, lr = 1e-06
I0930 12:20:55.122572 14451 solver.cpp:240] Iteration 15980, loss = 0.824895
I0930 12:20:55.122633 14451 solver.cpp:255]     Train net output #0: loss = 0.842794 (* 1 = 0.842794 loss)
I0930 12:20:55.122649 14451 solver.cpp:640] Iteration 15980, lr = 1e-06
I0930 12:21:04.659654 14451 solver.cpp:511] Snapshotting to models/ucf101_split1_tsn_kinetics_rgb_bn_inception/_iter_16000.caffemodel
I0930 12:21:04.782093 14451 solver.cpp:519] Snapshotting solver state to models/ucf101_split1_tsn_kinetics_rgb_bn_inception/_iter_16000.solverstate
I0930 12:21:05.033846 14451 solver.cpp:415] Iteration 16000, loss = 0.483241
I0930 12:21:05.033885 14451 solver.cpp:433] Iteration 16000, Testing net (#0)
I0930 12:21:23.307076 14451 solver.cpp:490]     Test net output #0: accuracy = 0.345526
I0930 12:21:23.307193 14451 solver.cpp:490]     Test net output #1: loss = 3.61031 (* 1 = 3.61031 loss)
I0930 12:21:23.307201 14451 solver.cpp:420] Optimization Done.
I0930 12:21:23.307205 14451 caffe.cpp:203] Optimization Done.
```

yjxiong commented 6 years ago

@Tonyfy I don't know how you calculated the iterations, but 10 epochs should be around 700 iterations given the 128 batch size and the UCF101 training set size of 9000+. So the max iteration number should be around 2200. The learning rate will never decay to 1e-6 under this setting.

Also, the training loss in your log looks far too high, even higher than in the ImageNet-pretraining case. Make sure you are loading the correct pretrained weights via --weights.
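
The arithmetic behind those numbers works out as a back-of-the-envelope sketch (9537 is the UCF101 split-1 training video count mentioned earlier in the thread), roughly matching the ~700 and ~2200 figures quoted above:

```python
videos = 9537                           # UCF101 split-1 training videos
batch = 128                             # effective batch size (32 x 4 GPUs)
iters_per_epoch = -(-videos // batch)   # ceiling division
step_iters = 10 * iters_per_epoch       # lr decay step: every 10 epochs
max_iters = 30 * iters_per_epoch        # 30 epochs total
print(iters_per_epoch, step_iters, max_iters)  # 75 750 2250
```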

Tonyfy commented 6 years ago

@yjxiong, thanks for your reply. I will try your settings and check carefully.

and Happy National Day & Moon Festival.

whwu95 commented 6 years ago

@yjxiong Hello, you said that for UCF101 you simply used a 0.001 initial learning rate, divided by 10 every 10 epochs, for a total of 30 epochs. I have two questions. First, I know 0.001 is for the spatial net; what about the temporal net? In the original TSN solvers, the learning rate is 0.001 for RGB and 0.005 for flow. Second, in the original TSN solvers the batch size is 128 (32 x 4 x 1); the RGB solver decays by 10 every 20 epochs (1500 iterations) for a total of 45 epochs (3500 iterations), and the flow solver decays by 10 at [10000, 16000] for a total of 18000 iterations. This differs from decaying by 10 every 10 epochs for a total of 30 epochs. Happy National Day & Moon Festival.

yjxiong commented 6 years ago

@whwu95

As I said, the flow model uses the same settings.

This difference in learning strategy follows naturally from the much smaller domain gap when pretraining on Kinetics (video recognition to video recognition, versus image object recognition to video recognition).

This is more evident for the flow model. In the ImageNet-only case, the flow model is initialized by cross-modality pretraining and takes many iterations, at higher learning rates, to adapt to flow input. For the Kinetics-pretrained models, this process was already done during pretraining and is not needed later on.

bityangke commented 6 years ago

I have fine-tuned the Inception V3 RGB net on UCF101 split 1, ending at 93.18% accuracy (25 frames, 10 crops). But the flow net's accuracy is only about 91%. I save the flow images as .jpeg files at the video's original resolution and resize them to 341x452 when used. Could that be a problem? I think it might matter when the motion vectors are very small.

bityangke commented 6 years ago

When the motion vectors are very small, they might be affected by the loss introduced by resizing.

yjxiong commented 6 years ago

Resizing is OK in our experience. We also use on-the-fly resizing for the Inception V3 models.

zhujiagang commented 6 years ago

@yjxiong @bityangke In the original TSN, the mean value for RGB is [104 117 123], not [104 117 128].

yjxiong commented 6 years ago

@zhujiagang Noted with thanks.

inakam commented 6 years ago

Hi @yjxiong @bityangke, thank you for publishing this wonderful model. However, since I use Keras, I cannot use it (and I could not convert it successfully). Could you release a converted model for Keras?

Thank you.