zhujiagang / DTPP

Deep networks with Temporal Pyramid Pooling. The official implementation of "End-to-end Video-level Representation Learning for Action Recognition" (ICPR 2018).
BSD 2-Clause "Simplified" License

Reproducing the inference #4

Open csantosbh opened 5 years ago

csantosbh commented 5 years ago

I'm currently trying to reproduce inference on arbitrary videos from the HMDB51 dataset using the pretrained weights.

So far, I have:

  1. Compiled OpenCV with the contrib module and cuda support
  2. Successfully compiled caffe (caffe-tpp-net) with CUDA and python support
  3. Set up the PYTHONPATH environment variable so python2.7 can find the compiled modules (see the sketch after this list)
  4. Downloaded the pretrained weights with get_init_models.sh and get_kinetics_pretraining_models.sh
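
For step 3, that amounted to something like the following (the path is a placeholder for wherever caffe-tpp-net was built):

export PYTHONPATH=/path/to/caffe-tpp-net/python:$PYTHONPATH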

Since README.md doesn't mention how to perform inference, I read README_old.md and found that there is a script tools/eval_net.py for that purpose. So I ran:

python tools/eval_net.py hmdb51 1 rgb /var/datasets/hmdb51/ models/hmdb51/flow_tpp_delete_dropout_deploy.prototxt ./init_models/hmdb51_split_1_tsn_flow_reference_bn_inception.caffemodel

But, after a ton of messages, that gave me the following error:

Traceback (most recent call last):
  File "tools/eval_net.py", line 125, in <module>
    video_scores = map(eval_video, eval_video_list)
  File "tools/eval_net.py", line 69, in eval_video
    video_frame_path = f_info[0][vid]
KeyError: '20060723sfjffbumblebeesuitman_run_f_cm_np2_ri_med_1'

The path to the dataset is correct, and I have unpacked all rar files from the HMDB dataset. README_old.md mentions a preprocessing script, scripts/extract_optical_flow.sh, but that script doesn't exist in the master branch.

So, are there any further steps necessary for reproducing inference?

zhujiagang commented 5 years ago

Sorry for the trouble you've had running the code due to my carelessness. I've updated the README.md, and you can refer to eval_net_tpp_hmdb.py for more details. Our trained models on UCF101 and HMDB51 have not been released.

csantosbh commented 5 years ago

Hi @zhujiagang, thank you so much for your swift replies! I'm currently working on reproducing the training with the hmdb_scripts_split_1/train_rgb_tpp_delete_dropout_split_1.sh script on a Titan X. I ran into a few issues and had to modify some files to work around them, namely:

  1. Reducing batch_size to 1 to avoid out-of-memory errors
  2. Setting new_width and new_height in the train prototxt, since my extracted frames were not pre-resized

I've been running the training for several hours now and this is the latest iteration output:

I1031 15:06:09.097961  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 877.991 > 40) by scale factor 0.0455585
I1031 15:06:21.079704  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 700.793 > 40) by scale factor 0.0570782
I1031 15:06:33.580734  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1143.5 > 40) by scale factor 0.0349803
I1031 15:06:46.480859  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1017.31 > 40) by scale factor 0.0393192
I1031 15:06:58.553812  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1014.26 > 40) by scale factor 0.0394377
I1031 15:07:10.659497  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1247.51 > 40) by scale factor 0.0320639
I1031 15:07:23.238704  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 968.414 > 40) by scale factor 0.0413046
I1031 15:07:35.426038  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1048.87 > 40) by scale factor 0.0381362
I1031 15:07:48.065246  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1141.9 > 40) by scale factor 0.0350292
I1031 15:08:00.421540  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 734.556 > 40) by scale factor 0.0544547
I1031 15:08:12.666364  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 748.326 > 40) by scale factor 0.0534526
I1031 15:08:25.080652  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1113.1 > 40) by scale factor 0.0359356
I1031 15:08:37.743530  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 878.843 > 40) by scale factor 0.0455144
I1031 15:08:49.748545  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1046.37 > 40) by scale factor 0.0382275
I1031 15:09:01.976547  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1059.37 > 40) by scale factor 0.0377582
I1031 15:09:14.561251  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1073.01 > 40) by scale factor 0.0372784
I1031 15:09:27.069676  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1176.19 > 40) by scale factor 0.0340082
I1031 15:09:39.748011  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1173.07 > 40) by scale factor 0.0340987
I1031 15:09:51.674002  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1058.21 > 40) by scale factor 0.0377996
I1031 15:10:04.277062  4431 solver.cpp:625] Gradient clipping: scaling down gradients (L2 norm 1022.46 > 40) by scale factor 0.0391213
I1031 15:10:16.263995  4431 solver.cpp:240] Iteration 4740, loss = 0.184195
I1031 15:10:16.264025  4431 solver.cpp:255]     Train net output #0: accuracy = 1
I1031 15:10:16.264046  4431 solver.cpp:255]     Train net output #1: loss = 0.0173458 (* 1 = 0.0173458 loss)
I1031 15:10:16.264051  4431 solver.cpp:640] Iteration 4740, lr = 0.001
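
As far as I can tell, each scale factor is just clip_gradients divided by the gradient L2 norm; a quick Python check against the first log line above:

# Caffe scales all gradients by clip_gradients / l2_norm whenever the
# global L2 norm exceeds clip_gradients (40 in this solver)
l2_norm = 877.991        # from the first log line
clip_gradients = 40.0
scale = clip_gradients / l2_norm if l2_norm > clip_gradients else 1.0
print(scale)             # ~0.0455585, matching the logged factor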

The gradient scaling and the accuracy of 1 worry me a bit. Is that expected?

If I understand the paper correctly, I should also train a temporal model with the hmdb_scripts_split_1/train_flow_tpp_delete_dropout_split_1.sh script, is that correct? Do I need to preprocess the input images somehow? Is that what lib/dense-flow is for? Is there a script that automates this process?

zhujiagang commented 5 years ago

Regarding the out-of-memory issue with batch_size: how did you compile caffe-tpp-net? Did you install OpenMPI before compiling it? Please refer to the commands in the TSN repo for details on compiling Caffe with OpenMPI; I've found that compiling their Caffe with OpenMPI saves a lot of memory.
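
If I remember their build steps correctly, the OpenMPI-enabled build is invoked roughly like this (treat the exact flags as an assumption and check the TSN README):

MPI_PREFIX=/path/to/openmpi bash build_all.sh MPI_ON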

Which Caffe are you using? I don't remember needing to set new_width and new_height with caffe-tpp-net; crop_size: 224 is enough.

I used the script in the TSN repo to extract frames and optical flow; please refer to that repo.
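
If I recall their README correctly, the extraction script is used roughly like this:

bash scripts/extract_optical_flow.sh SRC_FOLDER OUT_FOLDER NUM_WORKER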

I think your training process is fine with batch_size: 1, though once you can run with batch_size: 4 the training curve may change. I also want to remind you that I decreased the learning rate by hand whenever validation accuracy stopped increasing, as the paper says:

Instead of decreasing the learning rate according to a fixed schedule, the learning rate is lowered by a factor of 10 after validation error saturates.

A better way would be to find a fixed schedule, which is more appropriate for ablation studies.
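
For example, such a schedule could be written directly in the solver prototxt; the step values below are placeholders, not the ones from the paper:

lr_policy: "multistep"
base_lr: 0.001
gamma: 0.1          # divide the learning rate by 10 at each stepvalue
stepvalue: 3000
stepvalue: 6000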

For the base learning rates of the RGB and flow streams, you should also follow the paper.

csantosbh commented 5 years ago

Hello @zhujiagang, thanks again for your clear answer!

I probably needed to set new_width and new_height because I had extracted the dataset frames myself using ffmpeg. When I extracted the optical flow with the TSN repo, I noticed that their script also extracts the frames, so I suppose it resizes the images to a fixed minimum dimension.
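
For reference, a resize of that kind with ffmpeg would look something like the line below; the 256-pixel short side and the file name pattern are my assumptions, not necessarily what TSN uses:

ffmpeg -i video.avi -vf "scale=-2:256" frames/img_%05d.jpg   # height fixed at 256, width follows the aspect ratio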

Now I have another question. I have finished training both the RGB and the Flow models, and I'd like to combine them to make predictions as in the paper. Please correct me if I'm wrong, but the eval_tpp_net_hmdb.py script only seems to evaluate the RGB or the Flow model separately, not the fused model described in your paper. Is there another script that combines both trained models for inference?

Thanks again!

zhujiagang commented 5 years ago

@csantosbh You can refer to the eval_scores_rgb_flow.py
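
The idea is a weighted late fusion of the per-video scores saved by the two evaluation runs. A rough sketch in Python (the .npz key names, file names, and the 1:1.5 weighting are illustrative assumptions, not necessarily what eval_scores_rgb_flow.py uses):

import numpy as np

# weighted late fusion: sum the per-class scores from both streams,
# then take the argmax per video
rgb = np.load('rgb_scores.npz')     # hypothetical output of the RGB run
flow = np.load('flow_scores.npz')   # hypothetical output of the flow run
fused = 1.0 * rgb['scores'] + 1.5 * flow['scores']
pred = fused.argmax(axis=1)
acc = (pred == rgb['labels']).mean()
print('fusion accuracy: {:.2%}'.format(acc))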

csantosbh commented 5 years ago

I see! eval_scores_rgb_flow.py complained about some missing .npz files, so I guess I should have changed the save_scores variable in the eval_tpp_net_hmdb.py script, right? So the whole process should be:

  1. Open eval_tpp_net_hmdb.py and edit the net_weights and save_scores variables for the RGB evaluation
  2. Run eval_tpp_net_hmdb.py
  3. Open eval_tpp_net_hmdb.py and edit the net_weights and save_scores variables for the Flow evaluation
  4. Run eval_tpp_net_hmdb.py
  5. Open eval_scores_rgb_flow.py and set the score_files variable to the names of the .npz files generated in steps 2 and 4
  6. Run eval_scores_rgb_flow.py

Is that correct? I'm running inference again, since my .npz file was overwritten when I ran the flow evaluation (I had only changed the net_weights variable).

zhujiagang commented 5 years ago

Yes!

csantosbh commented 5 years ago

Thank you @zhujiagang! I was able to reproduce the paper's results, reaching ~73% accuracy with the fusion model.

Now I'm trying to use the Kinetics pretrained model to improve accuracy, as in the paper. What I did was edit hmdb_scripts_split_1/train_rgb_tpp_delete_dropout_split_1.sh and change the --weights flag to point to kinetics_pretraining_models/bn_inception_kinetics_rgb_pretrained/bn_inception_kinetics_rgb_pretrained.caffemodel (I would do something similar for the Flow trainer). However, Caffe crashes when I try to run training like this. Is there anything else that needs to be changed?

zhujiagang commented 5 years ago

@csantosbh You should use kinetics_hmdb_split_1/train_kinetics_rgb_tpp_p124_split_1.sh instead, because the TSN authors provide the Kinetics pretrained model and prototxt at http://yjxiong.me/others/kinetics_action/ under different names from the ImageNet pretraining setup, so the training script and prototxt must match.
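
Assuming it is launched like the other training scripts:

bash kinetics_hmdb_split_1/train_kinetics_rgb_tpp_p124_split_1.sh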