Closed — devraj89 closed this issue 6 years ago
Actually, I downloaded the pre-trained model from someone's GitHub repository, but I forgot where the source was. (Sorry, I should remember, but I really don't!)
I didn't benchmark it against the official implementations. I just used the pre-trained model directly and wrote code to extract C3D features for other downstream applications.
Here is a video captioning application in which my colleague used the C3D features extracted with this code and pre-trained model. He achieved state-of-the-art results on the video captioning task at that time, and the paper was accepted to CVPR 2018 (oral).
Hi
Thanks @yyuanad for your prompt reply! I managed to get the feature_extractor_frm.py code working and was able to extract features from frames.
However, I am not able to run feature_extractor_vid.py. Can you please specify the requirements to run that code? Do we need to install ffmpeg? I am also a bit confused about which feature to use: if the number of frames is greater than 16, the code extracts multiple C3D feature vectors for the same video. Which of these should ultimately be used?
Any help will be appreciated! Thanks, Devraj
Yes, you should install ffmpeg. Actually, the code for a video and the code for frames are almost the same, except that the video needs to be decoded first. My implementation first decodes the video into frames with ffmpeg on the fly and then processes the decoded frames. Once the video's features have been extracted, the decoded frames are deleted to free disk space.
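To make the decode-then-clean-up flow concrete, here is a minimal sketch (not the repository's actual code) of that pipeline. The ffmpeg invocation, the `%06d.jpg` naming pattern, and the `extract_fn` callback are all illustrative assumptions:

```python
import os
import shutil
import subprocess
import tempfile

def ffmpeg_frame_cmd(video_path, out_dir):
    # Hypothetical helper: build the ffmpeg command that dumps a
    # video into numbered JPEG frames (000001.jpg, 000002.jpg, ...).
    return ["ffmpeg", "-i", video_path, os.path.join(out_dir, "%06d.jpg")]

def extract_video_features(video_path, extract_fn):
    # Decode into a temporary directory, run a frame-based extractor
    # (e.g. the logic from feature_extractor_frm.py), then delete the
    # decoded frames so the disk space is reclaimed.
    tmp_dir = tempfile.mkdtemp(prefix="c3d_frames_")
    try:
        subprocess.run(ffmpeg_frame_cmd(video_path, tmp_dir), check=True)
        frame_paths = sorted(
            os.path.join(tmp_dir, f) for f in os.listdir(tmp_dir)
        )
        return extract_fn(frame_paths)
    finally:
        shutil.rmtree(tmp_dir)  # always clean up the decoded frames
```

The `finally` block is the key point: the frames exist only for the duration of one video's extraction, even if the extractor raises an error.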
One C3D feature vector is extracted for each 16-frame clip of a video. For example, if a video has 32 frames, two C3D feature vectors are extracted: one for frames 1-16 and one for frames 17-32.
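The 16-frame windowing can be sketched as follows. This is an illustration, not the repository's code; in particular, the assumption that a trailing remainder of fewer than 16 frames is dropped is mine:

```python
def clip_indices(num_frames, clip_len=16):
    # Non-overlapping clip boundaries as 0-based (start, end) pairs;
    # each clip yields one C3D feature vector. A trailing remainder
    # shorter than clip_len is dropped (assumed behavior).
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, clip_len)]
```

For a 32-frame video this gives `[(0, 16), (16, 32)]`, i.e. the two clips covering frames 1-16 and 17-32 described above.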
Hi
Thanks for the code and the pre-trained model. Did you train the model yourself, or did you port it from the official implementations? Also, did you benchmark it against the official implementations?
I will be glad to hear from you!
Thanks Devraj