yaoli / arctic-capgen-vid

automatic video description generation with GPU training

Functions to generate captions on new dataset? #13

Open xinleipan opened 7 years ago

xinleipan commented 7 years ago

Which function should I use to generate captions for new video data?

zhuolinumd commented 7 years ago

Have you figured out how to generate captions for a new dataset, @xinleipan? @yaoli, can you provide some guidance?

yaoli commented 7 years ago

Basically, one needs to follow the preprocessing steps described in the paper: turn each new video clip into a series of frames, with each frame represented by a 1024-dimensional vector.
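As a rough illustration of the step described above, here is a minimal sketch that picks equally spaced frames from a clip and maps each one to a fixed-length vector. It is not the repository's preprocessing code. The helper names `sample_frame_indices` and `video_to_features` are hypothetical, `extract_feature` stands in for whatever pretrained 2D CNN wrapper you use, and the default of 26 frames is an assumption taken from the paper passage quoted later in this thread.

```python
def sample_frame_indices(num_frames, num_samples=26):
    """Return num_samples roughly equally spaced frame indices in [0, num_frames)."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def video_to_features(frames, extract_feature, num_samples=26, dim=1024):
    """Map a list of frames to num_samples feature vectors of length dim.

    extract_feature is a hypothetical callable (frame -> sequence of floats),
    for example a wrapper around a pretrained 2D CNN.
    """
    indices = sample_frame_indices(len(frames), num_samples)
    features = [list(extract_feature(frames[i])) for i in indices]
    for vec in features:
        if len(vec) != dim:
            raise ValueError(f"expected {dim}-dim feature, got {len(vec)}")
    return features
```

The result is a list of per-frame vectors for one clip. How these are packed into the pkl files that the training code reads is not covered here.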

zhuolinumd commented 7 years ago

Thank you, @yaoli. I understand this covers the frame-wise features (2D CNN features). But what about the spatio-temporal 3-D CNN features?

You mentioned in the paper: "When using 3-D CNN without temporal attention, we simply use the 2500-dimensional activation of the last fully-connected layer. When we combine the 3-D CNN with the temporal attention mechanism, we leverage the last convolutional layer representation leading to 26 feature vectors of size 352. Those vectors are concatenated with the 2D CNN features resulting in 26 feature vectors with 1376 elements."

Does your code use only the 2-D CNN features? Should we extract the 3-D CNN features as well? Thanks.
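To make the dimension arithmetic in the quoted passage concrete (1024 + 352 = 1376 per time step), here is a small sketch of per-step concatenation. The function name `concat_features`, the placeholder zero vectors, and the 2D-then-3D ordering are all assumptions for illustration. Only the sizes 26, 1024, 352, and 1376 come from this thread.

```python
def concat_features(cnn2d, cnn3d):
    """Concatenate 2D and 3D CNN feature vectors step by step."""
    if len(cnn2d) != len(cnn3d):
        raise ValueError("both inputs need the same number of time steps")
    return [list(a) + list(b) for a, b in zip(cnn2d, cnn3d)]


steps, dim2d, dim3d = 26, 1024, 352
cnn2d = [[0.0] * dim2d for _ in range(steps)]  # placeholder 2D CNN features
cnn3d = [[0.0] * dim3d for _ in range(steps)]  # placeholder 3D CNN features
combined = concat_features(cnn2d, cnn3d)
print(len(combined), len(combined[0]))  # prints: 26 1376
```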

yaoli commented 7 years ago

The code only uses 2D CNN features, so there is no need to extract 3D CNN features.

zhuolinumd commented 7 years ago

Thanks, @yaoli. I ran your training code (train_model.py). It stopped at epoch 86, after finishing training and testing. The test results are pretty good. Why are they so good?

I got these results from the last row of "train_valid_test.txt":

test_B1, test_B2, test_B3, test_B4, test_meteor, test_Cider
0.7927, 0.6691, 0.5727, 0.4730, 0.3187, 0.6907

But Table 1 of your paper reports:

BLEU, METEOR, CIDEr
0.4192, 0.2960, 0.5167

I am wondering why the results are better even though you did not use the local temporal 3D features. Can you explain? Do you mean that the "Enc-Dec + Global (attention mechanism)" model achieves the best performance?

Thanks for your time!

Chilicy commented 6 years ago

@xinleipan @jiang2764 @yaoli Sorry to bother you. I want to generate the pkl files for a new dataset. Do you have the generation scripts? Thanks!