whwu95 / Cap4Video

【CVPR'2023 Highlight & TPAMI】Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
https://arxiv.org/abs/2301.00184
MIT License

Question of the caption file. #5

Closed JosephPai closed 1 year ago

JosephPai commented 1 year ago

Hi,

Thanks for releasing the code and data. I checked the provided caption data and found two additional keys in the dataset, 'title' and 'titles'. Could you explain them? For example, how was the data obtained: from the video URL or from a captioning model? And what is the difference between the two?

Thank you!

whwu95 commented 1 year ago

Hi,

The key 'titles' refers to multiple captions that were generated by a zero-shot captioning model specifically for each video in the dataset. These captions provide various descriptions or interpretations of the video content.

On the other hand, the key 'title' is the single caption from the 'titles' set that has the highest CLIP similarity to the video. This caption is selected to serve as an additional training sample pair.
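As a rough illustration of the selection described above (not the repository's actual code), the sketch below ranks candidate captions by cosine similarity to a video embedding and keeps the best one as 'title'. The helper names `cosine`, `rank_captions`, and `build_entry` are hypothetical, and the two-dimensional vectors are toy stand-ins for real CLIP video and text embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_captions(video_emb, captions, caption_embs):
    """Return captions sorted by descending similarity to the video."""
    scored = sorted(
        zip(captions, caption_embs),
        key=lambda pair: cosine(video_emb, pair[1]),
        reverse=True,
    )
    return [caption for caption, _ in scored]


def build_entry(video_emb, captions, caption_embs):
    """Build a record with the 'titles' and 'title' keys discussed in the thread."""
    ranked = rank_captions(video_emb, captions, caption_embs)
    return {"titles": ranked, "title": ranked[0]}


# Toy data: assumed placeholders, not real CLIP outputs.
video_emb = [1.0, 0.0]
captions = ["a dog runs", "a man cooks", "a dog plays fetch"]
caption_embs = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]

entry = build_entry(video_emb, captions, caption_embs)
print(entry["title"])
```

In practice the embeddings would come from a CLIP image encoder (pooled over frames) and a CLIP text encoder; how the frames are pooled is not stated in this thread.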

Please let me know if you have any further questions.

JosephPai commented 1 year ago

Hi,

Thanks for your response.

If I understand correctly, in the Table 5 ablation study on the number of captions used for data augmentation, the top-1 result comes from the 'title' key, while the top-3 and top-5 results come from the 'titles' key, right?

If so, are the captions in 'titles' sorted by CLIP similarity?

Thank you!

whwu95 commented 1 year ago

Yes, you are right.
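
For readers who want to reproduce the top-k setting, here is a minimal sketch of picking augmentation captions from one record. It assumes the confirmation above also covers the ordering question, i.e. that 'titles' is sorted by descending CLIP similarity. The function names and the `video_id` field are hypothetical and only serve the example.

```python
def augmentation_captions(entry, k):
    """Pick the k captions used as extra training pairs for one video.

    Assumes entry['titles'] is sorted by descending CLIP similarity,
    so the first k items are the top-k captions.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if k == 1:
        # Top-1 is stored separately under 'title'.
        return [entry["title"]]
    return entry["titles"][:k]


def make_pairs(entry, k):
    """Pair the video id with each selected caption."""
    return [(entry["video_id"], caption) for caption in augmentation_captions(entry, k)]


# Toy record; 'video_id' is an assumed field name.
record = {
    "video_id": "video0",
    "title": "a dog plays fetch",
    "titles": ["a dog plays fetch", "a dog runs", "a man cooks", "a park at noon"],
}

pairs = make_pairs(record, k=3)
print(pairs)
```

If 'titles' were not sorted, the slice would have to be replaced by an explicit re-ranking with CLIP scores.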