Closed JosephPai closed 1 year ago
Hi,
The key 'titles' refers to multiple captions that were generated by a zero-shot captioning model specifically for each video in the dataset. These captions provide various descriptions or interpretations of the video content.
On the other hand, the key 'title' is the single caption from the 'titles' set with the highest CLIP-based similarity to the video. This caption is selected to serve as an additional training sample pair.
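To make the selection step concrete, here is a minimal sketch of picking 'title' from a set of candidate captions by cosine similarity. It is an illustration only, not the repository's actual code: the function names `cosine` and `rank_captions` are hypothetical, and the small hand-written vectors stand in for real CLIP video and text embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_captions(video_emb, captions, caption_embs):
    """Return captions sorted by descending similarity to the video embedding."""
    scored = [(cosine(video_emb, emb), cap) for cap, emb in zip(captions, caption_embs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cap for _, cap in scored]

# Toy embeddings (assumption): in practice these would come from a CLIP model.
video_emb = [0.9, 0.1, 0.0]
captions = ["a dog runs on grass", "a man cooks pasta", "a puppy plays outside"]
caption_embs = [[0.8, 0.2, 0.1], [0.0, 0.1, 0.9], [0.7, 0.4, 0.0]]

titles = rank_captions(video_emb, captions, caption_embs)  # all candidates, best first
title = titles[0]                                          # the best-matching caption
```

With these toy vectors, `title` ends up as the first caption, since its embedding points in nearly the same direction as the video embedding.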
Please let me know if you have any further questions.
Hi,
Thanks for your response.
If I understand correctly, in the Table 5 ablation study on the number of captions used for data augmentation, the top-1 result comes from the 'title' key, while the top-3 and top-5 results come from the 'titles' key, right?
If so, are the captions in 'titles' sorted by CLIP similarity?
Thank you!
Yes, you are right.
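For anyone reproducing the top-k setting, a rough sketch of reading the captions for a given k could look like the following. It assumes, as confirmed above, that 'titles' is already sorted by descending CLIP similarity; the helper name `captions_for_augmentation` and the sample entry are hypothetical, not taken from the released code.

```python
def captions_for_augmentation(entry, k):
    """Return the k captions to use as extra training pairs for one video.

    Assumes entry['titles'] is sorted by descending CLIP similarity and
    entry['title'] is the single best caption.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if k == 1:
        return [entry["title"]]
    return entry["titles"][:k]

# Made-up entry for illustration only.
entry = {
    "title": "a dog runs on grass",
    "titles": [
        "a dog runs on grass",
        "a puppy plays outside",
        "an animal in a field",
        "a man cooks pasta",
    ],
}

top1 = captions_for_augmentation(entry, 1)
top3 = captions_for_augmentation(entry, 3)
```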
Hi,
Thanks for releasing the code and data. While checking the provided caption data, I found two additional keys in the dataset, 'title' and 'titles'. Could you explain them? For example, how was the data obtained: from a URL or from a captioning model? And what is the difference between the two?
Thank you!