zmykevin / UC2

CVPR 2021 Official Pytorch Code for UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
MIT License

Results for zero-shot cross-lingual image-text retrieval #5

Closed ghchen18 closed 2 years ago

ghchen18 commented 2 years ago

Hi,

Thanks for your great work. Have you tried the zero-shot cross-lingual image-text retrieval task as in M3P, i.e., pretraining on the Conceptual Captions dataset (CC3M) and then testing directly on COCO-ZH or COCO-JA without fine-tuning on the English COCO training data? If yes, could you share the results?

Many thanks!

zmykevin commented 2 years ago

Hello, unfortunately we did not do zero-shot testing on COCO. But you should be able to run this experiment easily by testing the pre-trained model checkpoint we provide directly on these two datasets. Also, check this paper: they report the zero-shot performance of our model, without any fine-tuning, on Multi30K+COCO in Table 4. Note, however, that their UC2 is a re-implementation of our method, so the numbers may not match what you would get with my provided checkpoint.

ghchen18 commented 2 years ago

Got it. Thanks for your help.

ghchen18 commented 2 years ago

Hi,

Sorry to reopen the issue, but I have a few more questions about the experiments.

What is the test split for En and Ja in the COCO image-text retrieval results in Table 1? Is it the Karpathy test split of 5k images and 25k captions, or do you use the 1k test split?

The paper says 'We use the train/dev/test splits for English and Japanese defined in [27], and present results on the 1K test set.' So it should be the 1k test set, which means the COCO-En and COCO-Ja retrieval results are computed over 1k images and 1k captions? Am I right?

By the way, the config at https://github.com/zmykevin/UC2/blob/master/config/uc2_mscoco_itm.json#L65 lists multiple 1k text DBs per folder. Do you take the 5k test split, divide it into 5 x 1k image subsets, and then retrieve within each 1k subset?

Thanks a lot.

zmykevin commented 2 years ago

Correct, it is tested on the 1k split. We evenly divide the 5k test images into five 1k subsets and report the average results over these five splits. Check out this post to see how the 1k results are computed: https://github.com/fartashf/vsepp/issues/26
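
For reference, here is a minimal sketch (not taken from the UC2 codebase) of how the averaged 1k-split image-to-text retrieval metrics are typically computed, following the vsepp convention linked above. It assumes you already have a 5000 x 25000 image-to-caption similarity matrix `sims`, with captions grouped so that the 5 captions of image `i` occupy columns `5*i` to `5*i+4`; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def i2t_recall_1k_splits(sims, ks=(1, 5, 10)):
    """Average image->text R@K over five 1k-image folds of the 5k COCO test set.

    sims: assumed (5000, 25000) similarity matrix; captions [5*i, 5*i+5)
          belong to image i (convention borrowed from the vsepp evaluation).
    """
    fold_results = []
    for fold in range(5):
        img_lo, img_hi = fold * 1000, (fold + 1) * 1000
        cap_lo, cap_hi = fold * 5000, (fold + 1) * 5000
        fold_sims = sims[img_lo:img_hi, cap_lo:cap_hi]  # (1000, 5000)

        ranks = np.zeros(1000)
        for i in range(1000):
            order = np.argsort(fold_sims[i])[::-1]           # captions sorted by score, best first
            gt = np.arange(5 * i, 5 * i + 5)                 # ground-truth caption indices within the fold
            ranks[i] = np.where(np.in1d(order, gt))[0].min() # rank of the best-ranked correct caption

        fold_results.append([100.0 * np.mean(ranks < k) for k in ks])

    return np.mean(fold_results, axis=0)  # R@1, R@5, R@10 averaged over the five folds
```

Text-to-image retrieval would be computed analogously from the transposed similarity matrix, ranking the 1k images for each of the 5k captions in the fold.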

ghchen18 commented 2 years ago

Got it. Many thanks.