zmykevin / UC2

CVPR 2021 Official Pytorch Code for UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
MIT License

About the pretraining data #2

Closed shizhediao closed 2 years ago

shizhediao commented 2 years ago

Hi, thanks for your great work! I would like to reproduce the results but could not find the pretraining data, i.e., CC3M and its translations in five languages. Will you release them, or do I need to translate them myself?

Thanks!

zmykevin commented 2 years ago

Hi, here is the link to download the 3M translations in all 6 languages. We will include this link in our README as well.

Best Regards, Mingyang

shizhediao commented 2 years ago

Thanks a lot!

shizhediao commented 2 years ago

BTW, may I ask where I could download the datasets for the downstream tasks? Thanks.

zmykevin commented 2 years ago

Hi Shizhe, currently we have only released the MSCOCO dataset for image-text retrieval fine-tuning. You should be able to get the text database with this command:

wget https://mmaisharables.blob.core.windows.net/uc2/UC2_DATA.tar.gz

You can get the image features from UNITER's GitHub repo. The English VQA dataset can also be obtained from UNITER.
We plan to release the data for VQA Japanese later. Please stay tuned.
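
For anyone who prefers to script this step, the `wget` command above can be approximated in Python. This is only a sketch: the helper names `download` and `extract_archive` are made up for illustration, and it assumes the file is a gzip-compressed tarball, as its `.tar.gz` name suggests. The layout inside the archive is not described in this thread, so the code just lists the top-level entries after extraction.

```python
import os
import tarfile
import urllib.request

# URL taken from the wget command in the comment above.
UC2_DATA_URL = "https://mmaisharables.blob.core.windows.net/uc2/UC2_DATA.tar.gz"


def download(url, archive_path):
    """Download url to archive_path unless that file already exists."""
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz into dest_dir, refusing members that would escape it."""
    dest_root = os.path.realpath(dest_dir)
    os.makedirs(dest_root, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest_root)
    # Return the top-level entries so the caller can see what was unpacked.
    return sorted(os.listdir(dest_root))


# Example usage (not run automatically, since it downloads the archive):
#   archive = download(UC2_DATA_URL, "UC2_DATA.tar.gz")
#   print(extract_archive(archive, "UC2_DATA"))
```

Keeping the download and the extraction separate means an interrupted extraction can be retried without fetching the archive again.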

shizhediao commented 2 years ago

Thanks! One more question: is the VQA Japanese data the same as the original data, or did you apply some processing steps to it? If it is the same, I can use the original directly.

zmykevin commented 2 years ago

The split of the original VG VQA data is lost, so we re-split their data based on the split of the English VG VQA data. This is mentioned in our paper. You can certainly process their data yourself.
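
One possible way to do such a re-split yourself is sketched below. It is not the authors' script. It assumes that every record in both the English and Japanese data carries an `image_id` field and that a Japanese record should go to whichever split its image belongs to in the English data. The field name, the split names, and the `unassigned` bucket are all hypothetical.

```python
from collections import defaultdict


def build_image_to_split(english_splits):
    """Map image_id -> split name, given {"train": [...], "val": [...]} English records."""
    image_to_split = {}
    for split_name, records in english_splits.items():
        for rec in records:
            image_to_split[rec["image_id"]] = split_name
    return image_to_split


def resplit(japanese_records, image_to_split):
    """Group Japanese records by the English split of their image."""
    out = defaultdict(list)
    for rec in japanese_records:
        # Images missing from the English split are kept aside, not dropped.
        split_name = image_to_split.get(rec["image_id"], "unassigned")
        out[split_name].append(rec)
    return dict(out)


# Toy records, invented for illustration only.
english = {"train": [{"image_id": 1}, {"image_id": 2}], "val": [{"image_id": 3}]}
japanese = [
    {"image_id": 1, "question": "q1"},
    {"image_id": 3, "question": "q2"},
    {"image_id": 9, "question": "q3"},
]
splits = resplit(japanese, build_image_to_split(english))
print({name: len(records) for name, records in splits.items()})
```

Splitting by image rather than by individual question keeps all questions about one image in the same split, which avoids leaking images between train and evaluation.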

shizhediao commented 2 years ago

OK, got it!

chenQ1114 commented 2 years ago

Hi,

May I ask about the data "img_token_soft_label" in the config/uc2_pretrain.json file? I guess "/img/gcc_train" in the JSON file is the path to the CC image features, but I don't know what "img_token_soft_label" is or how to get it. I checked UC2_DATA.tar.gz but could not find it.

zmykevin commented 2 years ago

Hi, the data "img_token_soft_label" is actually not used in the pre-training code. It was one of the pre-training experiment settings that we tried but did not use in our final pre-training.
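
To see where such an entry appears in a JSON config before deciding whether to ignore it, a small search helper can be useful. The sketch below is generic and makes no claim about the real structure of config/uc2_pretrain.json. The function `find_occurrences` and the toy config are hypothetical, and the toy config only borrows the two strings quoted in the question above.

```python
import json


def find_occurrences(node, needle, path="$"):
    """Return JSON paths where needle appears in a dict key or in a string value."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if needle in key:
                hits.append(child)
            hits.extend(find_occurrences(value, needle, child))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            hits.extend(find_occurrences(value, needle, f"{path}[{i}]"))
    elif isinstance(node, str) and needle in node:
        hits.append(path)
    return hits


# Toy config, invented for illustration; the real file may be laid out differently.
toy_config = json.loads("""
{
  "train_datasets": [
    {"name": "cc", "img": ["/img/gcc_train"], "img_token_soft_label": ["/some/path"]}
  ]
}
""")
print(find_occurrences(toy_config, "img_token_soft_label"))
print(find_occurrences(toy_config, "gcc_train"))
```

For a real file, replace the toy config with `json.load(open(path))`, where `path` points to your local copy of the config.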

chenQ1114 commented 2 years ago

Thanks!