zmykevin / UC2

CVPR 2021 Official Pytorch Code for UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
MIT License

Questions about the setting of VG VQA JA #4

Closed · zengyan-97 closed this issue 2 years ago

zengyan-97 commented 2 years ago

Hi,

I tried to reproduce UC2 on VG VQA JA, but I got an accuracy of ~25% instead of the reported ~34%.

I followed the UC2 paper to preprocess the data (I submitted an issue about the data split before; thank you again for replying), but I got 37,674 test examples instead of the ~30K stated in the paper. So my first question is: did you filter the test data? Can you share the processed data?

Besides, I found that many answers among the top-3000 frequent answers have very similar meanings. As a result, the model made the following "wrong" predictions, which arguably should have been counted as correct (see the normalization sketch after this list):

- gt: 2人, pred: 2人
- gt: 1本, pred: 1本
- gt: 緑, pred: 緑色
- gt: 赤, pred: 赤色
- gt: 白色, pred: 白
- gt: 白, pred: 白色
- gt: 一本, pred: 1本
- gt: 1つ, pred: 1個
- gt: 1, pred: 1つ
- gt: 2本, pred: 2つ
- gt: 1本, pred: 1つ
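To illustrate what I mean, a rough normalization along the lines of the sketch below (purely illustrative; the rules are my own assumptions, not anything taken from the UC2 code) would merge most of the pairs above before scoring:

```python
import re
import unicodedata

def normalize_ja_answer(ans: str) -> str:
    """Rough normalization for Japanese VQA answers (illustrative only)."""
    # NFKC so full-width digits match their half-width forms.
    ans = unicodedata.normalize("NFKC", ans).strip()
    # Kanji numerals 1-9 -> Arabic digits (covers the frequent count answers).
    for kanji, digit in zip("一二三四五六七八九", "123456789"):
        ans = ans.replace(kanji, digit)
    # Bare colour words: "緑色" -> "緑", "赤色" -> "赤", "白色" -> "白".
    if len(ans) > 1 and ans.endswith("色"):
        ans = ans[:-1]
    # Drop generic counters after a digit: "2つ" / "2個" / "2本" -> "2".
    ans = re.sub(r"^(\d+)(つ|個|本)$", r"\1", ans)
    return ans

# These pairs from the list above would then be scored as matches.
pairs = [("緑", "緑色"), ("一本", "1本"), ("1つ", "1個"), ("白色", "白"), ("2本", "2つ")]
print(all(normalize_ja_answer(a) == normalize_ja_answer(b) for a, b in pairs))  # True
```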

So my second question is: did you pick or postprocess the top-3000 frequent answers using some strategy? Can you share the list of top-3000 frequent answers that you used?

Thanks!

zmykevin commented 2 years ago

Hello Zeng, sorry for the late reply.

I don't remember whether we did any further filtering of the data. I think I made a mistake when reporting the scale of the validation set: it is indeed 37K images rather than 30K. I will update the paper to fix this, and I will also share the Japanese VQA data.

Regarding the top-3000 frequent answers, we did apply some postprocessing that to some extent addresses the near-duplicate answers you found. To save you the work, I will directly share the final 3000 answers we used in our experiments: https://drive.google.com/drive/folders/1BTL6nGe2YIOHEK5PqGO8UCUjTwiQ13d8?usp=sharing
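For context, such an answer vocabulary for VQA-as-classification is typically built along the following lines (a minimal sketch with assumed file names and schema, not the exact UC2 preprocessing script):

```python
import json
from collections import Counter

def normalize_ja_answer(ans: str) -> str:
    # Plug in whatever answer normalization is used (see the sketch above);
    # a bare strip() keeps this snippet self-contained.
    return ans.strip()

# Hypothetical input file and schema, assumed only for illustration:
# a JSON list of {"question": ..., "answer": ...} training pairs.
with open("vg_vqa_ja_train.json", encoding="utf-8") as f:
    qa_pairs = json.load(f)

# Count normalized answers and keep the 3000 most frequent ones as the label set.
counts = Counter(normalize_ja_answer(qa["answer"]) for qa in qa_pairs)
answer_list = [ans for ans, _ in counts.most_common(3000)]
ans2label = {ans: i for i, ans in enumerate(answer_list)}

with open("top3000_answers_ja.json", "w", encoding="utf-8") as f:
    json.dump(answer_list, f, ensure_ascii=False, indent=2)
```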

zengyan-97 commented 2 years ago

Thank you very much!