zengyan-97 / CCLM

Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training (ACL 2023)
BSD 3-Clause "New" or "Revised" License

Multi 30K #3

Closed shiyanlou-015555 closed 2 years ago

shiyanlou-015555 commented 2 years ago

I collected Multi30K from "https://github.com/multi30k/dataset", but it has only 30K sentences each for En and De, which differs from the paper. With this data, my zero-shot results for En and De are 76.6 and 76.4, far below the 83.7 and 79.1 reported in the paper. When I use all the Flickr30K captions as En, the En result is 83.4, but De is still poor. According to the paper, there should be 150K De sentences, so how can I get the full De data? The paper says that "Multi30K contains 31,783 images and provides five captions per image in English and German and one caption per image in French and Czech". Can you share the full Multi30K data or a URL for it?

shiyanlou-015555 commented 2 years ago
Source                 En    De    Fr   Cs
multi30k on GitHub     30K   30K   30K  30K
Multi30K in the paper  150K  150K  30K  30K

Can you give me the full Multi30K data you used? I promise it will be used only for research purposes.

zengyan-97 commented 2 years ago

Hi, did you notice that for En & De each image has 5 captions, while for Fr & Cs each image has only 1 caption?

shiyanlou-015555 commented 2 years ago

Yes, but the data at "https://github.com/multi30k/dataset" doesn't seem to match that. Where can I download the full Multi30K from, or can you provide a copy? I promise it will be used only for research purposes.

zengyan-97 commented 2 years ago

Hi,

I checked, and I downloaded the dataset from the same link. You can get the correct Multi30K dataset after some simple preprocessing.
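
The thread does not spell out what this preprocessing is. One plausible reading, sketched below in Python, is that the download provides several caption files per language, each with one caption per line and aligned line by line with a shared image list, and that these need to be merged into five captions per image. That layout is an assumption, and the file names and the helpers `read_lines` and `merge_caption_files` are hypothetical, not part of the CCLM code or the Multi30K repository.

```python
import gzip
from pathlib import Path


def read_lines(path):
    """Read one entry per line from a plain-text or gzip-compressed file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return [line.strip() for line in f]


def merge_caption_files(image_list_path, caption_paths):
    """Merge line-aligned caption files into {image_id: [caption, ...]}.

    Assumes line i of every caption file describes the image named on
    line i of the image list (an assumption about the data layout).
    """
    image_ids = read_lines(image_list_path)
    merged = {image_id: [] for image_id in image_ids}
    for caption_path in caption_paths:
        captions = read_lines(caption_path)
        if len(captions) != len(image_ids):
            raise ValueError(
                f"{caption_path}: {len(captions)} captions "
                f"for {len(image_ids)} images"
            )
        for image_id, caption in zip(image_ids, captions):
            merged[image_id].append(caption)
    return merged
```

With five German caption files (for example hypothetical paths such as `train.1.de` through `train.5.de`), the result would hold five captions per image, which matches the roughly 150K De sentences mentioned above. Whether this is the step the author had in mind is not confirmed in the thread.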

shiyanlou-015555 commented 2 years ago

Thank you, I should get it working soon.