Questions about the released dataset

zyf0619sjtu / DreamLIP

[ECCV 2024] Official PyTorch implementation of DreamLIP: Language-Image Pre-training with Long Captions

https://zyf0619sjtu.github.io/dream-lip/


Questions about the released dataset #3

Closed by Espere-1119-Song 1 month ago

Espere-1119-Song commented 6 months ago

Hi, thanks for contributing this great dataset. When I download CC3M, the indices of my images seem to differ from yours.

For example, for '0000000/0000008.jpg', my caption is "# of the sports team skates against sports team during their game .", while your 'raw_caption' is "modern luxe has a very simple look , and offers a bold monogram of the couple 's initials .". I downloaded CC3M via img2dataset.

I think providing another file that maps each 'image_path' to its respective 'url' would be a potential solution. Could you provide it? Thanks!!

zkcys001 commented 4 months ago

It is a good question.

I have provided a JSON file on Google Drive. It is a dictionary in which 'image_path' is the key and 'url' is the value:

https://drive.google.com/file/d/1iRaYLxrW_pHODzMpvIvjSNfpsM_9OGiL/view?usp=sharing
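As a minimal sketch of how such a path-to-URL dictionary could be used to reconcile a local download with the released indices: the function names below are hypothetical, and the sketch assumes that you can build your own `{local_path: url}` dictionary from whatever your download tool recorded. It only relies on the layout described above ('image_path' as key, 'url' as value).

```python
import json


def load_path_to_url(json_file):
    """Load the mapping; assumed layout: {"image_path": "url", ...}."""
    with open(json_file, "r", encoding="utf-8") as f:
        return json.load(f)


def invert_mapping(path_to_url):
    """Build url -> list of paths, since one URL may occur more than once."""
    url_to_paths = {}
    for path, url in path_to_url.items():
        url_to_paths.setdefault(url, []).append(path)
    return url_to_paths


def remap_local_paths(local_path_to_url, released_path_to_url):
    """Map each local path to a released path that shares the same URL.

    Local images whose URL is absent from the released file are skipped.
    If a URL has several released paths, the first one is used.
    """
    released = invert_mapping(released_path_to_url)
    remapped = {}
    for local_path, url in local_path_to_url.items():
        matches = released.get(url)
        if matches:
            remapped[local_path] = matches[0]
    return remapped


# Toy data with placeholder URLs (not taken from the real dataset).
local = {
    "0000000/0000008.jpg": "http://example.com/a.jpg",
    "0000000/0000009.jpg": "http://example.com/b.jpg",
}
released = {"0000000/0000003.jpg": "http://example.com/a.jpg"}
print(remap_local_paths(local, released))
```

Matching on the URL rather than on the file index avoids depending on download order, which can differ between runs when some images fail to download.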

Moreover, we will release CC12M this week~

Best, Kecheng

kim-sanghwan commented 3 months ago

Thank you for your great work. I tried to use your Google Drive link, but it shows me the result below. Can you share a valid link for "cc3m_path2url.json" again?

zkcys001 commented 3 months ago

We have updated the CC3M and CC12M files as follows: three types of short captions have been added, and we have replaced 'path' with the image 'url' in these CSV files.

CC3M: https://drive.google.com/file/d/1RPcFS8jrVolA9RzHXD581E8BxR7jYDap/view?usp=sharing

CC12M: https://drive.google.com/file/d/12iUhceznPNWd-l_bGSF5rSnzdruP4Jtr/view?usp=sharing
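With the CSV files keyed by URL, captions can be attached to locally downloaded images without relying on file indices. The following is a rough sketch only: the column names `url` and `raw_caption` are assumptions based on the names quoted in this thread and should be checked against the actual CSV header, and the helper names are hypothetical. It also assumes you have a `{local_path: url}` dictionary for your own download.

```python
import csv
import io


def captions_by_url(csv_file, url_col="url", caption_col="raw_caption"):
    """Read an open CSV file and return {url: caption}.

    Column names are assumptions; adjust them to the real header.
    """
    reader = csv.DictReader(csv_file)
    return {row[url_col]: row[caption_col] for row in reader}


def attach_captions(local_url_by_path, caption_by_url):
    """Return ({path: caption}, [paths whose URL has no caption])."""
    matched, missing = {}, []
    for path, url in local_url_by_path.items():
        if url in caption_by_url:
            matched[path] = caption_by_url[url]
        else:
            missing.append(path)
    return matched, missing


# Toy CSV content with placeholder URLs and captions.
toy_csv = io.StringIO(
    "url,raw_caption\n"
    "http://example.com/a.jpg,a toy caption\n"
)
local = {
    "0000000/0000008.jpg": "http://example.com/a.jpg",
    "0000000/0000009.jpg": "http://example.com/missing.jpg",
}
matched, missing = attach_captions(local, captions_by_url(toy_csv))
print(matched)
print(missing)
```

For the real files, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) instead of the `io.StringIO` object.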