Closed vishaal27 closed 7 months ago
Hello~
Thanks for your attention. We will release the CC3M long captions generated by LLaVA-1.5, InstructBLIP, and ShareGPT4V this week. For the CC-12M and YFCC-15M datasets, we are still organizing the generated long captions and plan to release them next month (possibly around 2024/4/10)~
Yep, the Merged-30M dataset with long captions is simply a mixture of these three datasets (CC3M, CC-12M, and YFCC-15M), with both long and short captions.
For COYO-700M and LAION-400M, due to limited GPU resources, we only extracted long captions for laion20m and coyo4m via ShareGPT4V, and we would like to release them next month too.
Thanks~
Kecheng
We have released the CC3M long captions generated by LLaVA-1.5, InstructBLIP, and ShareGPT4V (CSV version) at https://drive.google.com/file/d/19jCNWvy7kA70u-ufQtEJvbKVMG2b8MnP/view?usp=drive_link
If you have any questions, please feel free to contact me~
Best,
Kecheng
Hey, it seems the Drive link is private. Could you please make it public? Thanks for releasing!
Hey, we have made this link public.
Awesome, thanks!
Hi @zkcys001,
Thank you for the great work. The Google Drive link you shared is still private.
Hi,
Thanks for the great work. The paper is a delight to read and the results look very compelling. I was wondering whether you were planning on releasing both the long and short generated captions for the CC-3M, CC-12M and YFCC-15M datasets? As I understand it, the Merged-30M dataset with long captions is simply a mixture of these three datasets with long and short captions? Furthermore, I noticed that you had both COYO-700M and LAION-400M in the pipeline. Are there plans to release the long captions for those too?