Could you explain the details of data selection for the large-scale datasets? #1

zengyan-97 / X2-VLM

All-In-One VLM: Image + Video + Transfer to Other Languages / Domains (TPAMI 2023)

BSD 3-Clause "New" or "Revised" License

Thanks for your reminder. This part will be added to the updated paper.

In fact, we didn't do any preprocessing. We only filtered the data to speed up pre-training.

For LAION, we used English data only. Following BLIP, we removed an image if its shorter edge is smaller than 224 pixels. We also removed an image if its aspect ratio, either (height/width) or (width/height), is larger than 3.
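As a rough illustration of the two image rules above (not the actual pre-training code), the filter could be sketched as below. The function name `keep_image` and its parameter names are hypothetical, and the English-only language filter is not shown because the reply does not say how it was applied.

```python
def keep_image(width: int, height: int,
               min_short_edge: int = 224,
               max_aspect_ratio: float = 3.0) -> bool:
    """Return True if an image passes both size rules described above.

    `keep_image` and its parameters are illustrative names, not taken
    from the X2-VLM code base.
    """
    if width <= 0 or height <= 0:
        # Treat unreadable or degenerate sizes as filtered out (an assumption).
        return False

    short_edge = min(width, height)
    long_edge = max(width, height)

    # Rule 1: drop the image if the shorter edge is smaller than 224 pixels.
    if short_edge < min_short_edge:
        return False

    # Rule 2: drop the image if height/width or width/height is larger than 3.
    # Checking long/short covers both orientations at once.
    if long_edge / short_edge > max_aspect_ratio:
        return False

    return True
```

With an image library such as Pillow, the width and height would typically come from `Image.open(path).size`.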

For video clip-text pairs, we removed a pair if the text has fewer than 2 words. Following previous work (I don't remember which one…I need to check it later), we used the CLIP score to filter data. We sampled one frame from each video clip and calculated the CLIP score between that frame and the text. We removed a video clip-text pair if the score is less than 0.25.
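A minimal sketch of the video clip-text filter, under stated assumptions: the CLIP score is taken to be the cosine similarity between the CLIP embedding of the sampled frame and the CLIP embedding of the text (the reply does not specify the exact definition or scaling), words are counted by whitespace splitting, and the embeddings are assumed to be computed elsewhere. The names `cosine_similarity` and `keep_video_text_pair` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat the score as 0 (an assumption).
        return 0.0
    return dot / (norm_a * norm_b)


def keep_video_text_pair(text: str,
                         frame_embedding: Sequence[float],
                         text_embedding: Sequence[float],
                         min_words: int = 2,
                         min_clip_score: float = 0.25) -> bool:
    """Return True if a video clip-text pair passes both rules described above.

    The embeddings are assumed to come from a CLIP image encoder (applied to
    one sampled frame) and a CLIP text encoder; this sketch does not load CLIP.
    """
    # Rule 1: drop the pair if the text has fewer than 2 words.
    if len(text.split()) < min_words:
        return False

    # Rule 2: drop the pair if the frame-text score is less than 0.25.
    score = cosine_similarity(frame_embedding, text_embedding)
    return score >= min_clip_score
```

Keeping the score computation separate from the threshold check makes it easy to swap in a different score definition if the original filtering used one.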
