zengyan-97 / X2-VLM

All-In-One VLM: Image + Video + Transfer to Other Languages / Domains (TPAMI 2023)
BSD 3-Clause "New" or "Revised" License

Could you explain the details of data selection for large scale dataset? #1

Closed SCZwangxiao closed 1 year ago

SCZwangxiao commented 1 year ago

Excellent work! I have a question about data selection.

In the dataset section, you adopted data preprocessing and filtering to speed up training.

What is the preprocessing and filtering strategy? Since pre-trained models generally follow data scaling laws, I think it would make a big difference to the results.

zengyan-97 commented 1 year ago

Thanks for your reminder. This part will be added to the updated paper.

In fact, we didn’t do any preprocessing. We only did filtering to speed up pre-training.

For LAION, we used English data only. Following BLIP, we removed an image if its shorter edge was smaller than 224 pixels. We also removed an image if (height/width) or (width/height) was larger than 3.
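For readers who want to reproduce the image-size rules, here is a minimal sketch of the two checks described above. It is not code from the X2-VLM repository: the function name `keep_image` and its parameter names are made up for illustration, and it assumes the width and height are already known (for example from dataset metadata). The English-only step is left out because the thread does not say how language was detected.

```python
def keep_image(
    width: int,
    height: int,
    min_short_edge: int = 224,      # threshold stated in the thread
    max_aspect_ratio: float = 3.0,  # threshold stated in the thread
) -> bool:
    """Return True if an image passes the size and aspect-ratio filters.

    Hypothetical helper: the name and signature are assumptions,
    only the two thresholds come from the thread.
    """
    if width <= 0 or height <= 0:
        # Treat missing or corrupt size metadata as a rejection (assumption).
        return False

    short_edge = min(width, height)
    long_edge = max(width, height)

    # Remove the image if its shorter edge is smaller than 224 pixels.
    if short_edge < min_short_edge:
        return False

    # Remove the image if height/width or width/height is larger than 3.
    # Comparing long/short covers both directions at once.
    if long_edge / short_edge > max_aspect_ratio:
        return False

    return True
```

Note that an aspect ratio of exactly 3 is kept here, since the rule removes images only when the ratio is "larger than 3".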

For video clip-text pairs, we removed a pair if the text had fewer than 2 words. Following previous work (I don’t remember which one…I need to check it later), we used the CLIP score to filter data. We sampled a frame from each video clip and calculated the CLIP score between the frame and the text. We removed a video clip-text pair if the score was less than 0.25.
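The video clip-text rule can be sketched in the same way. This is only an illustration under several assumptions that the thread does not confirm: that "number of words" means whitespace-separated tokens, that the CLIP score is the plain cosine similarity between the frame embedding and the text embedding, and that those embeddings were already computed by some CLIP model (which model, and how the frame was sampled, are not stated). The names `cosine_similarity` and `keep_video_text_pair` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_video_text_pair(
    text: str,
    frame_embedding: Sequence[float],  # assumed: CLIP embedding of one sampled frame
    text_embedding: Sequence[float],   # assumed: CLIP embedding of the text
    min_words: int = 2,                # threshold stated in the thread
    min_clip_score: float = 0.25,      # threshold stated in the thread
) -> bool:
    """Return True if a video clip-text pair passes both filters.

    Hypothetical helper: only the two thresholds come from the thread.
    """
    # Remove the pair if the text has fewer than 2 words
    # (whitespace tokenization is an assumption).
    if len(text.split()) < min_words:
        return False

    # Remove the pair if the frame-text score is less than 0.25
    # (cosine similarity as the CLIP score is an assumption).
    score = cosine_similarity(frame_embedding, text_embedding)
    return score >= min_clip_score
```

Checking the word count first avoids computing a score for pairs that would be dropped anyway, which matters when the embeddings are computed on the fly rather than cached.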