wusize / CLIPSelf

[ICLR2024 Spotlight] Code Release of CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
https://arxiv.org/abs/2310.01403

Would you share openai ViT-L weights trained by CLIPSelf? #25

Open jw00oo1 opened 1 month ago

jw00oo1 commented 1 month ago

Hello, thank you for your great work.

I have a couple of specific questions regarding the experiment settings and results:

  1. Are the training settings for the OpenAI model identical to those in the bash files for the EVA model in the scripts directory?
  2. I trained the OpenAI model with the same bash-file settings as the ViT-L model, but the results are consistently lower than those reported in the paper. Could you share the checkpoint for the OpenAI ViT-L model?

Thank you!

wusize commented 1 month ago

Hi, please do not use the exact EVA settings for the OpenAI ViT-L. As I recall, training the OpenAI ViT-L was very memory-intensive and slow, so I set the image size to 672 and only unfroze the last 6 or 12 layers. I will re-run this experiment myself and get back to you soon.
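
For reference, a minimal sketch of the partial-unfreezing setup described above, assuming the OpenAI ViT-L/14 weights are loaded through open_clip (which this repository builds on). The choice to keep `ln_post` and the projection trainable, and the variable names, are my assumptions rather than the exact training script; the 672 input size would be set separately in the data/script arguments.

```python
import open_clip

# Load the OpenAI ViT-L/14 checkpoint via open_clip (assumed setup, not the
# repository's actual training entry point).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)

N_UNFROZEN = 6  # or 12, per the comment above

# Freeze the entire vision tower first.
for p in model.visual.parameters():
    p.requires_grad = False

# Unfreeze only the last N residual attention blocks. Attribute names follow
# open_clip's VisionTransformer and may differ across versions.
for block in model.visual.transformer.resblocks[-N_UNFROZEN:]:
    for p in block.parameters():
        p.requires_grad = True

# Assumption: keep the final LayerNorm and projection trainable as well.
for p in model.visual.ln_post.parameters():
    p.requires_grad = True
if model.visual.proj is not None:
    model.visual.proj.requires_grad = True

trainable = sum(p.numel() for p in model.visual.parameters() if p.requires_grad)
print(f"Trainable image-encoder parameters: {trainable / 1e6:.1f}M")
```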

jw00oo1 commented 1 month ago

Thank you for your kind response! I look forward to you sharing the checkpoint.