wusize / CLIPSelf

[ICLR2024 Spotlight] Code Release of CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
https://arxiv.org/abs/2310.01403

CAT-Seg's training setting #4

Open BuRr-Lee opened 7 months ago

BuRr-Lee commented 7 months ago

I want to know how you train the CAT-Seg model after replacing its CLIP with EVA (trained using the CLIPSelf method). In particular, do you freeze the Swin Transformer backbone? Do you finetune only the attention layers in CLIP?

wusize commented 7 months ago

Hi, we use the model from OpenAI for the results on Cat-Seg. The training of Cat-Seg remains unchanged, where the attention layers of CLIP are finetuned.
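For readers unfamiliar with that recipe, here is a minimal sketch of what "only the attention layers of CLIP are finetuned" can look like in code, assuming an open_clip ViT-B/16 image encoder. The learning rate, weight decay, and the `'attn'` name-matching rule are illustrative assumptions, not CAT-Seg's actual config.

```python
import torch
import open_clip

# Load an OpenAI-pretrained CLIP via open_clip (the library CLIPSelf builds on).
clip_model, _, _ = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='openai')

# Freeze everything in the image encoder except the multi-head attention
# weights; in open_clip's ViT, those parameter names contain "attn"
# (e.g. transformer.resblocks.0.attn.in_proj_weight).
for name, param in clip_model.visual.named_parameters():
    param.requires_grad = 'attn' in name

# Optimize only the trainable (attention) parameters; hyperparameters here
# are placeholders, not the values used by CAT-Seg.
trainable = [p for p in clip_model.visual.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-6, weight_decay=1e-4)
```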

BuRr-Lee commented 7 months ago

You mean you first train the original OpenAI-pretrained CLIP model (instead of EVA-CLIP) with the CLIPSelf method, then apply it to CAT-Seg?

wusize commented 7 months ago

> You mean you first train the original OpenAI-pretrained CLIP model (instead of EVA-CLIP) with the CLIPSelf method, then apply it to CAT-Seg?

Yeah. We need to make sure the comparison with Cat-Seg is under a fair setting.
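A minimal sketch of that two-stage workflow, assuming the CLIPSelf-refined OpenAI CLIP is saved as a standard checkpoint that open_clip can reload; the file path and state-dict layout below are assumptions, not the repo's released files.

```python
import torch
import open_clip

# Stage 1 output: a CLIPSelf-refined checkpoint of an OpenAI-initialized CLIP.
# Path and checkpoint layout are assumptions for illustration only.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='openai')
ckpt = torch.load('checkpoints/clipself_openai_vitb16.pt', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)  # handle raw or wrapped checkpoints
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f'missing keys: {len(missing)}, unexpected keys: {len(unexpected)}')

# Stage 2: hand model.visual to CAT-Seg in place of its default CLIP image
# encoder and run CAT-Seg's training recipe unchanged.
```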