wusize / CLIPSelf

[ICLR2024 Spotlight] Code Release of CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
https://arxiv.org/abs/2310.01403

question about the results in the paper #14

Closed jayaylee2 closed 6 months ago

jayaylee2 commented 6 months ago

Hi, thank you for your impressive work.

I have a small question about the results in Figure 1 of the paper, where you compare image classification and region classification using 'Image Crop' and 'Dense Feature'.

As far as I understand, for 'Image Crop' you input the cropped image, take the output features, and 'global pool' them (instead of using the cls token) to get the final feature for zero-shot image classification with the CLIP text encoder.

And for 'Dense Feature', you input the whole image, crop the output patch-token features, 'global pool' them, and conduct zero-shot classification in the same way as above.

Is my understanding correct, or is there something I'm missing or misinterpreting?

Thank you.

wusize commented 6 months ago

Hi, the “Dense Feature” part is correct. For the “Image Crop” part, we still use the cls token as the global image representation.
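
For reference, here is a rough sketch (not the released CLIPSelf code) of the two zero-shot pipelines as clarified above. The 'Image Crop' path uses open_clip's public API (`create_model_and_transforms`, `encode_image`, `encode_text`); for the 'Dense Feature' path, `patch_tokens` is assumed to be given as a `[grid_h, grid_w, D]` tensor of per-patch features of the whole image, already projected into CLIP's joint image-text embedding space (how to extract them is model/version specific and left out here). Box-to-grid mapping and the prompt template are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()


@torch.no_grad()
def text_feats(class_names):
    # Class text embeddings from the CLIP text encoder.
    prompts = tokenizer([f"a photo of a {c}" for c in class_names])
    return F.normalize(model.encode_text(prompts), dim=-1)


@torch.no_grad()
def image_crop_logits(pil_image, box, class_feats):
    # "Image Crop": crop the image first, run the crop through the ViT,
    # and use the cls-token output (what encode_image returns) as the feature.
    crop = preprocess(pil_image.crop(box)).unsqueeze(0)   # box = (x0, y0, x1, y1) in pixels
    feat = F.normalize(model.encode_image(crop), dim=-1)
    return feat @ class_feats.T                           # zero-shot class scores


@torch.no_grad()
def dense_feature_logits(patch_tokens, box, image_size, class_feats):
    # "Dense Feature": the whole image was encoded once; crop the patch tokens
    # that fall inside the box and global-pool them into one region feature.
    grid_h, grid_w, dim = patch_tokens.shape
    w, h = image_size
    x0, y0, x1, y1 = box
    # Map the pixel box to patch-grid coordinates (illustrative, no boundary handling).
    gx0, gy0 = int(x0 / w * grid_w), int(y0 / h * grid_h)
    gx1 = max(gx0 + 1, round(x1 / w * grid_w))
    gy1 = max(gy0 + 1, round(y1 / h * grid_h))
    region = patch_tokens[gy0:gy1, gx0:gx1].reshape(-1, dim)
    feat = F.normalize(region.mean(dim=0, keepdim=True), dim=-1)  # global pool over the region
    return feat @ class_feats.T
```

If I read Figure 1 correctly, this is exactly the comparison it makes: cls-token features of image crops versus pooled dense features of a vanilla CLIP ViT, the gap that CLIPSelf's self-distillation is meant to close.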

jayaylee2 commented 6 months ago

Thank you for your quick response! :)