wusize / CLIPSelf

[ICLR2024 Spotlight] Code Release of CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
https://arxiv.org/abs/2310.01403

question about the results in the paper #14

Closed jayaylee2 closed 6 months ago

jayaylee2 commented 6 months ago

Hi, thank you for your impressive work.

I have a small question about the results in Figure 1 of the paper, where you compare image classification and region classification using 'Image Crop' and 'Dense Feature'.

As far as I understand, for 'Image Crop' you input the cropped image, take the output features, and 'global pool' them (instead of using the cls token) to get the final feature for zero-shot image classification with the CLIP text encoder.

And for 'Dense Feature', you input the whole image, crop the output patch-token features, 'global pool' them, and conduct zero-shot classification in the same way as above.

Is my understanding correct, or is there something I'm missing or misinterpreting?

Thank you.

wusize commented 6 months ago

Hi, the “Dense Feature” part is correct. For the “Image Crop” part, we still use the cls token as the global image representation.
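
For reference, here is a rough sketch (not the released CLIPSelf code) of the two zero-shot pipelines as clarified above. The 'Image Crop' path uses open_clip's public API (`create_model_and_transforms`, `encode_image`, `encode_text`); for the 'Dense Feature' path, `patch_tokens` is assumed to be given as a `[grid_h, grid_w, D]` tensor of per-patch features of the whole image, already projected into CLIP's joint image-text embedding space (how to extract them is model/version specific and left out here). Box-to-grid mapping and the prompt template are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()


@torch.no_grad()
def text_feats(class_names):
    # Class text embeddings from the CLIP text encoder.
    prompts = tokenizer([f"a photo of a {c}" for c in class_names])
    return F.normalize(model.encode_text(prompts), dim=-1)


@torch.no_grad()
def image_crop_logits(pil_image, box, class_feats):
    # "Image Crop": crop the image first, run the crop through the ViT,
    # and use the cls-token output (what encode_image returns) as the feature.
    crop = preprocess(pil_image.crop(box)).unsqueeze(0)   # box = (x0, y0, x1, y1) in pixels
    feat = F.normalize(model.encode_image(crop), dim=-1)
    return feat @ class_feats.T                           # zero-shot class scores


@torch.no_grad()
def dense_feature_logits(patch_tokens, box, image_size, class_feats):
    # "Dense Feature": the whole image was encoded once; crop the patch tokens
    # that fall inside the box and global-pool them into one region feature.
    grid_h, grid_w, dim = patch_tokens.shape
    w, h = image_size
    x0, y0, x1, y1 = box
    # Map the pixel box to patch-grid coordinates (illustrative, no boundary handling).
    gx0, gy0 = int(x0 / w * grid_w), int(y0 / h * grid_h)
    gx1 = max(gx0 + 1, round(x1 / w * grid_w))
    gy1 = max(gy0 + 1, round(y1 / h * grid_h))
    region = patch_tokens[gy0:gy1, gx0:gx1].reshape(-1, dim)
    feat = F.normalize(region.mean(dim=0, keepdim=True), dim=-1)  # global pool over the region
    return feat @ class_feats.T
```

If I read Figure 1 correctly, this is exactly the comparison it makes: cls-token features of image crops versus pooled dense features of a vanilla CLIP ViT, the gap that CLIPSelf's self-distillation is meant to close.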

jayaylee2 commented 6 months ago

Thank you for your quick response! :)