Hi, thank you for your impressive work.
I have a small question about the results in Figure 1 of the original paper, where you conduct image classification and region classification using 'Image Crop' and 'Dense Feature'.
As far as I understand, for 'Image Crop' you input the cropped image, take the output features, and 'global pool' them (instead of using the CLS token) to get the final feature for zero-shot image classification against the CLIP text encoder.
For 'Dense Feature', you input the whole image, crop the output patch-token features to the region, 'global pool' them, and conduct zero-shot classification in the same way.
Is my understanding correct, or am I missing or misinterpreting something?
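To be concrete, here is a minimal sketch of the two pipelines as I understand them. `encode_patch_tokens`, the feature dimension, and the 16-pixel patch size are all placeholders of mine, not your actual code:

```python
import numpy as np

# Hypothetical stand-in for a CLIP ViT backbone: maps an image to a
# grid of patch-token features of shape (H/patch, W/patch, D).
# Shapes and names here are my assumptions, not the paper's code.
def encode_patch_tokens(image, patch=16, dim=4):
    h, w = image.shape[0] // patch, image.shape[1] // patch
    rng = np.random.default_rng(0)
    return rng.normal(size=(h, w, dim))

def image_crop_feature(image, box, patch=16):
    """'Image Crop': crop the pixels first, run the encoder on the
    crop, then global-pool the resulting patch tokens."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    tokens = encode_patch_tokens(crop, patch)
    return tokens.reshape(-1, tokens.shape[-1]).mean(axis=0)

def dense_feature(image, box, patch=16):
    """'Dense Feature': run the encoder on the whole image, crop the
    patch-token grid to the box, then global-pool those tokens."""
    tokens = encode_patch_tokens(image, patch)
    x0, y0, x1, y1 = (v // patch for v in box)
    region = tokens[y0:y1, x0:x1]
    return region.reshape(-1, region.shape[-1]).mean(axis=0)

# Either pooled feature would then be compared (cosine similarity)
# against CLIP text embeddings for zero-shot classification.
img = np.zeros((64, 64, 3))
f_crop = image_crop_feature(img, (0, 0, 32, 32))
f_dense = dense_feature(img, (0, 0, 32, 32))
assert f_crop.shape == f_dense.shape == (4,)
```

Is this the computation behind the two curves in Figure 1?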
Thank you.