xmed-lab / CLIP_Surgery

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks
367 stars, 26 forks

Question related to different image resolution from 224x224 #21

Closed Reasat closed 1 year ago

Reasat commented 1 year ago

Hi, thanks for the awesome repo.

I have a question about how CLIP processes images whose resolution differs from 224x224, specifically the high-resolution images in the demo. When I load the vit-b-16 model and print the shape of model.visual.positional_embedding, it is 197x768. Then, when I encode a high-res image (512x512) using model.visual, I notice that the number of positional embeddings automatically grows beyond 197 to match the number of image tokens. Can you tell me where in the code the positional embedding is resized dynamically based on the input image?
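For context, the 197 is consistent with a 14x14 patch grid plus one class token, assuming the 16-pixel patch size that the vit-b-16 name suggests. A minimal sketch of that arithmetic follows; the function name `num_tokens` is hypothetical and not part of the repo:

```python
def num_tokens(image_size: int, patch_size: int = 16) -> int:
    """Token count for a square image: one class token plus the patch grid.

    Assumes a ViT with non-overlapping square patches and a single class token.
    """
    if image_size % patch_size != 0:
        raise ValueError("this sketch expects image_size to be a multiple of patch_size")
    grid = image_size // patch_size  # patches per side
    return 1 + grid * grid


print(num_tokens(224))  # 197
print(num_tokens(512))  # 1025
```

Under these assumptions, a 512x512 input would need 1025 positional embeddings rather than 197.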

Eli-YiLi commented 1 year ago

Hi,

Thanks for your interest. I update the positional embedding via bilinear interpolation in the forward functions in clip_surgery_model.py, as below:

[Image: code screenshot]
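Since the screenshot is not reproduced here, the following is a rough sketch of the general technique, not the repo's actual code. It assumes PyTorch, a positional embedding of shape (1 + g*g, dim) with the class token first, and `align_corners=False`; the function name `resize_pos_embed` is hypothetical.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize a (1 + g*g, dim) positional embedding to (1 + new_grid**2, dim)."""
    # Keep the class-token embedding as-is; only the patch grid is interpolated.
    cls_pos, patch_pos = pos_embed[:1], pos_embed[1:]
    old_grid = int(round(patch_pos.shape[0] ** 0.5))
    if old_grid * old_grid != patch_pos.shape[0]:
        raise ValueError("expected a square patch grid")
    dim = patch_pos.shape[1]

    # (g*g, dim) -> (1, dim, g, g), the layout F.interpolate expects.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(
        patch_pos, size=(new_grid, new_grid), mode="bilinear", align_corners=False
    )
    # (1, dim, new_grid, new_grid) -> (new_grid*new_grid, dim)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=0)


# Stand-in for model.visual.positional_embedding (random values, illustration only).
pos = torch.randn(197, 768)
resized = resize_pos_embed(pos, new_grid=512 // 16)
print(resized.shape)  # torch.Size([1025, 768])
```

Interpolating only the patch part and concatenating the class-token embedding back keeps the result aligned with the token sequence for the larger input.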