Hi, thanks for the awesome repo.
I had a question about how CLIP processes images with a size different from 224x224, specifically the high-resolution images in the demo. When I load the vit-b-16 model and print the shape of model.visual.positional_embedding, it is 197x768. But when I encode a high-res image (512x512) with model.visual, I notice that the positional embedding's shape automatically grows beyond 197 to match the number of image tokens. Can you tell me where in the code the positional embedding is resized dynamically to fit the input image?
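For context on what I expected to find: many ViT-based repos handle non-224x224 inputs by splitting off the class-token embedding and bicubically interpolating the grid portion of the positional embedding to the new patch grid. A minimal sketch of that common pattern is below (the function name `interpolate_pos_embed` is my own, and I'm not assuming this is exactly how this repo does it):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize a ViT positional embedding of shape (1 + N, D) to a new grid size.

    The first row is the class-token embedding and is kept as-is; the remaining
    N = old_grid**2 patch embeddings are reshaped to a 2D grid and resampled.
    """
    cls_embed, patch_embed = pos_embed[:1], pos_embed[1:]
    dim = pos_embed.shape[-1]
    old_grid = int(patch_embed.shape[0] ** 0.5)  # e.g. 14, since 197 = 1 + 14*14

    # (N, D) -> (1, D, old_grid, old_grid) so F.interpolate can resample it
    patch_embed = patch_embed.reshape(old_grid, old_grid, dim).permute(2, 0, 1).unsqueeze(0)
    patch_embed = F.interpolate(
        patch_embed, size=(new_grid, new_grid), mode="bicubic", align_corners=False
    )
    # back to (new_grid**2, D) and re-attach the class-token embedding
    patch_embed = patch_embed.squeeze(0).permute(1, 2, 0).reshape(-1, dim)
    return torch.cat([cls_embed, patch_embed], dim=0)

# A 512x512 image with 16x16 patches gives a 32x32 grid: 1 + 32*32 = 1025 tokens.
new_pos = interpolate_pos_embed(torch.randn(197, 768), new_grid=32)
print(new_pos.shape)  # torch.Size([1025, 768])
```

If the repo does something along these lines, I'd just like to know where that resizing happens (at checkpoint-load time or inside the forward pass).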