Open srymaker opened 2 months ago
Hi, you can find the raw vision token length in two ways:
1) Print the output shape of the visual features from the ViT in an MLLM, which looks like (batch size, number of visual tokens, token dimension); the second entry is the token length;
2) Calculate it directly from the input image resolution (e.g., 384px) and the ViT patch size (e.g., 14x14). For example, with a 384px input image and the openai/clip-vit-large-patch14 model, where 'patch14' means the image is divided into 14x14 pixel patches, each side holds 384//14 = 27 patches, so the visual token length is (384//14)**2 = 27**2 = 729.
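The arithmetic in option 2 can be sketched in a few lines of Python. This is only an illustration: the function name `raw_vision_token_count` is hypothetical (it is not from the repository), and it assumes a square input image and counts patch tokens only, without any extra class token that some ViTs prepend.

```python
def raw_vision_token_count(image_size: int, patch_size: int) -> int:
    """Count patch tokens for a square image (hypothetical helper, not from the repo).

    Assumes a square input and ignores any extra class token.
    """
    # Floor division: pixels that do not fill a whole patch are dropped.
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


if __name__ == "__main__":
    # 384px input with 14x14 patches: 27 patches per side.
    print(raw_vision_token_count(384, 14))  # 729
```

If the number printed by option 1 is one larger than this result, the likely cause is a class token included in the ViT output.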
Thank you for your patience.
Thanks for your great work! I'd like to know how you compute the raw token lengths, such as the 729 in the image.