About the raw token lens

srymaker commented 2 months ago

Thanks for your great work! I'd like to know how you compute the raw token length, such as the 729 in the image.

yaolinli commented 2 months ago

Hi, you can find the raw vision token length in two ways: 1) Print the output shape of the visual features from the ViT in an MLLM, which has the form (batch size, number of visual tokens, token dimension); 2) Calculate it directly from the input image resolution (e.g., 384px) and the ViT patch size (e.g., 14x14). For example, with a 384px input image and the openai/clip-vit-large-patch14 model, where 'patch14' means the image is divided into 14x14-pixel patches, there are 384//14 = 27 patches per side, so the visual token length is (384//14)**2 = 27**2 = 729.
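The second method can be sketched in a few lines of Python. This is only an illustration of the arithmetic above, not code from the repo; the function name `raw_vision_token_count` is hypothetical, and it assumes a square input image, square patches, and no extra tokens.

```python
def raw_vision_token_count(image_size: int, patch_size: int) -> int:
    """Count patch tokens for a square image split into square patches.

    Hypothetical helper for illustration: assumes a square image and
    counts patch tokens only (no class token or other extra tokens).
    """
    # Floor division: a partial patch at the border does not count.
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


# The example from this thread: 384px input, 14x14 patches.
print(raw_vision_token_count(384, 14))  # 27 * 27 = 729
```

Some ViT implementations also prepend a class token, so the token dimension printed with the first method may be one larger than this count.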

srymaker commented 2 months ago

Thank you for your patience.
