yaolinli / DeCo

BSD 3-Clause "New" or "Revised" License

About the raw token lens #1

Open srymaker opened 2 months ago

srymaker commented 2 months ago
[Image: Screenshot 2024-08-23 11 46 49]

Thanks for your great work! I'd like to know how you compute the raw token length, e.g. the 729 in the image.

yaolinli commented 2 months ago

Hi, you can find the raw vision token length in two ways: 1) Print the output shape of the visual features from the ViT in an MLLM, which looks like (batch size, number of visual tokens, token dimension); the second entry is the token length. 2) Calculate it directly from the input image resolution (e.g., 384px) and the ViT patch size (e.g., 14x14). For example, with a 384px input image and the openai/clip-vit-large-patch14 model, where 'patch14' means the image is divided into 14x14-pixel patches, each side holds 384//14 = 27 patches, so the visual token length is (384//14)**2 = 27**2 = 729.
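Both ways described above can be sketched in a few lines of plain Python. This is only an illustration, not code from the DeCo repo: the function names `raw_vision_token_length` and `token_length_from_shape` are hypothetical. It assumes a square input image and counts patch tokens only (no extra class token).

```python
def raw_vision_token_length(image_size: int, patch_size: int) -> int:
    """Way 2: patch tokens for a square image (assumption: no class token counted)."""
    # Integer division drops any pixels that do not fill a whole patch.
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


def token_length_from_shape(feature_shape: tuple) -> int:
    """Way 1: read the token count from a feature shape laid out as
    (batch size, number of visual tokens, token dimension)."""
    return feature_shape[1]


if __name__ == "__main__":
    # 384 // 14 = 27 patches per side, and 27 ** 2 = 729 tokens.
    print(raw_vision_token_length(384, 14))
    # Hypothetical feature shape; the token dimension 1024 is only a placeholder.
    print(token_length_from_shape((1, 729, 1024)))
```

Both calls print 729 for the 384px / patch 14 case from the comment above.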

srymaker commented 2 months ago

Thank you for your patience.