yitu-opensource / T2T-ViT

ICCV2021, Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

position embedding #45

Closed jingliang95 closed 3 years ago

jingliang95 commented 3 years ago

Hi authors. I think the sentence "we concatenate a class token to it and then add Sinusoidal Position Embedding (PE) to it, the same as ViT to do classification" in your paper is confusing. In ViT, the position embedding is learnable, while your method fixes it as a sinusoidal embedding (please correct me if I am wrong). So the phrase "the same as ViT to do classification" is confusing: I think you mean that adding a class token and a position embedding is similar to what ViT does. Maybe you can clarify this in your paper.
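For context, my understanding of that step is roughly the following sketch in PyTorch. All shapes here are hypothetical placeholders, not the paper's actual settings:

```python
import torch

# Hypothetical shapes for illustration only (not the paper's settings).
B, N, D = 2, 196, 384
tokens = torch.randn(B, N, D)          # output tokens of the T2T module

cls_token = torch.zeros(1, 1, D)       # a learnable nn.Parameter in a real model
pos_embed = torch.randn(1, N + 1, D)   # stand-in for the fixed sinusoidal table

x = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1)  # (B, N+1, D)
x = x + pos_embed                      # then fed to the transformer backbone
```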

Regarding the position embedding, when fine-tuning at a different image size (512*512), does simply changing the length of the position embedding work? If you change its length, the position embedding will be totally different from the one used in pretraining, since you can no longer load the position embedding from the pretrained model. Am I correct? Thanks in advance.

yuanli2333 commented 3 years ago

Hi. About "we concatenate a class token to it and then add Sinusoidal Position Embedding (PE) to it, the same as ViT to do classification": the original ViT considered several different position embeddings: 1. sinusoidal position embedding (PE); 2. 1D or 2D learned parameters; 3. relative PE. They tried all of these in the original paper, which you can check.
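For reference, the fixed sinusoidal PE (option 1) follows the formula from "Attention Is All You Need" and can be generated like this; a minimal sketch, not necessarily the exact code in this repo:

```python
import math
import torch

def sinusoidal_pe(n_tokens: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal position embedding (dim assumed even)."""
    assert dim % 2 == 0
    pe = torch.zeros(n_tokens, dim)
    position = torch.arange(n_tokens, dtype=torch.float).unsqueeze(1)   # (n_tokens, 1)
    div_term = torch.exp(torch.arange(0, dim, 2).float()
                         * (-math.log(10000.0) / dim))                  # (dim/2,)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dims: cosine
    return pe  # (n_tokens, dim)
```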

About the position embedding size, we will release our code for interpolating the pretrained position embedding very soon.
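Until then, a common approach (used, e.g., in DeiT and timm) is to bicubically interpolate the 2D grid of patch position embeddings while keeping the class-token embedding unchanged. The sketch below illustrates the idea; the function name and details are assumptions, not necessarily the code we will release:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, 1 + H*W, D), with the class-token embedding first."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pe.shape[1] ** 0.5)
    d = patch_pe.shape[2]
    # Restore the 2D token grid, interpolate it, then flatten back.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode='bicubic', align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pe, patch_pe], dim=1)
```

For example, fine-tuning at 512*512 instead of 224*224 with the same overall downsampling ratio turns a 14*14 token grid into a 32*32 one, so you would call `resize_pos_embed(pos_embed, 32)` before loading the checkpoint.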