yitu-opensource / T2T-ViT

ICCV2021, Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

position embedding #45

Closed jingliang95 closed 3 years ago

jingliang95 commented 3 years ago

Hi authors. I think the sentence "we concatenate a class token to it and then add Sinusoidal Position Embedding (PE) to it, the same as ViT to do classification" in your paper is confusing. In ViT, the position embedding is learnable, while your method fixes it as a sinusoidal embedding (please correct me if I am wrong). So the phrase "the same as ViT to do classification" is confusing: I think you mean that adding a class token and a position embedding is similar to what ViT does. Maybe you can clarify this in your paper.
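For context, my understanding of that step is roughly the following sketch in PyTorch. All shapes here are hypothetical placeholders, not the paper's actual settings:

```python
import torch

# Hypothetical shapes for illustration only (not the paper's settings).
B, N, D = 2, 196, 384
tokens = torch.randn(B, N, D)          # output tokens of the T2T module

cls_token = torch.zeros(1, 1, D)       # a learnable nn.Parameter in a real model
pos_embed = torch.randn(1, N + 1, D)   # stand-in for the fixed sinusoidal table

x = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1)  # (B, N+1, D)
x = x + pos_embed                      # then fed to the transformer backbone
```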

Regarding the position embedding, when fine-tuning at a different image size (512*512), does simply changing the length of the position embedding work? If you change its length, the position embedding will be totally different from the one used in pretraining, since you can no longer load the position embedding from the pretrained model. Am I correct? Thanks in advance.

yuanli2333 commented 3 years ago

Hi. About "we concatenate a class token to it and then add Sinusoidal Position Embedding (PE) to it, the same as ViT to do classification": the original ViT considered several different position embeddings: 1. sinusoidal position embedding (PE); 2. 1D or 2D learned parameters; 3. relative PE. They tried all of these in the original paper, which you can check.
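For reference, the fixed sinusoidal PE (option 1) follows the formula from "Attention Is All You Need" and can be generated like this; a minimal sketch, not necessarily the exact code in this repo:

```python
import math
import torch

def sinusoidal_pe(n_tokens: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal position embedding (dim assumed even)."""
    assert dim % 2 == 0
    pe = torch.zeros(n_tokens, dim)
    position = torch.arange(n_tokens, dtype=torch.float).unsqueeze(1)   # (n_tokens, 1)
    div_term = torch.exp(torch.arange(0, dim, 2).float()
                         * (-math.log(10000.0) / dim))                  # (dim/2,)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dims: cosine
    return pe  # (n_tokens, dim)
```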

About the position embedding size, we will release our code for interpolating the pretrained position embedding very soon.
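Until then, a common approach (used, e.g., in DeiT and timm) is to bicubically interpolate the 2D grid of patch position embeddings while keeping the class-token embedding unchanged. The sketch below illustrates the idea; the function name and details are assumptions, not necessarily the code we will release:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, 1 + H*W, D), with the class-token embedding first."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pe.shape[1] ** 0.5)
    d = patch_pe.shape[2]
    # Restore the 2D token grid, interpolate it, then flatten back.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode='bicubic', align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pe, patch_pe], dim=1)
```

For example, fine-tuning at 512*512 instead of 224*224 with the same overall downsampling ratio turns a 14*14 token grid into a 32*32 one, so you would call `resize_pos_embed(pos_embed, 32)` before loading the checkpoint.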