Closed jzhang38 closed 1 year ago
Hi, thanks for pointing this out! You're right that the PIXEL finetuning models don't have their position embeddings fixed. I think the behavior is the same in the original ViT-MAE (https://github.com/facebookresearch/mae).
I don't think this affects performance (noticeably), considering that the position embeddings are still loaded from the pretrained model, so they will be finetuned from sinusoids rather than from scratch. Since they are in the lowest layer of the model, the gradients for the position embeddings are also relatively small, so they don't change too much from their original sinusoidal form.
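To illustrate the claim that finetuned position embeddings stay close to their sinusoidal initialization, here is a small, self-contained sketch. It uses the standard 1D sin-cos embedding for simplicity (the actual models use a 2D variant), and the "finetuned" tensor is just the sinusoids plus small noise to stand in for the small gradient updates described above; none of this calls the PIXEL codebase.

```python
# Sketch: quantify drift of (simulated) finetuned position embeddings
# from a fixed sinusoidal initialization.
import torch


def sincos_pos_embed(num_positions: int, dim: int) -> torch.Tensor:
    """Standard fixed sinusoidal position embeddings (1D variant)."""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / dim)
    )
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


def drift_from_sinusoids(finetuned: torch.Tensor) -> float:
    """Mean per-position cosine similarity to the sinusoidal init."""
    reference = sincos_pos_embed(*finetuned.shape)
    cos = torch.nn.functional.cosine_similarity(finetuned, reference, dim=-1)
    return cos.mean().item()


# Simulate "small gradient" finetuning: sinusoids plus tiny perturbation.
init = sincos_pos_embed(196, 768)
finetuned = init + 0.01 * torch.randn_like(init)
print(f"mean cosine similarity to sinusoids: {drift_from_sinusoids(finetuned):.3f}")
```

With a perturbation this small the mean cosine similarity stays very close to 1, which is the intuition behind "they don't change too much from their original sinusoidal form".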
You could nevertheless try setting the requires_grad attribute of the position embeddings to False after loading a finetuning model like PIXELForSequenceClassification and check whether that makes a difference downstream, or alternatively use a PIXELModel instead of a ViTModel as the base model for finetuning. Note that in the latter case you would have to make sure that mask_ratio is set to 0.0, as otherwise masking will be applied to the embeddings (this is the main reason why we use ViTModel, actually).
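A minimal sketch of the first suggestion, freezing the position embeddings after loading. The parameter-name match on "position_embeddings" is an assumption about how the embeddings are named inside the model, so it may need adjusting for the actual PIXEL classes; the loading call in the usage comment is likewise assumed, not verified.

```python
import torch
from torch import nn


def freeze_position_embeddings(model: nn.Module) -> int:
    """Set requires_grad=False on every parameter whose name contains
    'position_embeddings'; returns how many parameters were frozen."""
    frozen = 0
    for name, param in model.named_parameters():
        if "position_embeddings" in name:
            param.requires_grad = False
            frozen += 1
    return frozen


# Assumed usage, following the thread:
# model = PIXELForSequenceClassification.from_pretrained(...)
# freeze_position_embeddings(model)
```

This keeps the pretrained sinusoid-initialized embeddings loaded as usual but excludes them from gradient updates, so you can compare downstream results with and without the freeze.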
Thanks for your prompt reply! That is really helpful! I will close this issue.
For instance, PIXELForSequenceClassification is used for finetuning on GLUE. Its underlying ViT implementation is ViTModel rather than PIXELModel. It seems that PIXELModel follows MAE and uses fixed positional embeddings, while ViTModel follows the original ViT paper and uses learnable positional embeddings.