xplip / pixel

Research code for pixel-based encoders of language (PIXEL)
https://arxiv.org/abs/2207.06991
Apache License 2.0

The positional embedding is fixed during pretraining but learnable during finetuning? #8

Closed · jzhang38 closed this issue 1 year ago

jzhang38 commented 1 year ago

For instance, PIXELForSequenceClassification is used for finetuning on GLUE, and its underlying ViT implementation is ViTModel rather than PIXELModel. It seems that PIXELModel follows MAE and uses fixed sinusoidal positional embeddings, while ViTModel follows the original ViT paper and uses learnable positional embeddings.
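A minimal, self-contained sketch of that contrast (illustrative only, not the repo's code; the 529-patch / 768-dim sizes are my assumption for a PIXEL-base-like setup, and the 1D sinusoids stand in for the 2D sin-cos variant MAE uses):

```python
import torch
import torch.nn as nn


def sincos_pos_embed(num_positions: int, dim: int) -> torch.Tensor:
    """1D sinusoidal position embeddings as in 'Attention Is All You Need'.
    (MAE/PIXEL use a 2D sin-cos variant, but the contrast is the same.)"""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / dim)
    )
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


num_patches, hidden = 529, 768  # assumed sizes for a PIXEL-base-like model

# MAE/PIXELModel style: fixed sinusoids that the optimizer never updates
fixed = nn.Parameter(sincos_pos_embed(num_patches, hidden), requires_grad=False)

# Stock-ViT/ViTModel style: a trainable parameter; when pretrained PIXEL
# weights are loaded it starts from the same sinusoids but is free to move
learnable = nn.Parameter(sincos_pos_embed(num_patches, hidden))

print(fixed.requires_grad, learnable.requires_grad)  # False True
```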

xplip commented 1 year ago

Hi, thanks for pointing this out! You're right that the PIXEL finetuning models don't have their position embeddings fixed. I think the behavior is the same in the original ViT-MAE (https://github.com/facebookresearch/mae).

I don't think this noticeably affects performance, considering that the position embeddings are still loaded from the pretrained model, so they are finetuned from sinusoids rather than from scratch. Since they sit in the lowest layer of the model, the gradients for the position embeddings are also relatively small, so they don't move far from their original sinusoidal form.
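If you want to check how little they actually move, a quick comparison is possible (a sketch; the file names are placeholders, assuming you have dumped the position-embedding tensors from both checkpoints, e.g. out of their state_dicts):

```python
import torch
import torch.nn.functional as F

# Placeholder paths: position-embedding tensors saved from the pretrained
# and finetuned checkpoints.
pretrained = torch.load("pretrained_pos_embed.pt")
finetuned = torch.load("finetuned_pos_embed.pt")

# Relative L2 drift near 0 and cosine similarity near 1 would confirm that
# the embeddings stayed close to their sinusoidal initialization.
drift = (finetuned - pretrained).norm() / pretrained.norm()
cos = F.cosine_similarity(finetuned.flatten(), pretrained.flatten(), dim=0)
print(f"relative L2 drift: {drift:.4f}, cosine similarity: {cos:.4f}")
```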

You could nevertheless try setting the requires_grad attribute of the position embeddings to False after loading a finetuning model like PIXELForSequenceClassification and check whether that makes a difference downstream, or alternatively use PIXELModel instead of ViTModel as the base model for finetuning. Note that in the latter case you would have to make sure that mask_ratio is set to 0.0, as otherwise masking will be applied to the embeddings (this is actually the main reason why we use ViTModel).
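A sketch of the freezing suggestion (the import path and the `Team-PIXEL/pixel-base` checkpoint name are my assumptions about the repo and Hub layout; the parameter-name filter itself is plain PyTorch):

```python
from pixel import PIXELForSequenceClassification  # assumed import path

# Assumed checkpoint name for the released PIXEL-base weights.
model = PIXELForSequenceClassification.from_pretrained(
    "Team-PIXEL/pixel-base", num_labels=2
)

# Freeze every position-embedding parameter so finetuning leaves the
# sinusoid-initialized embeddings untouched.
for name, param in model.named_parameters():
    if "position_embeddings" in name:
        param.requires_grad = False
```

For the PIXELModel route, the key detail per the comment above is disabling masking, e.g. a config with mask_ratio=0.0 (`mask_ratio` is the attribute name in the ViT-MAE config; I'm assuming PIXEL's config mirrors it).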

jzhang38 commented 1 year ago

Thanks for your prompt reply! That is really helpful! I will close this issue.