xplip / pixel

Research code for pixel-based encoders of language (PIXEL)
https://arxiv.org/abs/2207.06991

The implementation uses 2D sincos embedding instead of 1D? #7

Closed · jzhang38 closed 1 year ago

jzhang38 commented 1 year ago

It seems that you are using 2D sincos embeddings for PIXEL here. This implies that when the patches are fed to the transformer, their position embeddings treat them as if they were laid out in a square image, even though the text is initially rendered as a single horizontal strip. Is my understanding correct?

xplip commented 1 year ago

Yes, that's right. The patches are essentially treated as a square image despite the horizontal rendering. My interpretation is that the model most likely learns to largely ignore the Y-axis of the position embeddings, since it carries no meaningful information given how the text is rendered (i.e., the text is wrapped at arbitrary positions). However, if you actually had text rendered in 2D (e.g., text that already comes in rendered form and contains line breaks, 2D fonts, etc.), the second dimension of the position embeddings could come in handy. In some internal discussions, we considered switching to 1D sinusoids, but we eventually decided to stick with the 2D sinusoids from the ViT-MAE paper to make it easier to extend the model to 2D inputs at fine-tuning time.
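For anyone landing here later: below is a minimal sketch of how 2D sincos position embeddings of this kind are typically constructed, loosely following the ViT-MAE reference implementation (the function names like `get_2d_sincos_pos_embed` mirror that code; the exact PIXEL integration may differ, and `grid_size=23` is used purely for illustration). The point to notice is that a flat sequence of N patches is indexed as a √N × √N grid, so a horizontally rendered strip of text "wraps" into rows for the purposes of the position embedding.

```python
import numpy as np

def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
    """Encode a flat array of positions with sin/cos features (ViT-MAE style)."""
    assert embed_dim % 2 == 0
    omega = np.arange(embed_dim // 2, dtype=np.float64)
    omega /= embed_dim / 2.0
    omega = 1.0 / 10000**omega                               # (D/2,) frequencies
    out = np.einsum("m,d->md", pos.reshape(-1), omega)       # (M, D/2) outer product
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)  # (M, D)

def get_2d_sincos_pos_embed(embed_dim, grid_size):
    """2D embeddings for a grid_size x grid_size grid of patches.

    Half the channels encode the row (Y) coordinate and half the column (X)
    coordinate, so a flat sequence of grid_size**2 patches is positioned
    as if it were a square image.
    """
    grid_h = np.arange(grid_size, dtype=np.float64)          # row (Y) coordinates
    grid_w = np.arange(grid_size, dtype=np.float64)          # column (X) coordinates
    grid_w, grid_h = np.meshgrid(grid_w, grid_h)             # each of shape (grid, grid)
    emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid_h)  # (N, D/2)
    emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid_w)  # (N, D/2)
    return np.concatenate([emb_h, emb_w], axis=1)            # (N, D)

# Example: 529 patches are positioned as a 23x23 grid, so patch 0 and patch 23
# share the same X (column) embedding even though the rendered text is one
# continuous horizontal strip.
pos_embed = get_2d_sincos_pos_embed(embed_dim=768, grid_size=23)
print(pos_embed.shape)  # (529, 768)
```

In this layout the Y half of each embedding only changes every `grid_size` patches, which is why, for horizontally wrapped text, it carries little linguistic signal, exactly the behavior described above.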