It seems that you are using a 2D sincos embedding for PIXEL here. This essentially implies that when the patches are fed to the transformer, they are actually reordered as a square image, even though they are initially rendered horizontally. Is my understanding correct?

Yes, that's right. They are essentially treated as a square image despite the horizontal rendering. My interpretation is that the model most likely learns to largely ignore the Y-axis of the position embeddings, as it carries no meaningful information given how the text is rendered (i.e., the text is wrapped at arbitrary positions). However, if you actually had text rendered in 2D (e.g., text that already comes in rendered form and contains line breaks, 2D fonts, etc.), the second dimension of the position embeddings could come in handy. In some internal discussions we had considered switching to 1D sinusoids, but we eventually decided to stick with the 2D sinusoids from the ViT-MAE paper to facilitate extending the model to 2D inputs at fine-tuning time.
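For reference, here is a minimal sketch of the 2D sin-cos construction along the lines of the ViT-MAE recipe. The function names, the 16x16 grid, and the 768-dim embedding are illustrative; PIXEL's actual implementation may differ in details:

```python
import numpy as np

def sincos_1d(embed_dim: int, positions: np.ndarray) -> np.ndarray:
    """Standard 1D sin-cos embedding for a flat array of positions."""
    assert embed_dim % 2 == 0
    omega = 1.0 / 10000 ** (np.arange(embed_dim // 2) / (embed_dim / 2))
    angles = np.einsum("m,d->md", positions.reshape(-1), omega)  # (M, D/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (M, D)

def sincos_2d(embed_dim: int, grid_size: int) -> np.ndarray:
    """2D sin-cos embedding over a grid_size x grid_size patch grid.
    Half the channels encode the row (Y), half the column (X)."""
    assert embed_dim % 4 == 0
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    emb_y = sincos_1d(embed_dim // 2, ys.astype(np.float64))  # row coordinate
    emb_x = sincos_1d(embed_dim // 2, xs.astype(np.float64))  # column coordinate
    return np.concatenate([emb_y, emb_x], axis=1)  # (grid_size**2, D)

# e.g., a 16x16 patch grid with a 768-dim model:
pos_embed = sincos_2d(768, 16)  # shape (256, 768)
```

Since the horizontally rendered patch sequence maps onto this grid in row-major order, the Y half of each embedding only changes once every `grid_size` patches, which is consistent with the model learning to mostly ignore it for arbitrarily wrapped text.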