It seems that you are using a 2D sincos embedding for PIXEL here. This essentially implies that when the patches are fed to the transformer, they are actually reordered as a square image, even though they are initially rendered horizontally. Is my understanding correct?

Yes, that's right. They are essentially treated as a square image despite the horizontal rendering. My interpretation is that the model most likely learns to largely ignore the Y-axis of the position embeddings, as it carries no meaningful information given how the text is rendered (i.e., the text is wrapped at arbitrary positions). However, if you actually had text rendered in 2D (e.g., text that already comes in rendered form and contains line breaks, 2D fonts, etc.), the second dimension of the position embeddings could come in handy. In some internal discussions we had considered switching to 1D sinusoids, but we eventually decided to stick with the 2D sinusoids from the ViT-MAE paper to facilitate extending the model to 2D inputs at fine-tuning time.
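For reference, here is a minimal sketch of the 2D sin-cos construction along the lines of the ViT-MAE recipe. The function names, the 16x16 grid, and the 768-dim embedding are illustrative; PIXEL's actual implementation may differ in details:

```python
import numpy as np

def sincos_1d(embed_dim: int, positions: np.ndarray) -> np.ndarray:
    """Standard 1D sin-cos embedding for a flat array of positions."""
    assert embed_dim % 2 == 0
    omega = 1.0 / 10000 ** (np.arange(embed_dim // 2) / (embed_dim / 2))
    angles = np.einsum("m,d->md", positions.reshape(-1), omega)  # (M, D/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (M, D)

def sincos_2d(embed_dim: int, grid_size: int) -> np.ndarray:
    """2D sin-cos embedding over a grid_size x grid_size patch grid.
    Half the channels encode the row (Y), half the column (X)."""
    assert embed_dim % 4 == 0
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    emb_y = sincos_1d(embed_dim // 2, ys.astype(np.float64))  # row coordinate
    emb_x = sincos_1d(embed_dim // 2, xs.astype(np.float64))  # column coordinate
    return np.concatenate([emb_y, emb_x], axis=1)  # (grid_size**2, D)

# e.g., a 16x16 patch grid with a 768-dim model:
pos_embed = sincos_2d(768, 16)  # shape (256, 768)
```

Since the horizontally rendered patch sequence maps onto this grid in row-major order, the Y half of each embedding only changes once every `grid_size` patches, which is consistent with the model learning to mostly ignore it for arbitrarily wrapped text.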