Closed chengengliu closed 6 months ago
Hi @chengengliu, thanks for your interest!
Yes, LITv2 does not explicitly apply any positional embedding, e.g., relative/absolute positional embedding. I believe there is potential to improve performance by introducing an absolute positional embedding before attention. However, when designing LITv2, our first principle was to keep high throughput while maintaining performance. Therefore: 1) a fixed-size relative/absolute positional embedding may not scale well across image resolutions, and 2) interpolation is a slow operation on hardware, especially when a model needs to generalize to different image resolutions.
I hope this helps.
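To make point 2) concrete, here is a minimal sketch (with hypothetical sizes, not LITv2 code) of the resize step that a fixed-size absolute positional embedding forces on models evaluated at a different resolution; this is the overhead LITv2 avoids by not using such an embedding:

```python
import torch
import torch.nn.functional as F

# Hypothetical absolute positional embedding learned for a fixed 14x14
# token grid -- the kind of fixed-size table LITv2 deliberately omits.
embed_dim = 96
pos_embed = torch.randn(1, 14 * 14, embed_dim)

def resize_pos_embed(pos_embed, old_hw, new_hw):
    """Bicubically interpolate a fixed-size positional embedding to a
    new token-grid resolution (the slow step on dense prediction tasks)."""
    b, n, c = pos_embed.shape
    # (B, N, C) -> (B, C, H, W) so F.interpolate can resize spatially
    grid = pos_embed.reshape(b, old_hw[0], old_hw[1], c).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=new_hw, mode="bicubic", align_corners=False)
    # back to a token sequence: (B, C, H', W') -> (B, H'*W', C)
    return grid.permute(0, 2, 3, 1).reshape(b, new_hw[0] * new_hw[1], c)

# A detection-style input with a larger token grid triggers the resize.
resized = resize_pos_embed(pos_embed, (14, 14), (32, 32))
print(resized.shape)  # torch.Size([1, 1024, 96])
```

Because the target grid changes with input resolution, this interpolation cannot be folded away at export time and runs on every differently sized input.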
Thanks for the reply! I got your point now.
Hi, thanks for the excellent work! I have one question regarding position embedding. In the paper, you said that:
"Besides, we find the fixed relative positional encoding in LITv1 dramatically slows down its speed on dense prediction tasks due to the interpolation for different image resolutions. For better efficiency, we propose to adopt one 3 × 3 depthwise convolutional layer with zero-padding in each FFN to incorporate the implicitly learned position information from zero-padding".
And I notice that you use a DWConv in the MLP block; does that mean that when doing HiLo attention, you did not add any position information, but instead relied only on that MLP to provide it? Why not use an absolute position embedding before attention? Thanks!
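The mechanism quoted above can be sketched as follows; this is a minimal illustration of an FFN with a 3×3 zero-padded depthwise convolution, not the official LITv2 implementation, and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Sketch of an FFN whose hidden layer carries a 3x3 depthwise
    convolution with zero-padding, so position information is encoded
    implicitly by the padding rather than by an explicit embedding."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # groups=hidden_dim makes the conv depthwise; padding=1 supplies
        # the zero-padding that leaks absolute position into the features.
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N == H * W
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)  # tokens -> 2D feature map
        x = self.dwconv(x)                         # spatial mixing + padding cue
        x = x.flatten(2).transpose(1, 2)           # back to (B, N, C)
        return self.fc2(self.act(x))

ffn = ConvFFN(dim=96, hidden_dim=384)
tokens = torch.randn(2, 14 * 14, 96)
out = ffn(tokens, 14, 14)
print(out.shape)  # torch.Size([2, 196, 96])
```

Since the conv kernel and padding are resolution-independent, this works at any input size without the interpolation an absolute embedding would require.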