ziplab / LITv2

[NeurIPS 2022 Spotlight] This is the official PyTorch implementation of "Fast Vision Transformers with HiLo Attention"
Apache License 2.0

Quick question regarding position embedding #16

Closed chengengliu closed 6 months ago

chengengliu commented 6 months ago

Hi, thanks for the excellent work! I have one question regarding position embedding. In the paper, you said that:

"Besides, we find the fixed relative positional encoding in LITv1 dramatically slows down its speed on dense prediction tasks due to the interpolation for different image resolutions. For better efficiency, we propose to adopt one 3 × 3 depthwise convolutional layer with zero-padding in each FFN to incorporate the implicitly learned position information from zero-padding".

And I notice that you use a DWConv in the MLP block. Does that mean that when doing HiLo attention, you did not add any position information, relying only on the MLP to provide it? Why not use an absolute position embedding before attention? Thanks!

HubHop commented 6 months ago

Hi @chengengliu, thanks for your interest!

Yes, LITv2 does not explicitly apply any positional embedding, e.g., relative/absolute positional embedding. I believe there is potential to improve performance by introducing an absolute embedding before attention. However, when designing LITv2, our first principle was to keep high throughput while maintaining performance. Therefore, 1) a fixed-size relative/absolute positional embedding may not scale well to different image resolutions, and 2) interpolation is a slow operation on hardware, especially when a model needs to generalize to different image resolutions.
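To make point 2) concrete, here is a minimal sketch of the resizing step a fixed positional-embedding table would need at a new input resolution. The grid sizes and embedding dimension are made up for illustration; this is not code from the repo.

```python
import torch
import torch.nn.functional as F

# A fixed absolute positional embedding, learned for a 14x14 token grid
# with embedding dimension 64 (sizes assumed for illustration).
pos_embed = torch.randn(1, 64, 14, 14)

# Running the model at a different resolution (e.g. a 24x24 token grid,
# common in dense prediction) requires resizing the table -- this is the
# interpolation step that hurts throughput and ties the model to a
# resolution-dependent preprocessing step.
resized = F.interpolate(pos_embed, size=(24, 24), mode="bicubic", align_corners=False)
print(resized.shape)  # torch.Size([1, 64, 24, 24])
```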

I hope this helps.
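For reference, the idea of replacing explicit positional embeddings with a 3×3 zero-padded depthwise conv inside the FFN can be sketched as below. This is a simplified illustration of the mechanism described in the paper, not the repo's exact implementation; layer names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """FFN with a 3x3 depthwise conv (zero padding) between the two linear
    layers, so position information is encoded implicitly: zero-padding at
    the borders lets each token infer where it sits in the feature map."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # groups=hidden_dim makes the conv depthwise; padding=1 supplies
        # the zero-padding that leaks position information to the tokens.
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N == H * W
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)  # tokens -> feature map
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)           # feature map -> tokens
        x = self.act(x)
        return self.fc2(x)

ffn = ConvFFN(dim=64, hidden_dim=256)
out = ffn(torch.randn(2, 14 * 14, 64), H=14, W=14)
print(out.shape)  # torch.Size([2, 196, 64])
```

Because the conv is local and resolution-agnostic, no interpolation is needed when the input resolution changes.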

chengengliu commented 6 months ago


Thanks for the reply! I got your point now.