yangsenius / TransPose

PyTorch Implementation for "TransPose: Keypoint localization via Transformer", ICCV 2021.
https://github.com/yangsenius/TransPose/releases/download/paper/transpose.pdf
MIT License

position embedding #5

Closed: lingorX closed this issue 3 years ago

lingorX commented 3 years ago

Hi, first of all, thank you for making this work open-source. I notice that the position embeddings are added to the sequence at each Transformer layer. In BERT or ViT, however, this PE addition is performed only once, before the sequence is sent into the Transformer encoder. I wonder what the motivation behind this design is.
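For reference, here is a minimal sketch (hypothetical shapes and names, not taken from either codebase) of the BERT/ViT-style pattern referred to above, where the PE is injected only once before the encoder stack:

```python
import torch
from torch import nn

# Hypothetical shapes and names, only to illustrate the BERT/ViT pattern:
# the position embedding is added to the token sequence once, before the
# encoder stack; the individual layers never see it again.
seq_len, batch, d_model = 64 * 48, 1, 256
pos_embedding = nn.Parameter(torch.zeros(seq_len, 1, d_model))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8), num_layers=4)

tokens = torch.randn(seq_len, batch, d_model)  # flattened feature tokens
out = encoder(tokens + pos_embedding)          # PE injected only at the input
```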

lingorX commented 3 years ago

[W TensorIterator.cpp:918] Warning: Mixed memory format inputs detected while calling the operator. The operator will output contiguous tensor even if some of the inputs are in channels_last format. (function operator())
[W TensorIterator.cpp:924] Warning: Mixed memory format inputs detected while calling the operator. The operator will output channels_last tensor even if some of the inputs are not in channels_last format. (function operator())

https://github.com/yangsenius/TransPose/blob/904eb4b286aafa01f3b0c45e9f8a914a6443621b/lib/models/transpose_h.py#L522
https://github.com/yangsenius/TransPose/blob/904eb4b286aafa01f3b0c45e9f8a914a6443621b/lib/models/transpose_h.py#L523
Maybe .contiguous() is needed here.
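For context, a minimal sketch of what the suggestion amounts to (made-up shapes, not the repository code): when one operand of the sum is in channels_last memory format and the other is contiguous, PyTorch can warn about mixed memory formats, and calling .contiguous() on one operand puts both in the same layout.

```python
import torch

# Hypothetical shapes, only to illustrate the suggestion above: if the feature
# map is in channels_last memory format while the position embedding has the
# default contiguous layout, their sum mixes memory formats and can trigger
# the warning quoted above.
x = torch.randn(1, 256, 64, 48).to(memory_format=torch.channels_last)
pos_embedding = torch.randn(1, 256, 64, 48)  # default contiguous layout

# Possible fix: normalize the layout of one operand before the addition.
out = x.contiguous() + pos_embedding
print(out.is_contiguous())  # True
```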

yangsenius commented 3 years ago

Hi~ @0liliulei.

From the view of the whole Transformer encoder, injecting the PE into the input sequence only once is enough. However, each self-attention layer is itself permutation-equivariant with respect to its input sequence. We think it is beneficial to add a consistent position embedding to all attention layers, since human pose estimation is a localization task rather than an image classification task (as in ViT). Such a task may be sensitive to position information, particularly in the last few layers. This is our motivation.

We also empirically find that adding the PE at each layer performs slightly better than adding it only to the initial input. In addition, DETR conducts ablations on this, and its results show that adding the PE to each attention layer is better.
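For illustration, a minimal sketch of this per-layer injection (hypothetical module and parameter names, not the repository code), in the style of a DETR-like encoder layer where the PE is added to queries and keys inside every layer:

```python
import torch
from torch import nn

class EncoderLayerWithPE(nn.Module):
    """Minimal sketch (hypothetical names, not the repository code) of an
    encoder layer that re-injects the position embedding at every layer,
    as in DETR-style encoders: the PE is added to queries and keys only."""
    def __init__(self, d_model=256, nhead=8, dim_feedforward=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward), nn.ReLU(),
            nn.Linear(dim_feedforward, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, pos):
        # The same PE is added inside every layer, not only once at the
        # input; the values are left without PE.
        q = k = x + pos
        attn_out, _ = self.self_attn(q, k, value=x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x

# The identical `pos` tensor is passed to every layer of the stack.
layers = nn.ModuleList([EncoderLayerWithPE() for _ in range(4)])
x = torch.randn(64 * 48, 1, 256)    # (sequence length, batch, d_model)
pos = torch.randn(64 * 48, 1, 256)  # fixed or learned position embedding
for layer in layers:
    x = layer(x, pos)
```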

Regarding the warning you reported, I haven't encountered this problem myself, and I am not sure what in your setup causes it. Thank you very much for your suggestion!

lingorX commented 3 years ago

Thank you for your answer. The PE is indeed important for training a Transformer.