yangsenius / TransPose

PyTorch Implementation for "TransPose: Keypoint localization via Transformer", ICCV 2021.
https://github.com/yangsenius/TransPose/releases/download/paper/transpose.pdf
MIT License

About the image patch #23

Closed by wangdong0556 3 years ago

wangdong0556 commented 3 years ago

Thank you for your excellent work. I noticed something while reading the code. In ViT, to handle 2D images, the image x ∈ R^(H×W×C) is reshaped into a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C)), where N = HW/P² and (P, P) is the resolution of each image patch. In this method, the image x ∈ R^(H×W×C) is instead reshaped into a sequence of flattened 2D patches x_p ∈ R^(C×(HW)) before the embedding is applied. Does that mean the resolution of each image patch is (1, 1)?
What are the benefits of this setup?

yangsenius commented 3 years ago

Hi. TransPose uses a shallow CNN to downsample the input image to an H/r × W/r resolution before sending it to the Transformer. It is closer to the ViT hybrid architecture, which combines a ResNet and a Transformer. You can think of the patch size as (1, 1) for the downsampled image feature maps.
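To make the (1, 1)-patch interpretation concrete, here is a minimal sketch (not code from this repo) of how a downsampled CNN feature map can be flattened into a token sequence: every spatial location of the H/r × W/r map becomes one token of dimension C. The sizes below are illustrative assumptions, not values from TransPose, and numpy stands in for torch since only the reshape semantics matter.

```python
import numpy as np

# Illustrative (assumed) sizes: batch B, channels C, input H x W, downsample ratio r
B, C, H, W, r = 2, 256, 256, 192, 4
Hf, Wf = H // r, W // r          # feature-map resolution after the shallow CNN

feat = np.zeros((B, C, Hf, Wf))  # stand-in for the CNN output feature maps

# Each (1, 1) spatial location becomes one token:
# (B, C, Hf, Wf) -> (B, C, Hf*Wf) -> (Hf*Wf, B, C), the usual
# (sequence_length, batch, embed_dim) layout for a Transformer encoder
tokens = feat.reshape(B, C, Hf * Wf).transpose(2, 0, 1)
print(tokens.shape)  # (3072, 2, 256): 64*48 tokens, each a C-dim vector
```

So there is no explicit P×P patch-embedding step as in vanilla ViT; the CNN backbone plays the role of the patch embedding, and each feature-map pixel is its own token.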

wangdong0556 commented 3 years ago

I understand, thank you for your reply!