Closed wangdong0556 closed 3 years ago
Hi. TransPose uses a shallow CNN to downsample the input image to a resolution of H/r × W/r before sending it to the Transformer. This is closer to the ViT hybrid architecture, which combines a ResNet with a Transformer. You can think of the patch size as (1, 1) on the downsampled feature maps.
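To make the comparison concrete, here is a rough sketch of the shape arithmetic only (no real model code). The function names and the example numbers (a 256×192 input, r = 8, 256 feature channels, ViT patch size 16) are my own assumptions for illustration, not values taken from the repository.

```python
def vit_tokens(h, w, c, p):
    """Sequence length and per-token width when an H x W x C input
    is cut into non-overlapping P x P patches and each patch is flattened."""
    assert h % p == 0 and w % p == 0, "patch size must divide H and W"
    return (h // p) * (w // p), p * p * c


def feature_map_tokens(h, w, r, d):
    """Hypothetical helper: a CNN first downsamples the image by r and
    outputs d channels; every spatial position then becomes one token,
    which is the same as using (1, 1) patches on the feature map."""
    return vit_tokens(h // r, w // r, d, p=1)


# Assumed example sizes, for illustration only.
n_vit, dim_vit = vit_tokens(256, 192, 3, p=16)
n_fm, dim_fm = feature_map_tokens(256, 192, r=8, d=256)
print(n_vit, dim_vit)  # tokens and width for raw-image 16x16 patches
print(n_fm, dim_fm)    # tokens and width for (1, 1) patches on the feature map
```

With (1, 1) patches the sequence length is simply (H/r)·(W/r) and each token is the d-dimensional feature vector at one position, so no extra patch flattening is needed.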
I understand, thank you for your reply!
Thank you for your excellent work. I have a question from reading the code: in ViT, to handle 2D images, the image x ∈ R^(H×W×C) is reshaped into a sequence of flattened 2D patches x_p ∈ R^(N×(P^2·C)), where N = HW/P^2 and (P, P) is the resolution of each image patch. In this method, the image x ∈ R^(H×W×C) is reshaped into a sequence x_p ∈ R^(C×(HW)), and then the embedding is performed. Does that mean the resolution of each image patch here is (1, 1)?
What are the benefits of this setup?
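For reference, the ViT-style reshape described in the question can be sketched in plain Python on nested lists. This is only a toy illustration under my own assumptions (the `patchify` name and the 4×4×2 example are hypothetical, not code from either project); it shows that with P = 1 the patch sequence is just the row-major flattening of the H×W grid, with one C-dimensional token per position.

```python
def patchify(image, p):
    """image: H x W x C nested lists.
    Returns N = HW / p^2 flattened patches, each of length p * p * C."""
    h, w = len(image), len(image[0])
    assert h % p == 0 and w % p == 0, "patch size must divide H and W"
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            patch = []
            # Walk the p x p window row by row and append every channel value.
            for i in range(top, top + p):
                for j in range(left, left + p):
                    patch.extend(image[i][j])
            patches.append(patch)
    return patches


# Toy 4 x 4 image with 2 channels; values are just running indices.
image = [[[(i * 4 + j) * 2 + k for k in range(2)] for j in range(4)]
         for i in range(4)]

patches_2 = patchify(image, 2)  # N = 4 tokens, each of width 2*2*2 = 8
patches_1 = patchify(image, 1)  # N = 16 tokens, each of width C = 2
print(len(patches_2), len(patches_2[0]))
print(len(patches_1), len(patches_1[0]))
```

Note that this stores the sequence as HW rows of width C; the C×(HW) layout in the question is the transpose of the same data.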