yangsenius / TransPose

PyTorch Implementation for "TransPose: Keypoint localization via Transformer", ICCV 2021.
https://github.com/yangsenius/TransPose/releases/download/paper/transpose.pdf
MIT License

Question about TransformerEncoder input #27

Closed. VGrondin closed this issue 2 years ago.

VGrondin commented 2 years ago

Hi, first off thank you for this great work!

I'm trying to integrate the Transformer part of your work into a Mask R-CNN model. Using a Swin backbone, the RPN gives me n bounding box proposals, and for these proposals I have bbox features of shape [n, 256, 14, 14] extracted by the backbone. Now, for each bbox, I would like to get the keypoints with:

    def forward(self, x):
        # x: ROI features of shape [n, 256, 14, 14]; the backbone stages below are bypassed
        # x = self.conv1(x)
        # x = self.bn1(x)
        # x = self.relu(x)
        # x = self.maxpool(x)

        # x = self.layer1(x)
        # x = self.layer2(x)
        # x = self.reduce(x)

        n, c, h, w = x.shape
        x = x.flatten(2).permute(2, 0, 1)                      # [n, c, h, w] -> [h*w, n, c]
        x = self.global_encoder(x, pos=self.pos_embedding)     # Transformer encoder over h*w tokens
        x = x.permute(1, 2, 0).contiguous().view(n, c, h, w)   # back to [n, c, h, w]
        x = self.deconv_layers(x)                              # upsample
        x = self.final_layer(x)                                # keypoint heatmaps

        return x

I'm getting an error at the line `self.global_encoder(x, pos=self.pos_embedding)`: "The size of tensor a (196) must match the size of tensor b (1024) at non-singleton dimension 0". The x input to `self.global_encoder(x, pos=self.pos_embedding)` has shape [196, n, 256], which seems wrong? I tried with shape [n, 256, 196] but it doesn't work either. What am I missing?
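
For reference, here is a minimal sketch of the tensor shapes that the forward code above produces, with hypothetical sizes (n = 4 proposals, 256-channel 14x14 ROI features); this is not the repo's code, just the flatten/permute step in isolation:

    import torch

    n, c, h, w = 4, 256, 14, 14        # n box proposals, 256-channel 14x14 ROI features
    x = torch.randn(n, c, h, w)

    # same reshaping as in forward(): [n, c, h, w] -> [n, c, h*w] -> [h*w, n, c]
    tokens = x.flatten(2).permute(2, 0, 1)
    print(tokens.shape)                # torch.Size([196, 4, 256])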

yangsenius commented 2 years ago

Hi, @VGrondin

I suspect that the shape of the position embedding does not match the shape of x.

Please check that you have passed the correct shape parameters to this function: https://github.com/yangsenius/TransPose/blob/dab9007b6f61c9c8dce04d61669a04922bbcd148/lib/models/transpose_r.py#L296
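
As a rough sketch (not the exact code in the link above): the position embedding needs one entry per token, i.e. h*w of the feature map that is actually fed to the encoder. The 1024 in your error is consistent with an embedding built for a 256x256 input downsampled to 32x32 = 1024 positions, whereas a 14x14 feature map gives only 196 tokens. A learnable embedding of the matching length could look like:

    import torch
    import torch.nn as nn

    d_model = 256
    h, w = 14, 14                             # size of the feature map fed to the encoder,
                                              # not the size of the input image

    # one position per token, broadcast over the batch dimension: [h*w, 1, d_model]
    pos_embedding = nn.Parameter(torch.zeros(h * w, 1, d_model))

    tokens = torch.randn(h * w, 4, d_model)   # [196, 4, 256], as in the forward pass above
    out = tokens + pos_embedding              # shapes agree; no size-mismatch error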

VGrondin commented 2 years ago

Yes, you are right! I had MODEL.IMAGE_SIZE = [256, 256], but in my case the input is features from the backbone, so it should be [14, 14]. I'm curious to see how well it will perform with such a small size.

Thanks for the help