wusize / ovdet

[CVPR2023] Code Release of Aligning Bag of Regions for Open-Vocabulary Object Detection

What is the difference between clip_text_features and clip_word_tokens? #27

Open kinredon opened 1 year ago

kinredon commented 1 year ago

Hi, thanks for your excellent work.

I am reading the code of this project, but I do not understand the difference between clip_text_features and clip_word_tokens in this line:

clip_text_features, clip_word_tokens = \
    text_encoder.encode_pseudo_text(pseudo_text, end_token_ids,
                                    text_pe=True, normalize=True,

My understanding is as follows. clip_text_features are the CLIP text encoder's features for the whole pseudo text, while clip_word_tokens are the features for each particular class name, picked out using end_token_ids as the index. So clip_text_features represents the text feature for the bag of regions, whereas clip_word_tokens represents the text feature for a single proposal. Is this understanding correct?

More importantly, the implementation of clip_word_tokens confuses me. In these lines:

def forward(self, x, return_tokens=False, cls_indices=None, attn_masks=None):
    att, tokens = self.attention(self.ln_1(x), return_tokens, attn_masks=attn_masks)
    if return_tokens:
        assert cls_indices is not None
        if not isinstance(cls_indices, int):
            assert len(cls_indices) == x.shape[1]   # x: LNC
        cls_tokens = x[cls_indices, torch.arange(x.shape[1])]
        tokens = cls_tokens[None] + tokens
        tokens = tokens + self.mlp(self.ln_2(tokens))

        x = x + att
        x = x + self.mlp(self.ln_2(x))

        return x, tokens
    else:
        assert tokens is None
        x = x + att
        # x = x + self.attention(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))

        return x, None
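
To make the question concrete, here is how I read the indexing on the cls_tokens line, as a minimal sketch. The sizes, the index values, and the shape of tokens are made up for illustration and are not taken from the repository; only the x: LNC layout comes from the comment in the code above.

```python
import torch

# Hypothetical toy sizes: L tokens per sequence, N sequences, C channels.
L, N, C = 5, 3, 4
x = torch.arange(L * N * C, dtype=torch.float32).reshape(L, N, C)  # x: LNC

# One token position per sequence (made-up values standing in for cls_indices).
cls_indices = torch.tensor([4, 2, 0])

# Advanced indexing pairs cls_indices[n] with sequence n, so the result holds
# one C-dimensional vector per sequence.
cls_tokens = x[cls_indices, torch.arange(N)]
print(cls_tokens.shape)  # torch.Size([3, 4])

# Equivalent explicit loop, for comparison.
expected = torch.stack([x[int(cls_indices[n]), n] for n in range(N)])
assert torch.equal(cls_tokens, expected)

# cls_tokens[None] adds a leading axis of size 1, so it broadcasts against a
# tensor with any leading length (here an assumed length of 2).
tokens = torch.zeros(2, N, C)
out = cls_tokens[None] + tokens
print(out.shape)  # torch.Size([2, 3, 4])
```

If this reading is right, every row of tokens gets the same per-sequence vector added to it as a residual, and that is the part I would like to understand.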

Could the author provide an explanation of this implementation, or point to a paper that describes it? That would help me a lot, thanks!