Hi, thanks for your excellent work.
I am reading the code of this project and am confused by the difference between clip_text_features and clip_word_tokens. My current understanding: clip_text_features are the features from the CLIP text encoder for the whole text, while clip_word_tokens are the features for a particular class name (using end_token_ids as the index). So clip_text_features can represent the text feature for the bag of regions, while clip_word_tokens represents the text feature for a single proposal. Do I understand this correctly?
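For concreteness, here is how I currently picture the two quantities; this is a toy sketch under my own assumptions (the shapes, the LNC layout, and the index values below are hypothetical, not the project's actual API):

import torch

# Toy per-token outputs of the CLIP text transformer, in LNC layout (my assumption)
L, N, C = 77, 2, 512
x = torch.randn(L, N, C)

# clip_text_features: one vector per whole text, read off at the
# end-of-text position (positions chosen purely for illustration)
eot_positions = torch.tensor([20, 24])
clip_text_features = x[eot_positions, torch.arange(N)]   # (N, C)

# clip_word_tokens: the features at the positions where each class
# name ends inside the text, one per proposal (hypothetical indices)
end_token_ids = torch.tensor([5, 7])
clip_word_tokens = x[end_token_ids, torch.arange(N)]     # (N, C)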
More importantly, the implementation behind clip_word_tokens confuses me, in these lines:
def forward(self, x, return_tokens=False, cls_indices=None, attn_masks=None):
    att, tokens = self.attention(self.ln_1(x), return_tokens, attn_masks=attn_masks)
    if return_tokens:
        assert cls_indices is not None
        if not isinstance(cls_indices, int):
            assert len(cls_indices) == x.shape[1]  # x: LNC
        cls_tokens = x[cls_indices, torch.arange(x.shape[1])]
        tokens = cls_tokens[None] + tokens
        tokens = tokens + self.mlp(self.ln_2(tokens))
        x = x + att
        x = x + self.mlp(self.ln_2(x))
        return x, tokens
    else:
        assert tokens is None
        x = x + att
        # x = x + self.attention(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x, None
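To make my confusion concrete, here is how I currently read the return_tokens branch, as a standalone toy sketch (the shapes, the random tensors, and in particular my guess that tokens has the same LNC shape as x are assumptions, not the actual intermediate values):

import torch
import torch.nn as nn

L, N, C = 10, 3, 8
x = torch.randn(L, N, C)               # token features, LNC layout
att = torch.randn(L, N, C)             # attention output for the main stream
tokens = torch.randn(L, N, C)          # per-token attention values (my guess at the shape)
cls_indices = torch.tensor([2, 4, 3])  # one class-name position per sequence

mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))
ln_2 = nn.LayerNorm(C)

# 1) gather the class-name token from each sequence: (N, C)
cls_tokens = x[cls_indices, torch.arange(N)]
# 2) broadcast it over the length dimension as a residual for the token stream
tokens = cls_tokens[None] + tokens     # (L, N, C)
tokens = tokens + mlp(ln_2(tokens))    # token stream gets its own MLP residual
# 3) the main stream is updated independently, as in a standard CLIP block
x = x + att
x = x + mlp(ln_2(x))

If I read it right, the tokens stream runs in parallel to the main x stream and shares the same ln_2 and mlp, but I do not understand why cls_tokens is added to every token position.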
Could the author provide some explanation, or point to papers, for this implementation? That would really help me a lot, thanks!