yangli18 / VLTVG

Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, CVPR 2022

Question about Eq.1 #10

Open LeungWaiHo opened 1 year ago

LeungWaiHo commented 1 year ago

I noticed the phrase "compute their semantic correlation as the verification score" regarding Eq. 1. My questions: 1) Does it work as a similarity function? 2) Could it be replaced by other similarity functions, such as cosine similarity?

yangli18 commented 1 year ago

@LeungWaiHo A1: Yes, it can be understood as a similarity function, which measures how relevant each visual feature is to the content described in the text. A2: The output of this function, which measures the correlation/relevance between visual and text features, should range from 0 to 1, yet the cosine function has an output range of [-1, 1]. We use Eq.1 to adapt the cosine similarity outputs to [0, 1]. You can try other functions with similar effects.
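To make the answer concrete, here is a minimal sketch of that idea: compute the cosine similarity, then remap it into a positive score with a Gaussian-like function so that perfect alignment (cosine = 1) gives the maximum score. The function and argument names (`verification_score`, `scale`, `power`, `sigma`) are illustrative placeholders, not the exact API of the repository, and the default hyperparameter values are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def verification_score(img_embed, text_embed, scale=1.0, power=2.0, sigma=0.5):
    # Cosine similarity between visual and textual features, range [-1, 1].
    cos = (F.normalize(img_embed, p=2, dim=-1) *
           F.normalize(text_embed, p=2, dim=-1)).sum(dim=-1, keepdim=True)
    # Gaussian-like remapping: cos = 1 yields the maximum score `scale`;
    # lower similarity decays the score toward 0, so the output lies in (0, scale].
    return scale * torch.exp(-(1 - cos).pow(power) / (2 * sigma ** 2))
```

With this mapping, identical feature directions score `scale`, while opposite directions decay toward (but never below) zero, matching the desired [0, 1]-style range.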

wildwolff commented 1 year ago

I cannot understand why S(x, y) in Eq. 1 can be seen as the relevance score. Also, the code computes verify_score by element-wise multiplication without a transpose, which is a little different from Eq. 1. Could you explain it further? Thanks a lot!

```python
text_embed = self.text_proj(text_info)
img_embed = self.img_proj(img_feat)
verify_score = (F.normalize(img_embed, p=2, dim=-1) *
                F.normalize(text_embed, p=2, dim=-1)).sum(dim=-1, keepdim=True)
verify_score = self.tf_scale * \
    torch.exp(- (1 - verify_score).pow(self.tf_pow)
              / (2 * self.tf_sigma**2))
```

yangli18 commented 1 year ago

It's just a matter of implementation. The expression inside Eq. 1 essentially computes the inner product of two feature vectors. Alternatively, you can use bmm after transposing the vectors ([Bx1xC] * [BxCx1] = [Bx1x1]), which is equivalent to the way I implemented it.
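The equivalence described above can be checked directly: multiplying element-wise and summing over the channel dimension gives the same batched inner product as `torch.bmm` on reshaped tensors. A minimal sketch (the tensor names here are illustrative, not from the repository):

```python
import torch

B, C = 4, 8
x = torch.randn(B, C)  # e.g. projected image features
y = torch.randn(B, C)  # e.g. projected text features

# Element-wise multiply, then sum over the channel dim (as in the released code).
score_a = (x * y).sum(dim=-1, keepdim=True)  # shape [B, 1]

# Batched inner product via bmm: [Bx1xC] @ [BxCx1] = [Bx1x1], then squeeze to [B, 1].
score_b = torch.bmm(x.unsqueeze(1), y.unsqueeze(2)).squeeze(-1)  # shape [B, 1]

print(torch.allclose(score_a, score_b, atol=1e-6))
```

Both forms compute the same per-sample dot product; the element-wise version simply avoids the extra reshapes.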