Hi @pangzss, thanks for your interest in our work!
You are correct, the loss under the 'col' direction is indeed counterintuitive (and perhaps unreasonable due to the softmax dimension problem you mentioned). For this part we simply followed UniVTG's implementation, and we also verified that including or omitting 'col' does not affect model performance.
While browsing the code base, I was confused by the SampledNCE loss function in this part:
https://github.com/yeliudev/R2-Tuning/blob/03fbcf68b82e0d72a572b2a33f7f0f603681dcd1/models/loss.py#L47-L59
The input video_emb and query_emb have shapes (B, T, C) and (B, 1, C). i_sim is the per-sample cosine similarity between video_emb and query_emb and has shape (B, T), i.e., i_sim[b] = video_emb[b] @ query_emb[b].T.
So the cosine similarities are essentially intra-video: a query embedding only interacts with its associated video's visual embeddings, not with those of other videos in the batch.
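A minimal sketch of that shape relationship (assuming L2-normalized embeddings and random stand-in tensors, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

# Stand-in shapes from the description above (hypothetical sizes).
B, T, C = 4, 8, 256
video_emb = F.normalize(torch.randn(B, T, C), dim=-1)  # (B, T, C)
query_emb = F.normalize(torch.randn(B, 1, C), dim=-1)  # (B, 1, C)

# Per-sample similarity: each query only sees the clips of its own video.
i_sim = torch.einsum('btc,bqc->bt', video_emb, query_emb)  # (B, T)

# Equivalent to video_emb[b] @ query_emb[b].T for each b.
assert torch.allclose(i_sim[0], (video_emb[0] @ query_emb[0].T).squeeze(-1))
```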
When 'col' is in self.direction, it seems that you are trying to compute an inter-video contrastive loss by transposing i_sim and taking the softmax along the batch dimension. However, as mentioned above, the cosine similarities are computed independently for each batch sample. Let us denote the query embedding as $\mathbf{q}$, the video embedding as $\mathbf{v}$, and the positive clip indices as $\mathbf{p}$. The denominator of the softmax in this case would be something like:
$$\exp(\mathbf{q}[0]\cdot \mathbf{v}[0,\mathbf{p}[0]]) + \exp(\mathbf{q}[1]\cdot \mathbf{v}[1,\mathbf{p}[1]]) + \cdots$$
where 0 and 1 are batch indices. Therefore, there is no common anchor sample in this contrastive loss. Intuitively, minimizing a loss computed this way should not bring any effect, since all terms in the denominator are decoupled (see the sketch below).
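To make the decoupling concrete, here is a hypothetical numerical illustration (not the repo's code) of the denominator structure under this reading, where pos_sim[b] stands for $\mathbf{q}[b]\cdot \mathbf{v}[b,\mathbf{p}[b]]$:

```python
import torch

B = 4
pos_sim = torch.randn(B)  # stand-in for q[b] . v[b, p[b]], one scalar per sample

# Softmax along the batch dimension: the denominator for every sample is
# exp(pos_sim[0]) + exp(pos_sim[1]) + ..., i.e. a sum of terms that each
# come from a different (query, video) pair, so no single query or video
# serves as a shared anchor across the terms.
denom = torch.exp(pos_sim).sum()
col_loss = -torch.log(torch.exp(pos_sim) / denom)  # per-sample 'col'-style loss
print(col_loss)
```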
I would appreciate any clarification on this point. Thank you.