Hi @pangzss, thanks for your interest in our work!
You are correct, the loss under the 'col' direction is indeed counterintuitive (and perhaps unreasonable due to the softmax dimension problem you mentioned). For this part we simply followed UniVTG's implementation, and we also verified that including or omitting 'col' does not affect model performance.
While browsing the code base, I was confused by the SampledNCE loss function in this part:
https://github.com/yeliudev/R2-Tuning/blob/03fbcf68b82e0d72a572b2a33f7f0f603681dcd1/models/loss.py#L47-L59
The input video_emb and query_emb have shapes (B, T, C) and (B, 1, C). i_sim is the per-sample cosine similarity between video_emb and query_emb and has shape (B, T), i.e., i_sim[b] = video_emb[b] @ query_emb[b].T.
So the cosine similarities are essentially intra-video: a query embedding only interacts with its associated video's visual embeddings, not with those of other videos in the batch.
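A minimal sketch of that shape relationship (assuming L2-normalized embeddings and random stand-in tensors, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

# Stand-in shapes from the description above (hypothetical sizes).
B, T, C = 4, 8, 256
video_emb = F.normalize(torch.randn(B, T, C), dim=-1)  # (B, T, C)
query_emb = F.normalize(torch.randn(B, 1, C), dim=-1)  # (B, 1, C)

# Per-sample similarity: each query only sees the clips of its own video.
i_sim = torch.einsum('btc,bqc->bt', video_emb, query_emb)  # (B, T)

# Equivalent to video_emb[b] @ query_emb[b].T for each b.
assert torch.allclose(i_sim[0], (video_emb[0] @ query_emb[0].T).squeeze(-1))
```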
When 'col' is in self.direction, it seems that you are trying to compute an inter-video contrastive loss by transposing i_sim and taking the softmax along the batch dimension. However, as mentioned above, the cosine similarities are computed independently for each batch sample. Let us denote the query embedding as $\mathbf{q}$, the video embedding as $\mathbf{v}$, and the positive clip indices as $\mathbf{p}$. The denominator of the softmax in this case would be something like:
$$\exp(\mathbf{q}[0]\cdot \mathbf{v}[0,\mathbf{p}[0]]) + \exp(\mathbf{q}[1]\cdot \mathbf{v}[1,\mathbf{p}[1]]) + \cdots$$
where 0 and 1 are batch indices. Therefore, there is no common anchor sample in this contrastive loss. Intuitively, minimizing a loss computed this way should not bring any effect, since all terms in the denominator are decoupled (see the sketch below).
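To make the decoupling concrete, here is a hypothetical numerical illustration (not the repo's code) of the denominator structure under this reading, where pos_sim[b] stands for $\mathbf{q}[b]\cdot \mathbf{v}[b,\mathbf{p}[b]]$:

```python
import torch

B = 4
pos_sim = torch.randn(B)  # stand-in for q[b] . v[b, p[b]], one scalar per sample

# Softmax along the batch dimension: the denominator for every sample is
# exp(pos_sim[0]) + exp(pos_sim[1]) + ..., i.e. a sum of terms that each
# come from a different (query, video) pair, so no single query or video
# serves as a shared anchor across the terms.
denom = torch.exp(pos_sim).sum()
col_loss = -torch.log(torch.exp(pos_sim) / denom)  # per-sample 'col'-style loss
print(col_loss)
```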
I would appreciate any clarification on this point. Thank you.