Moment-adaptive Saliency Token Generator: Cross-Attention

wjun0830 / CGDETR

Official pytorch repository for CG-DETR "Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding"

Other

115 stars 11 forks source link

Hello, Thank you for your wonderful work!

I read it with much interest, but, reading through the lines of code, noticed one thing that's been bugging me since then. In paper, p.6, it's stated that for Saliency Tokens, when they are input in ACA, they "engage in cross-attention exclusively with pure text query tokens"; however, in code

https://github.com/wjun0830/CGDETR/blob/f65fd4d265cc6c1818b78bcc1df486c77cea3b9c/cg_detr/transformer.py#L164-L175

only token is input without any text; all the attention maps and keys and values become 0 (at least as my terminal output states). This code yields the same results as stated in paper (at least for QVHighlights). I wonder, was the model really trained like this, or is it me not seeing where text is being merged in here?

Thanks in advance! Liza

wjun0830 / CGDETR

Moment-adaptive Saliency Token Generator: Cross-Attention #6