wjun0830 / CGDETR

Official PyTorch repository for CG-DETR: "Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding"
https://arxiv.org/abs/2311.08835

Moment-adaptive Saliency Token Generator: Cross-Attention #6

Closed · bess-cater closed this 10 months ago

bess-cater commented 10 months ago

Hello! Thank you for your wonderful work!

I read it with much interest, but while going through the code I noticed one thing that has been bugging me since. On p. 6 of the paper it is stated that the saliency tokens, when input to the ACA layer, "engage in cross-attention exclusively with pure text query tokens"; however, in the code

https://github.com/wjun0830/CGDETR/blob/f65fd4d265cc6c1818b78bcc1df486c77cea3b9c/cg_detr/transformer.py#L164-L175

only the token is passed in, without any text; all the attention maps, keys, and values become 0 (at least according to my terminal output). This code reproduces the results reported in the paper (at least on QVHighlights). I wonder: was the model really trained like this, or am I simply not seeing where the text is merged in?
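For illustration, here is a minimal sketch of what I mean (assumed sizes and a made-up all-zero "text" memory, not the actual repository code): a lone token cross-attending against a zero memory gets flat attention weights, and the output carries no text information at all.

```python
import torch
import torch.nn as nn

# Minimal sketch with assumed sizes (embed_dim=256, num_heads=8) and a
# made-up all-zero "text" memory -- not the repository code.
embed_dim, num_heads = 256, 8
attn = nn.MultiheadAttention(embed_dim, num_heads)

token = torch.randn(1, 1, embed_dim)    # (seq, batch, dim): the lone saliency token
memory = torch.zeros(5, 1, embed_dim)   # stand-in text memory that is all zeros

out, weights = attn(query=token, key=memory, value=memory)
# All projected keys are identical (only the key bias survives), so the
# attention weights are exactly uniform and the output contains no
# information from the memory.
print(weights)     # shape (1, 1, 5), every entry 0.2
print(out.shape)   # torch.Size([1, 1, 256])
```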

Thanks in advance! Liza

wjun0830 commented 10 months ago

Hello Liza! Thanks for your interest in our work.

While proofreading the paper against the implementation, we found a mistake in our writing. As the code shows, the moment-adaptive saliency token is only projected with the query-corresponding parameters of the cross-attention layer. We will revise the manuscript as soon as possible. Thanks for pointing this out, and we are very sorry for the confusion.
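For clarity, here is a minimal sketch of what "projected with the query-corresponding parameters" means (an assumed embed_dim and a plain nn.MultiheadAttention layer, not our exact implementation): only the query-side in-projection of the cross-attention layer is applied to the token.

```python
import torch
import torch.nn as nn

# Minimal sketch with an assumed embed_dim -- not our exact implementation.
# nn.MultiheadAttention packs the Q/K/V projections into in_proj_weight of
# shape (3 * embed_dim, embed_dim); the first embed_dim rows are the
# query projection.
embed_dim = 256
attn = nn.MultiheadAttention(embed_dim, num_heads=8)

w_q = attn.in_proj_weight[:embed_dim]  # query-side weights
b_q = attn.in_proj_bias[:embed_dim]    # query-side bias

saliency_token = torch.randn(1, 1, embed_dim)  # (seq, batch, dim)

# The token only passes through the query projection; no keys/values from
# text tokens are involved, so no actual cross-attention takes place.
projected = nn.functional.linear(saliency_token, w_q, b_q)
print(projected.shape)  # torch.Size([1, 1, 256])
```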

Thank you, Liza!