Closed: bess-cater closed this 10 months ago
Hello Liza! Thanks for your interest in our work.
While cross-checking the paper against the implementation, we found a mistake in our writing. As the code shows, the moment-adaptive saliency token is only projected with the query-projection parameters of the cross-attention layer; it does not actually attend over the text query tokens. We will revise the manuscript as soon as possible. Thanks for the notice, and we are very sorry for the confusion.
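Concretely, it amounts to something like the following minimal PyTorch sketch (names such as `cross_attn` and `saliency_token` are illustrative here, not the exact identifiers in `cg_detr/transformer.py`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, nhead = 256, 8
cross_attn = nn.MultiheadAttention(d_model, nhead)  # stand-in for the ACA cross-attention layer

# nn.MultiheadAttention stacks the Q, K, V projections in in_proj_weight;
# the first d_model rows (and bias entries) form the query projection.
W_q = cross_attn.in_proj_weight[:d_model]
b_q = cross_attn.in_proj_bias[:d_model]

saliency_token = torch.randn(1, 1, d_model)  # (seq, batch, dim)

# The saliency token only passes through the query projection; no keys,
# values, or attention weights over the text tokens are computed for it.
token_out = F.linear(saliency_token, W_q, b_q)
```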
Thank you, Liza!
Hello! Thank you for your wonderful work!
I read it with great interest, but while going through the code I noticed one thing that has been bugging me ever since. The paper (p. 6) states that the saliency tokens, when input to the ACA, "engage in cross-attention exclusively with pure text query tokens"; however, in the code at
https://github.com/wjun0830/CGDETR/blob/f65fd4d265cc6c1818b78bcc1df486c77cea3b9c/cg_detr/transformer.py#L164-L175
only the token is passed in, without any text: all the attention maps, keys, and values become 0 (at least according to my terminal output). This code reproduces the results reported in the paper (at least for QVHighlights). I wonder, was the model really trained like this, or am I just not seeing where the text is merged in?
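For reference, the kind of check I ran looks roughly like this (a simplified sketch; the module path and names are approximate, not the repo's exact identifiers):

```python
import torch

def inspect(name):
    def hook(module, inputs, output):
        # print the shape and max magnitude of every tensor entering the module
        for i, t in enumerate(inputs):
            if torch.is_tensor(t):
                print(f"{name} input[{i}]: shape={tuple(t.shape)}, max|x|={t.abs().max().item():.4g}")
    return hook

# Attached to the attention module of the layer linked above before one
# forward pass, e.g. (path is approximate):
# model.transformer.t2v_encoder.layers[0].self_attn.register_forward_hook(inspect("aca"))
```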
Thanks in advance! Liza