Closed RuiFeiHe closed 1 year ago
Thanks for your attention. It refers to the interaction between image features and tag or text features through the cross-attention layers in the image-tag interaction encoder, the image-tag recognition decoder, and the image-text alignment encoder. In these layers the image features serve as key and value. This is the same structure as Figure 2 in [1] and Figure 2 in [2].
[1] Li J, Li D, Xiong C, et al. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation[C]//International Conference on Machine Learning. PMLR, 2022: 12888-12900.
[2] Liu S, Zhang L, Yang X, et al. Query2Label: A simple transformer way to multi-label classification[J]. arXiv preprint arXiv:2107.10834, 2021.
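
For readers who want to see what "image features as key & value" means in practice, here is a minimal sketch of a generic single-head scaled dot-product cross-attention layer in plain Python. It is not the code from this repository. The function names, the toy dimensions, and the random projection matrices are all assumptions made for illustration, and multi-head splitting, residual connections, and layer normalization are left out. In this generic formulation the tag or text features provide the queries, and the layer output is the attention-weighted sum of the value vectors derived from the image features.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, image_feats, w_q, w_k, w_v):
    """Generic single-head cross-attention (illustrative, not the repo's code).

    queries:     tag or text token features, shape (num_queries, dim)
    image_feats: image patch features, shape (num_patches, dim)
    Returns the attended output and the attention weights.
    """
    q = matmul(queries, w_q)        # queries come from tag/text features
    k = matmul(image_feats, w_k)    # keys come from image features
    v = matmul(image_feats, w_v)    # values come from image features
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)         # shape (num_queries, num_patches)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 8
tag_feats = random_matrix(3, dim, rng)    # 3 hypothetical tag/text tokens
image_feats = random_matrix(5, dim, rng)  # 5 hypothetical image patch tokens
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

out, weights = cross_attention(tag_feats, image_feats, w_q, w_k, w_v)
print(len(out), len(out[0]))          # one output vector per tag/text token
print(len(weights), len(weights[0]))  # one weight per (token, image patch) pair
```

In this sketch the attention weights are only an intermediate quantity: what a following layer would receive is `out`, which has one vector per tag or text token. Whether every block in the paper follows this exact form is an assumption here; Figure 2 in [1] and Figure 2 in [2] are the references given above.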
Hi,
Great work! I just have a quick question: what does "Cross Attention" mean in Figure 3 of your paper? Does it mean that the cross-attention maps or values are fed into the later stages?
Thanks!