Question about Figure 3

Thanks for your attention. It refers to the image features interacting with the tag features or text features through the cross-attention layers in the image-tag interaction encoder, the image-tag recognition decoder, and the image-text alignment encoder (with the image features serving as key and value). This is the same structure as Figure 2 in [1] and Figure 2 in [2].
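To make the key/value role concrete, here is a minimal sketch of single-head cross-attention in plain Python, where the tag (or text) features act as queries and the image features act as both keys and values. It is not taken from the repository: the function names, dimensions, and random inputs are illustrative assumptions, and the learned query/key/value projections, multi-head splitting, and normalization layers of a real transformer block are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, image_feats):
    """Single-head scaled dot-product cross-attention (hypothetical helper).

    queries:     tag or text features, one vector per token (n_q x d)
    image_feats: image features, one vector per patch (n_img x d),
                 used as both key and value
    Returns the attended outputs (n_q x d) and the attention weights (n_q x n_img).
    """
    dim = len(image_feats[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every image patch (the keys).
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in image_feats
        ]
        attn = softmax(scores)
        # Weighted sum of the image patches (the values).
        out = [
            sum(w * v[j] for w, v in zip(attn, image_feats))
            for j in range(dim)
        ]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights


# Illustrative sizes only: 3 tag queries, 5 image patches, feature width 4.
random.seed(0)
dim = 4
tag_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
image_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

attended, attn_weights = cross_attention(tag_feats, image_feats)
print(len(attended), len(attended[0]))          # one output vector per tag query
print(len(attn_weights), len(attn_weights[0]))  # one weight per image patch
```

Each output row is a mixture of image patch features, weighted by how strongly the corresponding tag or text query matches each patch, which is the sense in which the image features "interact" with the tag or text features.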

[1] Li J, Li D, Xiong C, et al. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation[C]//International Conference on Machine Learning. PMLR, 2022: 12888-12900.

[2] Liu S, Zhang L, Yang X, et al. Query2Label: A simple transformer way to multi-label classification[J]. arXiv preprint arXiv:2107.10834, 2021.

xinyu1205 / recognize-anything

question about Figure3 #6