sky97613 opened this issue 1 year ago
The fixed text encoder encodes each tag separately into a textual label query that carries semantic information, which enables inference to generalize to categories not seen during training. For example, suppose the model is trained only on the tag 'dog'. Because the textual label embeddings of 'puppy' and 'dog' are similar, inference generalizes to the tag 'puppy'.
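A toy illustration of why similar label embeddings generalize. The vectors below are made up for demonstration and are not real CLIP embeddings (which are 512-d or larger); the point is only that a recognition head trained against the 'dog' query also responds to any query nearly parallel to it:

```python
import numpy as np

# Hypothetical 4-d text embeddings standing in for frozen CLIP text features.
label_embeddings = {
    "dog":   np.array([0.9, 0.1, 0.0, 0.2]),
    "puppy": np.array([0.85, 0.15, 0.05, 0.25]),  # nearly parallel to "dog"
    "car":   np.array([0.0, 0.9, 0.4, 0.1]),      # far from "dog"
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Scores learned for "dog" transfer to "puppy" because their label queries
# point in almost the same direction in embedding space, while "car" does not.
sim_puppy = cosine(label_embeddings["dog"], label_embeddings["puppy"])
sim_car = cosine(label_embeddings["dog"], label_embeddings["car"])
```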
Thank you for the answer!
Does the list of tags fed into the CLIP Text Encoder in the figure refer to the tags extracted from the training data (texts)?
Please refer to the function build_openset_label_embedding() in recognize-anything/models/openset_utils.py.
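Conceptually, that function encodes each tag with a frozen text encoder (typically through prompt templates) and stacks the results into a matrix of label queries. Here is a minimal sketch of that idea; `encode_text` is a hypothetical stand-in for the frozen CLIP text encoder, and the template string is an assumption, not the repo's exact prompt list:

```python
import numpy as np

def encode_text(prompt: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a frozen CLIP text encoder.

    Maps a prompt to a deterministic pseudo-random unit vector; the real
    code runs the prompt through CLIP's transformer instead.
    """
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)  # CLIP text features are L2-normalized

def build_label_embedding(tags, templates=("a photo of a {}.",)):
    """Average each tag's embedding over all prompt templates,
    then renormalize, producing one label query per tag."""
    queries = []
    for tag in tags:
        vecs = [encode_text(t.format(tag)) for t in templates]
        q = np.mean(vecs, axis=0)
        queries.append(q / np.linalg.norm(q))
    return np.stack(queries)  # shape: (num_tags, dim)

label_queries = build_label_embedding(["dog", "puppy", "cat"])
```

Because the encoder is frozen, new tags can be added at inference time simply by encoding them and appending rows to this query matrix.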
Figure 3 in the paper: ![image](https://github.com/xinyu1205/recognize-anything/assets/37361632/10c12b82-dedb-4f4c-a934-490db65720b1)
Could you confirm that I understand the overall system architecture (Figure 3) of the paper correctly? As I understand it, the architecture consists of the following components:
1. Image Encoder
2. Image-Tag Recognition Decoder
3. Image-Tag Interaction Encoder and Image-Tag-Text Generation Decoder
4. CLIP Text Encoder
5. Textual Label Queries
I'm reading the paper, but I don't quite understand components #4 and #5. Could you please explain them simply?