sky97613 opened this issue 1 year ago
The fixed text encoder encodes each tag separately into a textual label query that carries semantic information, which enables inference to generalize to categories not seen during training. For example, suppose the model is trained only on the tag 'dog'. Because the textual label embeddings of 'puppy' and 'dog' are similar, inference generalizes to the tag 'puppy'.
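A toy illustration of why similar label embeddings generalize. The vectors below are made up for demonstration and are not real CLIP embeddings (which are 512-d or larger); the point is only that a recognition head trained against the 'dog' query also responds to any query nearly parallel to it:

```python
import numpy as np

# Hypothetical 4-d text embeddings standing in for frozen CLIP text features.
label_embeddings = {
    "dog":   np.array([0.9, 0.1, 0.0, 0.2]),
    "puppy": np.array([0.85, 0.15, 0.05, 0.25]),  # nearly parallel to "dog"
    "car":   np.array([0.0, 0.9, 0.4, 0.1]),      # far from "dog"
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Scores learned for "dog" transfer to "puppy" because their label queries
# point in almost the same direction in embedding space, while "car" does not.
sim_puppy = cosine(label_embeddings["dog"], label_embeddings["puppy"])
sim_car = cosine(label_embeddings["dog"], label_embeddings["car"])
```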
Thank you for the answer!
Does the list of tags fed into the CLIP Text Encoder in the figure refer to the tags extracted from the training data (texts)?
Please refer to the function build_openset_label_embedding() in recognize-anything/models/openset_utils.py.
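Conceptually, that function encodes each tag with a frozen text encoder (typically through prompt templates) and stacks the results into a matrix of label queries. Here is a minimal sketch of that idea; `encode_text` is a hypothetical stand-in for the frozen CLIP text encoder, and the template string is an assumption, not the repo's exact prompt list:

```python
import numpy as np

def encode_text(prompt: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a frozen CLIP text encoder.

    Maps a prompt to a deterministic pseudo-random unit vector; the real
    code runs the prompt through CLIP's transformer instead.
    """
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)  # CLIP text features are L2-normalized

def build_label_embedding(tags, templates=("a photo of a {}.",)):
    """Average each tag's embedding over all prompt templates,
    then renormalize, producing one label query per tag."""
    queries = []
    for tag in tags:
        vecs = [encode_text(t.format(tag)) for t in templates]
        q = np.mean(vecs, axis=0)
        queries.append(q / np.linalg.norm(q))
    return np.stack(queries)  # shape: (num_tags, dim)

label_queries = build_label_embedding(["dog", "puppy", "cat"])
```

Because the encoder is frozen, new tags can be added at inference time simply by encoding them and appending rows to this query matrix.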
Figure 3 in the paper: ![image](https://github.com/xinyu1205/recognize-anything/assets/37361632/10c12b82-dedb-4f4c-a934-490db65720b1)
Could you confirm that I understand the overall system architecture (Figure 3) of the paper correctly? As I understand it, the architecture consists of the following components:
1. Image Encoder
2. Image-Tag Recognition Decoder
3. Image-Tag Interaction Encoder and Image-Tag-Text Generation Decoder
4. CLIP Text Encoder
5. Textual Label Queries
I'm reading the paper, but I don't quite understand components #4 and #5. Could you please explain them simply?