xinyu1205 / recognize-anything

Open-source and strong foundation image recognition models.
https://recognize-anything.github.io/
Apache License 2.0
2.74k stars 265 forks

What are the prompts during training? #43

Closed: Qinying-Liu closed 1 year ago

Qinying-Liu commented 1 year ago

Hi, thank you for your excellent work. I am curious about the prompt templates (e.g., 'a photo of a {}') used for the tags during training. Are these templates similar to those used in CLIP? The CLIP prompt templates seem better suited to noun tags than to adjective or verb tags (e.g., 'red' or 'play'). Thanks.

xinyu1205 commented 1 year ago

Please refer to multiple_templates in recognize-anything/models/openset_utils.py. We follow [1].

[1] Zang Y, Li W, Zhou K, et al. Open-vocabulary DETR with conditional matching. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX. Cham: Springer Nature Switzerland, 2022: 106-122.

Qinying-Liu commented 1 year ago

I appreciate your prompt response. I've noticed that the prompt templates in openset_utils.py seem to be designed primarily for noun tags. However, the tag list in ram_tag_list.txt contains some adjective and verb tags that seem incompatible with these templates, resulting in potentially illogical prompts like "a photo of my red." Could you please explain this discrepancy?
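
To make the mismatch concrete, here is a minimal sketch of how noun-oriented templates get filled with tags. The two-entry template list and the helper name build_noun_prompts are illustrative assumptions, not the actual contents of multiple_templates in openset_utils.py.

```python
# Hypothetical, abbreviated template list for illustration only;
# the real multiple_templates in openset_utils.py may differ.
NOUN_TEMPLATES = [
    "a photo of a {}.",
    "a photo of my {}.",
]


def build_noun_prompts(tag, templates=NOUN_TEMPLATES):
    """Fill every template with the given tag, regardless of its part of speech."""
    return [template.format(tag) for template in templates]


if __name__ == "__main__":
    # A noun reads naturally; an adjective or verb does not.
    for tag in ["dog", "red", "play"]:
        print(tag, "->", build_noun_prompts(tag))
```

With a noun such as "dog" the prompts read naturally, while "red" and "play" yield phrases like "a photo of my red." and "a photo of a play.", which is the discrepancy in question.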

xinyu1205 commented 1 year ago

We did not consider this. I think you are right: designing different prompts for tags with different parts of speech is highly likely to improve the performance of the model.
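
One way the suggestion above could look is sketched below. This is only an assumption about a possible design, not something implemented in the repository: the template wording, the POS_TEMPLATES and TAG_POS tables, and the function build_prompts_by_pos are all hypothetical, and the part-of-speech labels are written by hand here rather than produced by a tagger.

```python
# Hypothetical part-of-speech-specific templates; none of these come from the repository.
POS_TEMPLATES = {
    "noun": ["a photo of a {}.", "a photo of my {}."],
    "adjective": ["a photo of something {}.", "a photo of a {} object."],
    "verb": ["a photo of people who {}.", "a photo showing how to {}."],
}

# Hand-labeled parts of speech for a few example tags (an assumption for this sketch).
# A real tag list would need its own labels or an external part-of-speech tagger.
TAG_POS = {"dog": "noun", "red": "adjective", "play": "verb"}


def build_prompts_by_pos(tag, tag_pos=TAG_POS, pos_templates=POS_TEMPLATES):
    """Pick templates by the tag's part of speech; unknown tags fall back to noun templates."""
    pos = tag_pos.get(tag, "noun")
    return [template.format(tag) for template in pos_templates[pos]]


if __name__ == "__main__":
    for tag in ["dog", "red", "play", "bicycle"]:
        print(tag, "->", build_prompts_by_pos(tag))
```

Falling back to noun templates for unlabeled tags keeps the behavior unchanged for most of the tag list, so only tags explicitly marked as adjectives or verbs would receive different prompts. Whether this actually improves recognition would have to be verified experimentally.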