xinyu1205 / recognize-anything

Open-source and strong foundation image recognition models.
https://recognize-anything.github.io/
Apache License 2.0
2.92k stars 278 forks

tag_encoder and text_decoder #191

Open Stephen-K1 opened 4 months ago

Stephen-K1 commented 4 months ago

Hi, thanks for open-sourcing your great work!

While reading the code, I was confused by the next-token prediction used to compute loss_t2t, and I don't understand why the first four (prompt_length) labels are ignored (set to -100) during training. I started reading the inference code hoping to figure this out, but found that neither inference_ram.py nor inference_ram_openset.py uses tag_encoder or text_decoder during inference, which confused me further. So I would like to ask:

  1. Can you explain the next-token prediction used to compute loss_t2t, and why some labels are set to -100?
  2. Why are tag_encoder and text_decoder not used during inference?

Thanks in advance!

Stephen-K1 commented 4 months ago

Alright, I think I can figure this out by reading inference_tag2text.py. Thanks for sharing the code anyway.
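
For anyone landing here with the first question: -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so label positions set to -100 contribute nothing to the loss. A plausible reading, assuming loss_t2t follows the usual prompt-conditioned decoder pattern (an assumption, not something confirmed in this thread), is that the first prompt_length tokens are a fixed text prompt that the model is given rather than asked to predict. The pure-Python sketch below only illustrates that masking. `mask_prompt`, `next_token_loss`, and the toy token ids are hypothetical, not the repository's code.

```python
import math

IGNORE_INDEX = -100  # same value as PyTorch's default ignore_index


def mask_prompt(input_ids, prompt_length, pad_id):
    """Build labels from input_ids, hiding prompt and padding positions."""
    return [
        IGNORE_INDEX if i < prompt_length or tok == pad_id else tok
        for i, tok in enumerate(input_ids)
    ]


def next_token_loss(logits, labels):
    """Mean cross-entropy where logits[t] is scored against labels[t + 1]."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # masked positions add nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_z - step_logits[target]
        count += 1
    return total / count if count else 0.0


# Toy sequence: 4 prompt tokens, 2 caption tokens, 2 padding tokens (id 0).
input_ids = [1, 2, 3, 4, 5, 3, 0, 0]
labels = mask_prompt(input_ids, prompt_length=4, pad_id=0)

# Uniform logits over a hypothetical vocabulary of 6 tokens.
logits = [[0.0] * 6 for _ in input_ids]
loss = next_token_loss(logits, labels)
```

Only the two caption tokens are scored here, so the loss measures how well the text after the prompt is predicted and is not diluted by prompt or padding positions.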