tag_encoder and text_decoder

Hi, thanks for open-sourcing your great work!

While reading the code, I was confused by the next-token prediction used to compute loss_t2t, and I don't understand why the first four (prompt_length) labels are ignored (set to -100) during training. I then read the inference code hoping to figure this out, but found that neither inference_ram.py nor inference_ram_openset.py uses the tag_encoder or text_decoder during inference, which confused me further. So I would like to ask:

Can you explain the next-token prediction used to compute loss_t2t, and why some labels are set to -100?
Why are tag_encoder and text_decoder not used during inference?
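
To make the first question concrete, here is a minimal sketch of how I currently understand the masking. It is not the repository's code: the function names (`build_labels`, `next_token_loss`) and the toy inputs are hypothetical, and it is written in plain Python so it runs without any dependencies. The only fact I am relying on is that -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labeled -100 contribute nothing to the loss.

```python
import math

# -100 is the default ignore_index of PyTorch's cross-entropy loss;
# targets with this value are skipped when the loss is averaged.
IGNORE_INDEX = -100


def build_labels(input_ids, prompt_length):
    """Copy the token ids and mask the first `prompt_length` positions."""
    labels = list(input_ids)
    for i in range(min(prompt_length, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels


def next_token_loss(logits, labels):
    """Average cross-entropy where logits[t] is scored against labels[t + 1].

    `logits` is a list of per-position score lists (one score per vocab id).
    Positions whose target is IGNORE_INDEX are skipped entirely.
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # prompt tokens: no gradient signal, not counted
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_z - step_logits[target]
        count += 1
    return total / count if count else 0.0


# Toy example (assumed values): 4 prompt tokens followed by 2 caption tokens.
input_ids = [5, 6, 7, 8, 1, 2]
labels = build_labels(input_ids, prompt_length=4)
uniform_logits = [[0.0, 0.0, 0.0] for _ in input_ids]  # vocab of 3, all equal
loss = next_token_loss(uniform_logits, labels)
```

If this reading is right, only the tokens after the prompt are ever used as prediction targets, and the prompt tokens serve purely as context. Please correct me if loss_t2t works differently.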

Thanks in advance!

xinyu1205 / recognize-anything

tag_encoder and text_decoder #191