I wanted to know why the `prepare_inputs_labels_for_multimodal` function in `llava_arch.py` is designed to raise an exception during pretraining when the `mm_use_im_start_end` option is enabled:
# TODO: image start / end is not implemented here to support pretraining.
if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
    raise NotImplementedError
Shouldn't the function work fine even if there are `<im_start>` and `<im_end>` tokens around the image tokens?
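To illustrate what I mean, here is a toy sketch (not LLaVA's actual code; the token ids and the helper function are invented for illustration) of the behavior I would naively expect: image features get spliced in at the image placeholder, and the surrounding start/end tokens simply pass through like any other text tokens.

```python
# Hypothetical sketch only -- IM_START/IMAGE/IM_END ids are made up,
# and splice_image_features is not a real LLaVA function.
IM_START, IMAGE, IM_END = 101, 102, 103

def splice_image_features(input_ids, image_features):
    """Replace each IMAGE placeholder with that image's per-patch features,
    leaving the surrounding IM_START/IM_END tokens in place."""
    out = []
    img_iter = iter(image_features)
    for tok in input_ids:
        if tok == IMAGE:
            out.extend(next(img_iter))  # splice in per-patch features
        else:
            out.append(tok)  # text tokens (including start/end) pass through
    return out

ids = [1, IM_START, IMAGE, IM_END, 2]
feats = [["p0", "p1", "p2"]]  # one image, three patch features
print(splice_image_features(ids, feats))
# [1, 101, 'p0', 'p1', 'p2', 103, 2]
```

If the splicing works this way, the start/end tokens would seem harmless during pretraining, which is why the `NotImplementedError` surprised me.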