I wanted to know why the `prepare_inputs_labels_for_multimodal` function in `llava_arch.py` is designed to raise an exception during pretraining when the `mm_use_im_start_end` option is enabled:
# TODO: image start / end is not implemented here to support pretraining.
if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
    raise NotImplementedError
Shouldn't the function work fine even if there are `<im_start>` and `<im_end>` tokens around the image tokens?
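To illustrate what I mean, here is a toy sketch (not LLaVA's actual code; the token ids and the helper function are invented for illustration) of the behavior I would naively expect: image features get spliced in at the image placeholder, and the surrounding start/end tokens simply pass through like any other text tokens.

```python
# Hypothetical sketch only -- IM_START/IMAGE/IM_END ids are made up,
# and splice_image_features is not a real LLaVA function.
IM_START, IMAGE, IM_END = 101, 102, 103

def splice_image_features(input_ids, image_features):
    """Replace each IMAGE placeholder with that image's per-patch features,
    leaving the surrounding IM_START/IM_END tokens in place."""
    out = []
    img_iter = iter(image_features)
    for tok in input_ids:
        if tok == IMAGE:
            out.extend(next(img_iter))  # splice in per-patch features
        else:
            out.append(tok)  # text tokens (including start/end) pass through
    return out

ids = [1, IM_START, IMAGE, IM_END, 2]
feats = [["p0", "p1", "p2"]]  # one image, three patch features
print(splice_image_features(ids, feats))
# [1, 101, 'p0', 'p1', 'p2', 103, 2]
```

If the splicing works this way, the start/end tokens would seem harmless during pretraining, which is why the `NotImplementedError` surprised me.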