ys-zong / VL-ICL

Code for paper: VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning
https://ys-zong.github.io/VL-ICL/

LLaVA predictions are always empty #1

Closed icrto closed 7 months ago

icrto commented 7 months ago

Thanks for the awesome paper and repo!

I was trying out LLaVA and noticed that the model's predictions were always empty strings. I was able to narrow it down to this line: https://github.com/ys-zong/VL-ICL/blob/6ad043d625f8205b75f57e720441cfc98d0fd5a1/utils/model_inference.py#L76.

It seems LLaVA already outputs only the tokens it generated, and not the whole context tokens + generated tokens.

The fix is quite easy: just decode the whole generated sequence, without first truncating it with input_token_len.

Is my thinking correct or am I missing something?
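
To make the failure mode concrete, here is a minimal sketch (a hypothetical helper, not code from the repo) of why slicing the generated ids by the input length empties the prediction when `generate()` already returns only newly generated tokens. The toy `decode` function stands in for the real tokenizer:

```python
def decode_prediction(decode, input_ids, output_ids, outputs_include_prompt):
    """Decode generated token ids, optionally stripping an echoed prompt.

    decode: function mapping a list of token ids to a string.
    outputs_include_prompt: True for model versions whose generate() echoes
    the prompt tokens before the answer; False when it returns only new tokens.
    """
    if outputs_include_prompt:
        output_ids = output_ids[len(input_ids):]  # drop the echoed prompt
    return decode(output_ids).strip()


# Toy vocabulary decode for demonstration.
vocab = {1: "what", 2: "is", 3: "this", 4: "a", 5: "cat"}
decode = lambda ids: " ".join(vocab[i] for i in ids)

prompt = [1, 2, 3]          # "what is this"
new_only = [4, 5]           # generate() returned just the answer tokens
echoed = prompt + new_only  # generate() echoed the prompt first

print(decode_prediction(decode, prompt, echoed, True))     # "a cat"
print(decode_prediction(decode, prompt, new_only, False))  # "a cat"

# The bug: slicing when the output contains no prompt gives an empty string.
print(repr(decode_prediction(decode, prompt, new_only, True)))  # "''"
```

Whether slicing is needed thus depends on the return convention of the installed LLaVA version's `generate()`, which would explain why the original code worked for one version and not another.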

ys-zong commented 7 months ago

Hi, thanks for your interest. I think that's due to different versions of LLaVA: the older version I used when developing this repo seems to need this slicing (I'll need to double-check it). But if decoding the whole generated sequence works for you, it's totally fine to skip the truncation.

icrto commented 7 months ago

Ah, I see. Thanks!