SorasakiHiina opened 3 months ago
Hi, thanks for your interest. Your code looks correct but the log looks a bit suspicious:
input Text:
<image> # shot 0
material: metal
Answer: 3
<image> # shot 1
shape: cylinder
Answer: 5
<image> # query
In the log, the query does not have a text input, unlike the support-set examples (e.g. "material: metal"). Can you add print(input_text) right after the line input_text += f"{query_text}\nAnswer:" and check again?
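To be concrete, here is a minimal sketch of the prompt assembly I have in mind (the names support_set and query_text are placeholders, not necessarily what I2T_inference.py uses; the example values are taken from the log above):

# example values copied from the log above, for illustration only
support_set = [("material: metal", "3"), ("shape: cylinder", "5")]
query_text = "material: metal"

# build the few-shot prompt: each support example contributes an <image>
# placeholder, its attribute text, and its answer; the query image comes last
input_text = ""
for shot_text, shot_answer in support_set:
    input_text += f"<image>\n{shot_text}\nAnswer: {shot_answer}\n"
input_text += "<image>\n"
input_text += f"{query_text}\nAnswer:"

# the printed prompt should end with the query's attribute text,
# e.g. "material: metal", followed by a bare "Answer:" with nothing after it
print(input_text)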
Also, it might just be that the model didn't learn from the 2-shot support set. You could try a larger number of shots to see if that makes a difference.
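For example, something along these lines (the --n_shot value here is only illustrative; the other flags mirror the command you posted):

CUDA_VISIBLE_DEVICES=1 python I2T_inference.py --engine llama-llava-8b --n_shot 8 --dataset clevr --task_description detailed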
Thank you for your great work; I appreciate it!
I want to use a newer version of LLaVA (specifically llama3-llava-next-8b, whose checkpoint can be downloaded here: https://huggingface.co/lmms-lab/llama3-llava-next-8b) for in-context learning on my custom tasks. To test its in-context learning ability, I replaced the code to support loading this model and slightly changed the inference code so it generates a full answer.
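Roughly speaking, the loading change looks like the sketch below (this assumes the LLaVA-NeXT codebase's load_pretrained_model builder; my actual modification may differ slightly):

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "lmms-lab/llama3-llava-next-8b"
model_name = get_model_name_from_path(model_path)  # e.g. "llama3-llava-next-8b"

# returns the tokenizer, model, image processor and maximum context length
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, model_name
)
model.eval()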
I use the following command to run inference on the CLEVR dataset:
CUDA_VISIBLE_DEVICES=1 python I2T_inference.py --engine llama-llava-8b --n_shot 2 --dataset clevr --task_description detailed
And here is the running log:
It seems like llama3-llava does not learn anything from the context and only "sees" one image. Is there anything wrong with my modified inference code, or does llama3-llava simply perform badly at in-context learning?
Here is my modified inference code, starting from utils\model_inference.py line 39: