updated version of Llava inference issue

Thank you for your great work; I appreciate it!

I want to use the new version of Llava (Specifically, llama3-llava-next-8b, which you can download checkpoint here: https://huggingface.co/lmms-lab/llama3-llava-next-8b) to implement in-content learning for my custom tasks. To test its context-learning ability. I replaced the code to support the loading of this model and slightly changed the inference code to generate a full answer.

I use the following code to run inference on clevr dataset: CUDA_VISIBLE_DEVICES=1 python I2T_inference.py --engine llama-llava-8b --n_shot 2 --dataset clevr --task_description detailed

And here is the running log:

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:04<00:00,  1.24s/it]
Loaded model: llama-llava-8b

input Text:
The image contains objects of different shapes, colors, sizes and materials. The question describes the attribute and its value. You need to find all objects within the image that satisfy the condition. You should induce what operation to use according to the results of the in-context examples and then calculate the result.
<image>
material: metal
Answer: 3
<image>
shape: cylinder
Answer: 5
<image>

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
!---Output----!
['\nThe image shows a collection of objects with different shapes, colors, and materials. There are three green objects, one silver object, one purple object, and one blue object. The green objects are cubes, the silver object is a sphere, the purple object is a cylinder, and the blue object is a sphere. The yellow object is a sphere as well. The materials of the objects are not specified, but based on their appearance, the green cubes and the yellow sphere might be made of a smooth, possibly plastic material, while the silver sphere could be made of metal, and the purple and blue spheres could be made of a reflective material like rubber or plastic.<|eot_id|>']

It seems like Llama3-llava does not learn anything knowledge from the context and only "sees" one image. Is there anything wrong with my modified inference code? Or does llama3-llava perform badly in in-context learning?

Here is my modified inference code: Start from utils\model_inference.py Line 39:


elif 'llava' in engine:
        images = []
        input_text = f"{task_instruction}\n"
        for i in range(len(n_shot_support)):
            for image_path in n_shot_support[i]['image']:
                images.append(Image.open(os.path.join(data_path, image_path)).convert("RGB"))
                input_text += f"{DEFAULT_IMAGE_TOKEN}\n"
            input_text += f"{n_shot_support[i]['question']}\nAnswer: {format_answer(n_shot_support[i]['answer'], dataset, query)}\n"

        for query_image in query_images:
            images.append(query_image)
            input_text += f"{DEFAULT_IMAGE_TOKEN}\n"

        print("input Text:")
        print(input_text)
        input_text += f"{query_text}\nAnswer:"
        image_tensor = torch.stack(
                [
                    processor.preprocess(image_file, return_tensors="pt")["pixel_values"][0]
                    for image_file in images
                ]
            )
        image_tensor = image_tensor.half().cuda()
        conv_mode = "llava_llama_3"

        # The original implementation in your code will cause a Nonetype Error in llama3-llava. 
        # So I change it according to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT.md
        conv = copy.deepcopy(conv_templates[conv_mode])
        #conv = conv_templates[conv_mode].copy()

        conv.append_message(conv.roles[0], input_text)
        conv.append_message(conv.roles[1], None)

        prompt = conv.get_prompt()
        input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

        with torch.inference_mode():
            generated_ids = model.generate(
                input_ids,
                images=image_tensor,
                do_sample=False,
                max_new_tokens=1024,
                )
        input_token_len = input_ids.shape[1]

        # It seems like there is no need for truncation. 
        #predicted_answers = tokenizer.batch_decode(generated_ids[:, input_token_len:], skip_special_tokens=True)[0]
        predicted_answers = tokenizer.batch_decode(generated_ids)

ys-zong / VL-ICL

updated version of Llava inference issue #6