yfzhang114 / MME-RealWorld

✨✨ MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

【Evaluation Script】 #2

Closed: Luo-Z13 closed this issue 1 month ago

Luo-Z13 commented 1 month ago

Hello, could you please provide an evaluation script for mme_realworld under the official LLaVA repository (with the circular evaluation strategy)? I created a script based on `model_vqa_loader.py` using the prompt from this repo, but LLaVA 1.5 appears to hallucinate (i.e., it only ever outputs option A). After modifying the prompt it seems to work normally, but my evaluation prompt is then inconsistent with the original one. I also tried VLMEvalKit, but ran into network problems and missing documentation on how to plug in my own model.

If possible, could you provide an evaluation script based on LLaVA 1.5? Thank you!
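For context, the circular strategy I have in mind works roughly like the minimal sketch below (hypothetical helper names, not code from either repo): the options are rotated, the question is asked once per rotation, and a sample only counts as correct if every rotation is answered correctly.

```python
from collections import deque

def rotations(options):
    """Yield every cyclic rotation of the option list, e.g. ABCDE -> BCDEA -> ..."""
    d = deque(options)
    for _ in range(len(options)):
        yield list(d)
        d.rotate(-1)

def circular_eval(question, options, answer_idx, ask_model):
    """ask_model(question, options) is assumed to return the predicted option index."""
    correct_text = options[answer_idx]
    for rotated in rotations(options):
        if ask_model(question, rotated) != rotated.index(correct_text):
            return False  # one failed rotation marks the whole sample as wrong
    return True
```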

yfzhang114 commented 1 month ago

Thank you for raising this. We have indeed observed the hallucination issue as well, but our evaluations were conducted in a relatively fair setting: every model was given the same prompt.

To assist with your evaluation, we have provided a script similar to the one in the LLaVA repository. You can find it at `evaluation/model_vqa_mme_real_world.py`. We hope it proves helpful for your evaluation needs.
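At a high level, the script appends the lettered choices and an instruction to the question, ending with a "The best answer is:" cue. A rough sketch (the option formatting here is an assumption; only the trailing cue mirrors the actual `qs` construction discussed in this thread):

```python
# Sketch of the multiple-choice prompt layout (option formatting is assumed;
# the trailing "The best answer is:" cue mirrors the script's qs construction).
def build_prompt(question, options, instruction):
    choice_prompt = "".join(
        f"\n({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)
    )
    # In the script this is roughly: qs += choice_prompt + self.prompt + '\nThe best answer is:'
    return question + choice_prompt + instruction + "\nThe best answer is:"
```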

Luo-Z13 commented 1 month ago

Thank you for the provided script! It seems the hallucination issue persists 🤔: the model still only outputs option 'A'. Interestingly, when I changed the prompt from `qs += choice_prompt + self.prompt + '\nThe best answer is:'` to `qs += choice_prompt + self.prompt`, the model was able to output other options ('B'–'E'). I will set up LLaVA 1.5 again on my machine to check whether the issue comes from my local code.
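For reference, this is the comparison I ran, plus a small answer-extraction helper I added on my side (the regex parsing is my own workaround, not part of the provided script):

```python
import re

def make_prompt(question_with_choices, instruction, with_suffix=True):
    # with_suffix=True reproduces: qs += choice_prompt + self.prompt + '\nThe best answer is:'
    # with_suffix=False drops the trailing cue, which let LLaVA-1.5 output options other than 'A'
    qs = question_with_choices + instruction
    if with_suffix:
        qs += "\nThe best answer is:"
    return qs

def extract_choice(output, letters="ABCDE"):
    """Pull the first standalone option letter out of a free-form response.

    Deliberately naive: it can over-match an article like 'A' at the start of a sentence.
    """
    m = re.search(rf"\b([{letters}])\b", output.strip())
    return m.group(1) if m else None
```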

yfzhang114 commented 1 month ago

Thank you for your feedback and for testing the script. The model's sensitivity to the prompt is indeed a likely cause: the fact that changing the prompt leads to different outputs reinforces how strongly prompt design can affect model behavior.
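If it helps, one quick way to quantify that sensitivity (a hypothetical diagnostic reusing the `make_prompt`/`extract_choice` sketch above, not part of our evaluation protocol) is to compare how often each letter is predicted under the two prompt variants:

```python
from collections import Counter

def answer_distribution(samples, ask_model, with_suffix):
    """Count predicted letters over (question_with_choices, instruction) pairs."""
    preds = Counter()
    for question_with_choices, instruction in samples:
        prompt = make_prompt(question_with_choices, instruction, with_suffix)
        preds[extract_choice(ask_model(prompt)) or "none"] += 1
    return preds

# A distribution that collapses onto 'A' only when the suffix is present points
# to prompt-induced bias rather than a genuine preference of the model.
```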