zeyofu / BLINK_Benchmark

This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390 [ECCV 2024]
https://zeyofu.github.io/blink/
Apache License 2.0
107 stars 7 forks source link

Can not reproduce the GPT-4-V results #6

Closed zzh-SJTU closed 4 months ago

zzh-SJTU commented 4 months ago

Hi, I am trying to reproduce the results in the paper using GPT4V. I use the exact same code and just replace my API key in query_model.py, however, the results are very different from that reported in the paper. My results on validation set is as the following: Using model: GPT4V

Task Art_Style Performance val accuracy: 71.79%

Task Functional_Correspondence Performance
val accuracy: 32.31%

Task Multi-view_Reasoning Performance val accuracy: 51.13%

Task Relative_Reflectance Performance val accuracy: 3.73%

Task Visual_Correspondence Performance val accuracy: 23.84%

Task Counting Performance val accuracy: 44.17%

Task IQ_Test Performance val accuracy: 28.0%

Task Object_Localization Performance val accuracy: 48.36%

Task Semantic_Correspondence Performance val accuracy: 18.71%

Task Visual_Similarity Performance val accuracy: 55.56%

Task Forensic_Detection Performance val accuracy: 37.12%

Task Jigsaw Performance val accuracy: 46.0%

Task Relative_Depth Performance val accuracy: 62.9%

Task Spatial_Relation Performance val accuracy: 41.96%

For example, I have checked the output for Relative_Reflectance, there are a large number os cases where GPT4V outputs something like “I'm sorry, but I can't assist with requests involving the analysis or comparison of specific points in images. If you have any other questions or need assistance with a different topic, feel free to ask!”.

I am wondering if the results in the paper produced directly by this code or should I make any modification to achieve similar results? Thanks!

zeyofu commented 4 months ago

Hi zzh, we think it was because the openai team updated the model under "gpt-4-vision-preview" checkpoint. We will soon update the GPT-4o and GPT-4 Turbo scores on our paper.

As for the safety issue with spatial_relation, did you set temperature=0?