Cannot reproduce the GPT-4V results

Hi, I am trying to reproduce the GPT4V results reported in the paper. I use the exact same code and only replaced the API key in query_model.py with my own; however, my results are very different from those reported in the paper. My results on the validation set are as follows:

Using model: GPT4V

Task Art_Style Performance val accuracy: 71.79%

Task Functional_Correspondence Performance val accuracy: 32.31%

Task Multi-view_Reasoning Performance val accuracy: 51.13%

Task Relative_Reflectance Performance val accuracy: 3.73%

Task Visual_Correspondence Performance val accuracy: 23.84%

Task Counting Performance val accuracy: 44.17%

Task IQ_Test Performance val accuracy: 28.0%

Task Object_Localization Performance val accuracy: 48.36%

Task Semantic_Correspondence Performance val accuracy: 18.71%

Task Visual_Similarity Performance val accuracy: 55.56%

Task Forensic_Detection Performance val accuracy: 37.12%

Task Jigsaw Performance val accuracy: 46.0%

Task Relative_Depth Performance val accuracy: 62.9%

Task Spatial_Relation Performance val accuracy: 41.96%

For example, I checked the outputs for Relative_Reflectance (3.73% accuracy), and in a large number of cases GPT4V outputs something like “I'm sorry, but I can't assist with requests involving the analysis or comparison of specific points in images. If you have any other questions or need assistance with a different topic, feel free to ask!”.
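To quantify how often this happens, something like the following could be run over the saved responses for a task. This is only a minimal sketch: the function names `is_refusal` and `refusal_rate` and the marker patterns are my own assumptions, not part of the BLINK code, and the patterns are based only on the refusal text quoted above, so they may miss other phrasings.

```python
import re

# Hypothetical refusal markers, taken from the refusal message quoted above.
# They are an assumption and may need extending for other phrasings.
REFUSAL_PATTERNS = [
    r"\bI'?m sorry\b",
    r"\bI can(?:'t|not) assist\b",
]


def is_refusal(response: str) -> bool:
    """Return True if the response looks like a refusal rather than an answer."""
    # Normalize curly apostrophes so "I’m" and "I'm" are treated the same.
    text = response.replace("\u2019", "'")
    return any(re.search(p, text, flags=re.IGNORECASE) for p in REFUSAL_PATTERNS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


if __name__ == "__main__":
    # Illustrative inputs only, not real model outputs from my run.
    sample = [
        "I'm sorry, but I can't assist with requests involving the analysis "
        "or comparison of specific points in images.",
        "(A)",
    ]
    print(f"refusal rate: {refusal_rate(sample):.2%}")
```

If the refusal rate for a task is close to one minus its accuracy, most of the gap for that task would come from refusals rather than wrong answers.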

I am wondering whether the results in the paper were produced directly by this code, or whether I should make any modifications to achieve similar results. Thanks!

zeyofu / BLINK_Benchmark