Hi zzh, we think it was because the OpenAI team updated the model behind the "gpt-4-vision-preview" checkpoint. We will soon update the GPT-4o and GPT-4 Turbo scores in our paper.
As for the safety issue with spatial_relation, did you set temperature=0?
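For reference, here is a minimal sketch of how temperature=0 can be passed to the chat completions call. This assumes query_model.py uses the official openai Python client; the function name and defaults below are placeholders, not the exact code in the repo.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_gpt4v(prompt, image_url, model="gpt-4-vision-preview"):
    # temperature=0 pins the model to greedy decoding, which reduces
    # run-to-run variance when reproducing benchmark numbers.
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```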
Hi, I am trying to reproduce the results in the paper using GPT-4V. I use the exact same code and just replace my API key in query_model.py; however, the results are very different from those reported in the paper. My results on the validation set (using model: GPT4V) are as follows:
Task Art_Style: val accuracy 71.79%
Task Functional_Correspondence: val accuracy 32.31%
Task Multi-view_Reasoning: val accuracy 51.13%
Task Relative_Reflectance: val accuracy 3.73%
Task Visual_Correspondence: val accuracy 23.84%
Task Counting: val accuracy 44.17%
Task IQ_Test: val accuracy 28.0%
Task Object_Localization: val accuracy 48.36%
Task Semantic_Correspondence: val accuracy 18.71%
Task Visual_Similarity: val accuracy 55.56%
Task Forensic_Detection: val accuracy 37.12%
Task Jigsaw: val accuracy 46.0%
Task Relative_Depth: val accuracy 62.9%
Task Spatial_Relation: val accuracy 41.96%
For example, I have checked the outputs for Relative_Reflectance, and there are a large number of cases where GPT-4V outputs something like: "I'm sorry, but I can't assist with requests involving the analysis or comparison of specific points in images. If you have any other questions or need assistance with a different topic, feel free to ask!"
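In case it helps, here is a rough sketch of how such refusal outputs could be detected and re-queried before scoring. The refusal phrases and the query_fn wrapper are my own assumptions, not part of the official evaluation code.

```python
import re

# Phrases that typically mark a safety refusal rather than an answer
# (this list is an assumption; extend it based on the outputs you observe).
REFUSAL_PATTERNS = [
    r"i'?m sorry,? but i can'?t assist",
    r"i cannot assist with",
    r"can'?t help with requests involving",
]

def is_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(re.search(p, lowered) for p in REFUSAL_PATTERNS)

def query_with_retries(query_fn, prompt, image_url, max_retries=3):
    """Re-issue the query when the model refuses.

    query_fn is a hypothetical wrapper around the API call that takes
    (prompt, image_url) and returns the model's text response.
    """
    answer = query_fn(prompt, image_url)
    for _ in range(max_retries):
        if not is_refusal(answer):
            break
        answer = query_fn(prompt, image_url)
    return answer
```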
I am wondering whether the results in the paper were produced directly by this code, or whether I should make any modifications to achieve similar results. Thanks!