zeyofu / BLINK_Benchmark

This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390 [ECCV 2024]
https://zeyofu.github.io/blink/
Apache License 2.0

About evaluation result. #1

Closed — ApolloRay closed this issue 4 months ago

ApolloRay commented 4 months ago

Can you provide the original output (before GPT-3.5) for llava_v1.6_34b? For the multi-view_reasoning task, my prediction results are almost all (A). [Screenshot 2024-04-23 20:59:05]

ApolloRay commented 4 months ago

[Screenshot 2024-04-23 21:00:27] Same for the visual_correspondence test.

zeyofu commented 4 months ago

Hi, the "full prediction" means the raw model output, and "prediction" means the output after choice extraction (GPT-3.5).
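To illustrate the raw-output-to-letter mapping being discussed, here is a minimal regex-based sketch of choice extraction. This is a hypothetical fallback for illustration only, not BLINK's actual GPT-3.5-based extraction step:

```python
import re

def extract_choice(full_prediction: str, choices=("A", "B", "C", "D")):
    """Best-effort extraction of a multiple-choice letter from raw model text.

    Hypothetical sketch: BLINK actually uses GPT-3.5 for this step; a regex
    only approximates it for well-formed answers.
    """
    # Match patterns like "(A)", "A.", "A)", or "... answer is A".
    m = re.search(r"\(([A-D])\)|\b([A-D])[).]|answer is\s+([A-D])\b",
                  full_prediction)
    if m:
        # Return whichever alternative actually captured a letter.
        return next(g for g in m.groups() if g)
    # If the output is just a bare letter (as with llava_v1.6_34b), accept it.
    stripped = full_prediction.strip()
    if stripped in choices:
        return stripped
    return None
```

This also shows why both output styles in the thread (full sentences vs. a bare A/B/C/D) can map to the same final "prediction".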

ApolloRay commented 4 months ago

> Hi, the "full prediction" means raw model output, and "prediction" means the output after choice extraction (GPT3.5)

But I got totally different results. LLaVA offers an online demo; can you use the same prompt and the same concatenated image as input there and reproduce the same full-prediction result? Maybe you could try three images and three prompts in that demo.

zeyofu commented 4 months ago

Hi, it seems our settings are different. We use temperature=0 and run LLaVA 1.6 locally. Can you try this setting?
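For reproducing numbers across runs, the key point above is deterministic (greedy) decoding. A minimal sketch of how that setting might look with a Hugging Face-style `generate` call (the helper and kwargs are assumptions, not BLINK's actual harness):

```python
# Hypothetical sketch of deterministic decoding settings for a local LLaVA run.
# With do_sample=False, generation is greedy, which matches temperature=0:
# temperature/top_p are simply not used, so repeated runs agree.
GEN_KWARGS = dict(
    do_sample=False,     # greedy decoding, no sampling randomness
    num_beams=1,         # plain greedy search, no beam search
    max_new_tokens=256,  # cap on generated length (illustrative value)
)

def generate_deterministic(model, inputs):
    """Run model.generate with greedy decoding so outputs are reproducible."""
    return model.generate(**inputs, **GEN_KWARGS)
```

The online demo may sample with a nonzero temperature, which would explain getting different full predictions from the same prompt and image.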

ApolloRay commented 4 months ago

[Screenshot 2024-05-06 14:58:48] Tested the difference.

ApolloRay commented 4 months ago

In the saved outputs, the llava v1.5 13b full predictions were full sentences, but the llava v1.6 34b full predictions were just A/B/C/D?