xiaoman-zhang / PMC-VQA

PMC-VQA is a large-scale medical visual question-answering dataset, which contains 227k VQA pairs over 149k images covering various modalities and diseases.

Clarification regarding model evaluation #9

Closed basujindal closed 1 year ago

basujindal commented 1 year ago

In the test.py file, line 138:

I understand that `pred = generated_texts[i][-1]` essentially takes the last token generated (which is usually A, B, C, or D) and compares it with the ground truth (which is a few words long, e.g. MRI, CT Scan, None of the above).
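To illustrate the concern, here is a minimal sketch (hypothetical values, not the repo's actual code) of why comparing the last generated character against a multi-word label would rarely match:

```python
# Hypothetical example, not taken from test.py:
generated_texts = ["The answer is B"]
ground_truth = ["MRI"]  # open-ended style label, a few words long

pred = generated_texts[0][-1]        # takes only the last character: "B"
correct = (pred == ground_truth[0])  # "B" vs "MRI" can essentially never match
print(pred, correct)                 # B False
```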

Can the authors please clarify whether that is indeed the case, and whether that would be a fair comparison? Thank you.

xiaoman-zhang commented 1 year ago

Sorry for the confusion. The evaluation of the multiple-choice task and the open-ended task are different. For the multiple-choice task, the output will be one of (A, B, C, D), which we compare directly to the ground-truth choice (also A, B, C, D). For the open-ended task, the output will be a few words, so the prediction code for the open-ended task is the one in lines 155-161 of test_slake.py, not `pred = generated_texts[i][-1]`.
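A minimal sketch of the two evaluation paths as described above; the open-ended matching shown here is only a placeholder (a simple word-overlap check), not the actual logic in lines 155-161 of test_slake.py:

```python
# Multiple-choice: both prediction and ground truth are a single option letter.
def eval_multiple_choice(pred_letter: str, gt_letter: str) -> bool:
    return pred_letter.strip().upper() == gt_letter.strip().upper()

# Open-ended: prediction is a few words; this word-overlap check is a
# stand-in for the real matching code in test_slake.py.
def eval_open_ended(pred_text: str, gt_text: str) -> bool:
    pred_words = set(pred_text.lower().split())
    gt_words = set(gt_text.lower().split())
    return len(pred_words & gt_words) > 0

print(eval_multiple_choice("B", "B"))         # True
print(eval_open_ended("an MRI scan", "MRI"))  # True
```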