Closed basujindal closed 1 year ago
Sorry for the confusion. The multiple-choice and open-ended tasks are evaluated differently. For the multiple-choice task, the output is a single option letter (A, B, C, or D), which we compare directly with the correct choice (also A, B, C, or D). For the open-ended task, the output is a few words, so the prediction code for that task is the one at lines 155-161 of test_slake.py, not ``pred = generated_texts[i][-1]''.
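To make the distinction concrete, here is a minimal sketch of the two scoring styles. It is not the code from test.py or test_slake.py: the function names, the normalization step, and the exact-match rule for open-ended answers are all assumptions for illustration, and the repository may score open-ended answers differently.

```python
def normalize(text: str) -> str:
    # Hypothetical normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def score_multiple_choice(generated_text: str, answer_letter: str) -> bool:
    # Multiple-choice: assume the output ends with the chosen option letter,
    # so only the last non-whitespace character is compared with the label.
    stripped = generated_text.strip()
    if not stripped:
        return False
    return stripped[-1].upper() == answer_letter.strip().upper()


def score_open_ended(generated_text: str, answer_text: str) -> bool:
    # Open-ended: compare the whole (normalized) answer, not its last character.
    # Exact match is an assumption here; the actual metric may be more lenient.
    return normalize(generated_text) == normalize(answer_text)
```

With these definitions, applying the last-character rule to an open-ended answer such as "CT Scan" would compare only "n" against the label, which is why the two tasks need separate prediction code.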
In the test.py file line 138:
As I understand it, pred = generated_texts[i][-1] takes the last token generated (usually A, B, C, or D), which is then compared with the ground truth (a few words long, such as MRI, CT Scan, or None of the above).
Could the authors please clarify whether that is indeed the case, and whether it would be a fair comparison? Thank you.