Open sasaadi opened 1 year ago
I also have this confusion.
I'm also curious about this.
Yes. During evaluation, for both question types, we collected a list of candidate labels from the test set. We then used a string similarity function to compare each generated answer against that label list; the label with the highest similarity score is taken as the model's prediction and used to compute accuracy.
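In case it helps later readers, here is a minimal sketch of the matching step described above. The actual `find_most_similar_index()` in test.py may use a different similarity metric and normalization; `difflib.SequenceMatcher` is used here only as an illustrative stand-in, and the label list and examples are made up.

```python
from difflib import SequenceMatcher

def find_most_similar_index(candidates, generated):
    """Index of the candidate answer most similar to the generated text."""
    scores = [
        SequenceMatcher(None, generated.lower(), c.lower()).ratio()
        for c in candidates
    ]
    return max(range(len(candidates)), key=lambda i: scores[i])

# Toy usage: map each generated answer onto the label list, then score accuracy.
labels = ["yes", "no", "pneumonia", "left lung"]              # collected from the test set
examples = [("Yes.", "yes"), ("the left lung", "left lung")]  # (generated, gold) pairs
correct = sum(
    labels[find_most_similar_index(labels, gen)] == gold for gen, gold in examples
)
print(f"accuracy = {correct / len(examples):.2f}")
```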
Hi, thank you for providing the code for fine-tuning the model. To reproduce the results in the paper, I would like to know how you computed accuracy on closed- and open-ended questions in VQA-RAD and Slake.
Can you confirm that, for closed-ended questions, you collect the set of all answers from the "CLOSE"-type questions in both the test and train sets of each dataset and then call find_most_similar_index() as in the test.py script?
And that, for open-ended questions, you collect the set of all answers from the "OPEN"-type questions in both the test and train sets and call find_most_similar_index() in the same way? (A rough sketch of this candidate-set construction is included below.)
Thank you
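For concreteness, here is a rough sketch of the candidate-set construction the question above describes, assuming JSON annotation files with an "answer_type" field of "CLOSE"/"OPEN" and an "answer" field, as in the public VQA-RAD and Slake releases. The exact field names and file paths used by test.py may differ.

```python
import json

def load_answer_set(paths, answer_type):
    """Union of answers for one question type ("CLOSE" or "OPEN") across splits."""
    answers = set()
    for path in paths:
        with open(path) as f:
            for item in json.load(f):
                # Field names are assumptions based on the public annotation
                # formats; adjust them to match the files used by test.py.
                if item.get("answer_type", "").upper() == answer_type:
                    answers.add(str(item["answer"]))
    return sorted(answers)

# Hypothetical file names: build one candidate list per question type
# from both the train and test splits of a dataset.
close_answers = load_answer_set(["train.json", "test.json"], "CLOSE")
open_answers = load_answer_set(["train.json", "test.json"], "OPEN")
```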