Hi, it is a great work! However, I'm confused about whether PMC-VQA is an open-ended or closed-ended task. The paper states that it is an open-ended task, but the dataset used is a classification dataset, and Figure 3 in the paper also depicts a classification-based VQA.
PMC-VQA can be used for both open-ended and closed-ended tasks, as illustrated in Sec. 4.2. When used for the open-ended task, the highlighted answer in Figure 3 is the ground truth.
Relevant to the above: How are VQA-RAD/SLAKE evaluated for the blank-trained models? The paper describes an ACC metric that finds the closest possible response (from the limited vocabulary of typical VQA datasets), but the code never uses difflib.SequenceMatcher except when comparing to the letters A/B/C/D (presumably for the MCQ variant, not the blank one). Specifically, could the authors please clarify the following (a rough sketch of the matching I have in mind follows the questions below):
- Are the ACC metrics on existing VQA benchmarks reported by comparing generated answers to all possible candidate answers in their respective vocabularies?
- How is the End-of-Sequence determined for TD models, and how many masks should be generated for TE models, in order to retrieve the model's predicted answer to a given question without access to the ground truth?
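For context, here is a minimal sketch of the closest-candidate matching I understood the paper's ACC metric to describe, using difflib.SequenceMatcher; the helper name and the candidate vocabulary are my own assumptions for illustration, not the repository's code:

```python
import difflib

def closest_candidate_correct(prediction: str, ground_truth: str, candidates: list[str]) -> bool:
    """Map a free-form prediction to the most similar candidate answer and
    count it as correct if that candidate equals the ground truth.
    (Illustrative sketch only, not the repository's implementation.)"""
    def similarity(a: str, b: str) -> float:
        return difflib.SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    best = max(candidates, key=lambda c: similarity(prediction, c))
    return best.lower().strip() == ground_truth.lower().strip()

# Hypothetical candidate vocabulary drawn from the dataset's answer set
candidates = ["T1", "T2", "FLAIR", "CT"]
print(closest_candidate_correct("T1 image", "T1", candidates))  # True
```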
Thank you so much for your response. I really appreciate it and will definitely keep following this project.
Yes. Note that if the ground truth answer is "T1", we consider model outputs like "T1 image" and "T1 MRI " to be correct.
For the TE model, when the ground truth is not available, we can set a default length and determine where the answer ends by the special token in the output.
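A minimal sketch of what this kind of lenient check and TE answer extraction could look like is below; the function names, the default mask count, and the placeholder special-token id are assumptions for illustration, not necessarily the actual implementation:

```python
def lenient_match(prediction: str, ground_truth: str) -> bool:
    """Count the prediction as correct if it contains the ground-truth answer,
    so "T1 image" and "T1 MRI " both match the ground truth "T1".
    (Illustrative sketch only.)"""
    return ground_truth.lower().strip() in prediction.lower().strip()

def extract_te_answer(output_ids: list[int], end_token_id: int, default_len: int = 20) -> list[int]:
    """For a TE model without ground truth: generate a default number of mask
    positions (default_len is an assumed value) and cut the predicted answer
    at the first special end token found in the output."""
    answer = output_ids[:default_len]
    if end_token_id in answer:
        answer = answer[:answer.index(end_token_id)]
    return answer

print(lenient_match("T1 MRI ", "T1"))                           # True
print(extract_te_answer([312, 87, 99, 0, 0], end_token_id=99))  # [312, 87]; 99 is a placeholder id
```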
Thank you!
When answering a closed-ended VQA-RAD question, will the set of possible answers (to run difflib against) be drawn from all open-ended questions, or from all open- and closed-ended questions?
When reporting results on VQA-RAD, is the ground truth considered to be available? If so, will exactly as many masks be provided as there are tokens in the ground truth? Also, is the special token id = 2?