xiaoman-zhang / PMC-VQA

PMC-VQA is a large-scale medical visual question-answering dataset, which contains 227k VQA pairs of 149k images that cover various modalities or diseases.
MIT License

Open-ended or close-ended? #3

Closed · zuwenqiang closed this issue 1 year ago

zuwenqiang commented 1 year ago

Hi, this is great work! However, I'm confused about whether PMC-VQA is an open-ended or close-ended task. The paper states that it is an open-ended task, but the dataset used is a classification dataset, and Figure 3 in the paper also depicts classification-based VQA.

avi-otterai commented 1 year ago

Relevant to the above: How are VQA-RAD/SLAKE evaluated for the blank-trained models? The paper describes an ACC metric to find the closest possible response (from the limited vocabulary of typical VQA datasets). But the code never uses difflib.SequenceMatcher except when comparing to the letters A/B/C/D (presumably for the MCQ variant, not the blank one). Specifically, could the authors please clarify:

  1. Are the ACC metrics on existing VQA benchmarks reported by comparing generated answers to all possible candidate answers in their respective vocabularies?
  2. How is End-of-Sequence determined for TD models, and how many masks are to be generated for TE models - in order to retrieve the model's predicted answer to a given question without access to ground truth?
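For context, here is a minimal sketch of the closest-candidate matching that the ACC description above suggests, using `difflib.SequenceMatcher` over a hypothetical candidate-answer vocabulary; the function name and candidate list are illustrative, not the repository's actual evaluation code:

```python
import difflib

def closest_candidate(generated: str, candidates: list[str]) -> str:
    """Return the candidate answer most similar to the generated text."""
    def similarity(candidate: str) -> float:
        # SequenceMatcher.ratio() is a similarity score in [0, 1]; higher is closer.
        return difflib.SequenceMatcher(
            None, generated.lower().strip(), candidate.lower().strip()
        ).ratio()
    return max(candidates, key=similarity)

# Hypothetical usage with a toy answer vocabulary:
candidates = ["t1", "t2", "flair", "ct", "ultrasound"]
prediction = closest_candidate("t1 image", candidates)  # closest match here is "t1"
```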
xiaoman-zhang commented 1 year ago

> Hi, this is great work! However, I'm confused about whether PMC-VQA is an open-ended or close-ended task. The paper states that it is an open-ended task, but the dataset used is a classification dataset, and Figure 3 in the paper also depicts classification-based VQA.

PMC-VQA can be used for both open-ended and close-ended tasks, as illustrated in Sec. 4.2. When used for the open-ended task, the highlighted answer in Figure 3 is the ground truth.

xiaoman-zhang commented 1 year ago

> Relevant to the above: How are VQA-RAD/SLAKE evaluated for the blank-trained models? The paper describes an ACC metric to find the closest possible response (from the limited vocabulary of typical VQA datasets). But the code never uses difflib.SequenceMatcher except when comparing to the letters A/B/C/D (presumably for the MCQ variant, not the blank one). Specifically, could the authors please clarify:
>
> 1. Are the ACC metrics on existing VQA benchmarks reported by comparing generated answers to all possible candidate answers in their respective vocabularies?
> 2. How is End-of-Sequence determined for TD models, and how many masks are to be generated for TE models - in order to retrieve the model's predicted answer to a given question without access to ground truth?
  1. Yes. Note that if the ground truth answer is "T1", we also consider model outputs such as "T1 image" and "T1 MRI" to be correct.
  2. For the TE model, when the ground truth is not available, we can set a default length and determine where the answer ends by the special token in the output.
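
As a rough illustration of these two points, here is a hedged sketch; the containment check and the function names are assumptions for clarity, not the repository's code:

```python
def is_correct(ground_truth: str, model_output: str) -> bool:
    # Lenient matching as described above: ground truth "T1" also accepts
    # outputs such as "T1 image" or "T1 MRI" (assumed substring check).
    return ground_truth.lower().strip() in model_output.lower().strip()

def trim_at_special_token(token_ids: list[int], special_token_id: int) -> list[int]:
    # TE model without ground truth: generate a default number of masks,
    # then keep only the tokens before the first special token.
    if special_token_id in token_ids:
        return token_ids[: token_ids.index(special_token_id)]
    return token_ids
```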
zuwenqiang commented 1 year ago

Thank you so much for your response. I really appreciate it and will definitely keep following this project.

avi-otterai commented 1 year ago

> 1. Yes. Note that if the ground truth answer is "T1", we also consider model outputs such as "T1 image" and "T1 MRI" to be correct.
>
> 2. For the TE model, when the ground truth is not available, we can set a default length and determine where the answer ends by the special token in the output.

Thank you!

  1. When answering a closed-ended VQA-RAD question, will the set of possible answers (to difflib against) be the set of answers to all open-ended questions, or to all open- and closed-ended questions?

  2. When reporting results on VQA-RAD, is the ground truth considered to be available? If so, will exactly as many masks be provided as there are tokens in the ground truth? Also, is the special token id = 2?

xiaoman-zhang commented 1 year ago
  1. The set of possible answers is just the set of answers to the closed-ended questions.
  2. For the reported results of the TE model on VQA-RAD, the number of masks we provide is indeed the same as the number of tokens in the ground truth. If we provide more masks than the number of tokens, the output will look like 'answer ' followed by the special token. And yes, the special token id is 2.
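
Putting the two answers together, a small sketch of how a TE prediction might be recovered; the special token id of 2 comes from the reply above, while the Hugging Face-style `tokenizer.decode` interface is an assumption, not necessarily what this repository uses:

```python
SPECIAL_TOKEN_ID = 2  # per the reply above; commonly the </s>/EOS id in many tokenizers

def decode_te_prediction(output_ids, tokenizer, num_gt_tokens=None):
    # With ground truth available, exactly as many masks as ground-truth tokens
    # are provided, so keep that many positions; otherwise cut at the first
    # special token in the output.
    if num_gt_tokens is not None:
        output_ids = output_ids[:num_gt_tokens]
    elif SPECIAL_TOKEN_ID in output_ids:
        output_ids = output_ids[: output_ids.index(SPECIAL_TOKEN_ID)]
    return tokenizer.decode(output_ids, skip_special_tokens=True)
```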