yingweima2022 / CodeLLM


Some questions about the paper. #3

Open handsomelys opened 2 months ago

handsomelys commented 2 months ago

ScienceQA was evaluated in your experiments. As I understand, ScienceQA is a benchmark associated with multi-modal tasks, whereas your model operates purely within the realm of text. Could you please elaborate on how this particular aspect of evaluation was conducted? I would greatly appreciate your response.

handsomelys commented 2 months ago

The paper is "At Which Training Stage Does Code Data Help LLMs Reasoning?"

yingweima2022 commented 2 months ago

Thank you for your interest in our work.

To clarify, although ScienceQA is a benchmark that is typically associated with multi-modal tasks, our approach focuses on text-based reasoning. We selected this dataset because science tasks often require domain-specific knowledge and explicit multi-hop reasoning, which aligns well with the capabilities of our model.

As mentioned in Section 5.1 of the paper (Experimental Setup) [1], the heuristics and VQA baselines treat the ScienceQA task as a multi-class classification problem with multiple options, and they are evaluated using accuracy metrics. However, for models like UnifiedQA and GPT-3, including our approach, ScienceQA is treated as a text generation problem. We followed the same methodology, treating the task as a text generation challenge, and evaluated our model accordingly.
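For concreteness, here is a minimal sketch of how a free-form generation can be mapped back to one of the multiple-choice options and scored with accuracy. This is not the paper's actual evaluation script; the function names and the simple substring-matching rule are illustrative assumptions.

```python
# Illustrative sketch: scoring free-form generations against ScienceQA options.
# The matching heuristic and helper names are assumptions, not the paper's code.

def extract_choice(generation: str, options: list[str]) -> int:
    """Map a generated answer string to an option index (-1 if no option matches)."""
    text = generation.strip().lower()
    for idx, option in enumerate(options):
        if option.strip().lower() in text:
            return idx
    return -1

def accuracy(predictions: list[str], options_list: list[list[str]], labels: list[int]) -> float:
    """Fraction of examples whose extracted choice equals the gold label."""
    correct = sum(
        extract_choice(pred, opts) == gold
        for pred, opts, gold in zip(predictions, options_list, labels)
    )
    return correct / len(labels)

# Example:
# accuracy(["The answer is the Atlantic Ocean."],
#          [["Pacific Ocean", "Atlantic Ocean"]],
#          [1])  # -> 1.0
```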

I hope this clarifies the evaluation process.

[1] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
handsomelys commented 2 months ago

My understanding is that your ScienceQA model overlooks the image components within the original benchmark, concentrating exclusively on the textual data within the questions to generate its answers. Is this understanding correct?

yingweima2022 commented 2 months ago

Hi, the ScienceQA input includes a Question, Context/Images, and Multiple options. We follow the original approach and use QCM (Question, Context, Multiple options) as the input, where the Context is the textual description of the image. See Section 5.1 of the ScienceQA paper.

https://lupantech.github.io/papers/neurips22_scienceqa.pdf
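For illustration, below is a minimal sketch of how a text-only QCM prompt might be assembled from a ScienceQA record, with the image replaced by its textual context/caption. The field names and prompt wording are assumptions for illustration, not the exact format used in the paper.

```python
# Illustrative sketch of assembling a text-only QCM (Question, Context, Multiple options)
# prompt from one ScienceQA example. Field names are assumptions about the released
# JSON format; the image itself is never used, only its textual description.

def build_qcm_prompt(example: dict) -> str:
    question = example["question"]
    # The Context stands in for the textual hint / image description,
    # so a text-only model can be evaluated on the benchmark.
    context = example.get("hint", "") or example.get("caption", "")
    option_lines = "\n".join(
        f"({chr(ord('A') + i)}) {choice}" for i, choice in enumerate(example["choices"])
    )
    return (
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Options:\n{option_lines}\n"
        f"Answer:"
    )

# Example:
# build_qcm_prompt({"question": "Which ocean is highlighted?",
#                   "hint": "The map shows the continents and oceans.",
#                   "choices": ["the Pacific Ocean", "the Atlantic Ocean"]})
```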

handsomelys commented 2 months ago

Thanks!