yyht opened 6 days ago
All evaluations in our paper (including other baselines) are based on the DART-Math framework, and we provide results on the College Math and Olympiad Bench datasets in DART-Math-Eval-Results.zip.
We also used the Qwen2-eval toolkit to evaluate Qwen2-Math-7B-ScaleQuest on College Math. The results are available in Qwen2-Toolkit-Eval-Results.zip, with College Math achieving an accuracy of 46.0.
To summarize:
While different evaluation frameworks yield different results, the relative performance of comparable models remains consistent. We recommend checking whether the correct prompt template was used (note that Qwen2's template differs from Qwen2.5's).
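One common source of divergence between frameworks is how the final answer is extracted from a generation before it is compared against the reference. As a minimal illustration (a hypothetical helper, not the actual extraction code of DART-Math or the Qwen2 toolkit), here is one way to pull out the last `\boxed{...}` span:

```python
def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a generation,
    balancing braces so nested answers like \\boxed{\\frac{1}{2}} survive."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(c)
        i += 1
    return "".join(out)

print(extract_boxed(r"... so the answer is \boxed{\frac{1}{2}}."))  # \frac{1}{2}
```

A framework that instead regex-matches the first `\boxed{}`, or that mishandles nested braces, can score the same generation differently, which is enough to shift a benchmark number by a few points.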
We used the qwen2-boxed template:

```python
(
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{input}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n",
    "{output}",
    "\n\n",
)
```

I ran the evaluation five times, and the results were similar each run.
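For reference, this is how I would expect that `(input_template, output_template, separator)` tuple to be applied to build a zero-shot prompt; `build_prompt` is my own sketch, not the toolkit's exact code:

```python
# qwen2-boxed template: (input_template, output_template, example_separator)
QWEN2_BOXED = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{input}\nPlease reason step by step, and put your "
    "final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n",
    "{output}",
    "\n\n",
)

def build_prompt(question: str) -> str:
    # Fill the {input} slot; {{}} renders as a literal {} after .format().
    # The assistant turn is left open for the model to complete.
    return QWEN2_BOXED[0].format(input=question)

print(build_prompt("What is 1 + 1?"))
```

The `"\n\n"` separator would only come into play for few-shot prompts, where filled-in (input, output) examples are joined ahead of the final question.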
My WeChat ID: uects-thu-htxu. I may be missing something.
I get "User does not exist". My WeChat ID: yyding01
Local inference with the qwen2-math-evaluation toolkit and the Qwen2-Math-7B-ScaleQuest model does not match the official results in the paper, especially on College Math:

| Benchmark | Reported | Reproduced |
| --- | --- | --- |
| College Math | 50.8 | 40.8 |
| OlympiadBench | 38.5 | 35.4 |
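In case it helps others reproduce the local-inference setup, here is a minimal greedy-decoding sketch with Hugging Face transformers using the qwen2-boxed prompt above; the model path and generation settings are my assumptions, not the toolkit's official eval config:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/Qwen2-Math-7B-ScaleQuest"  # hypothetical local checkpoint path

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

question = "What is the sum of the first 10 positive integers?"
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n{question}\nPlease reason step by step, and put your "
    "final answer within \\boxed{}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Note that greedy decoding here is a simplification; the decoding settings the eval toolkit actually uses may differ, and they also affect the final accuracy.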