yyht opened 6 days ago
All evaluations in our paper (including other baselines) are based on the DART-Math framework, and we provide results on the College Math and Olympiad Bench datasets in DART-Math-Eval-Results.zip.
We also used the Qwen2-eval toolkit to evaluate Qwen2-Math-7B-ScaleQuest on College Math. The results are available in Qwen2-Toolkit-Eval-Results.zip, with College Math achieving an accuracy of 46.0.
To summarize:
While different evaluation frameworks yield different results, the relative performance of comparable models remains consistent. We recommend checking whether the correct prompt template was used (note that Qwen2's template differs from Qwen2.5's).
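One common source of divergence between frameworks is how the final answer is extracted from a generation before it is compared against the reference. As a minimal illustration (a hypothetical helper, not the actual extraction code of DART-Math or the Qwen2 toolkit), here is one way to pull out the last `\boxed{...}` span:

```python
def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a generation,
    balancing braces so nested answers like \\boxed{\\frac{1}{2}} survive."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(c)
        i += 1
    return "".join(out)

print(extract_boxed(r"... so the answer is \boxed{\frac{1}{2}}."))  # \frac{1}{2}
```

A framework that instead regex-matches the first `\boxed{}`, or that mishandles nested braces, can score the same generation differently, which is enough to shift a benchmark number by a few points.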
We used the qwen2-boxed template:

```python
(
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{input}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n",
    "{output}",
    "\n\n",
)
```

I ran the evaluation five times, and the results were similar each run.
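For reference, this is how I would expect that `(input_template, output_template, separator)` tuple to be applied to build a zero-shot prompt; `build_prompt` is my own sketch, not the toolkit's exact code:

```python
# qwen2-boxed template: (input_template, output_template, example_separator)
QWEN2_BOXED = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{input}\nPlease reason step by step, and put your "
    "final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n",
    "{output}",
    "\n\n",
)

def build_prompt(question: str) -> str:
    # Fill the {input} slot; {{}} renders as a literal {} after .format().
    # The assistant turn is left open for the model to complete.
    return QWEN2_BOXED[0].format(input=question)

print(build_prompt("What is 1 + 1?"))
```

The `"\n\n"` separator would only come into play for few-shot prompts, where filled-in (input, output) examples are joined ahead of the final question.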
My WeChat ID: uects-thu-htxu. I may be missing something.
I get "User does not exist". My WeChat ID: yyding01
Local inference with the qwen2-math-evaluation toolkit and the Qwen2-Math-7B-ScaleQuest model does not match the official results in the paper, especially on College Math:

| Benchmark | Reported | Reproduced |
| --- | --- | --- |
| College Math | 50.8 | 40.8 |
| OlympiadBench | 38.5 | 35.4 |
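In case it helps others reproduce the local-inference setup, here is a minimal greedy-decoding sketch with Hugging Face transformers using the qwen2-boxed prompt above; the model path and generation settings are my assumptions, not the toolkit's official eval config:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/Qwen2-Math-7B-ScaleQuest"  # hypothetical local checkpoint path

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

question = "What is the sum of the first 10 positive integers?"
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n{question}\nPlease reason step by step, and put your "
    "final answer within \\boxed{}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Note that greedy decoding here is a simplification; the decoding settings the eval toolkit actually uses may differ, and they also affect the final accuracy.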