xxcg322 / CryptoBench

About the order of multiple-choice dataset #11

Closed Silas-Xu closed 1 month ago

Silas-Xu commented 1 month ago

When designing a multiple-choice question dataset, should the impact of option order on evaluation results be considered? Would it be better to extract the options into separate fields? arXiv reference: "Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions"

xxcg322 commented 1 month ago

The paper cited is highly valuable, providing a wealth of crucial information. Thanks!

In the initial version of the project, each question contained extensive supplementary information, with options presented as separate fields. However, as the workload increased, the options were integrated directly into the question text, a simpler format that made a rapid proof of concept easier. This approach, while expedient, is acknowledged to lack scientific rigor, an issue recently raised by another contributor as well. There is already a plan to process the questions in batches over the coming weeks to extract the options for multiple-choice questions, and the paper you cited will definitely help during that process.
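To make the idea concrete, here is a minimal sketch of how separately stored options could be used to test sensitivity to option order. It is an illustration only, not the project's actual code: the `MCQuestion` schema, its field names, and the helper functions `shuffled`, `render`, and `answer_letter` are all hypothetical.

```python
import random
import string
from dataclasses import dataclass


@dataclass
class MCQuestion:
    # Hypothetical schema: the stem and the options are separate fields,
    # so the options can be reordered without touching the question text.
    stem: str
    options: list[str]
    answer_index: int  # index into `options` of the correct option


def shuffled(question: MCQuestion, rng: random.Random) -> MCQuestion:
    """Return a copy with the options in random order and the answer index remapped."""
    order = list(range(len(question.options)))
    rng.shuffle(order)  # order[new_position] = old_position
    return MCQuestion(
        stem=question.stem,
        options=[question.options[i] for i in order],
        answer_index=order.index(question.answer_index),
    )


def render(question: MCQuestion) -> str:
    """Format the question as a prompt with lettered options (A, B, C, ...)."""
    lines = [question.stem]
    for letter, option in zip(string.ascii_uppercase, question.options):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def answer_letter(question: MCQuestion) -> str:
    """Letter label of the correct option in the current ordering."""
    return string.ascii_uppercase[question.answer_index]


# Example question, made up for illustration
q = MCQuestion(
    stem="Which consensus mechanism does Bitcoin use?",
    options=[
        "Proof of Work",
        "Proof of Stake",
        "Proof of Authority",
        "Delegated Proof of Stake",
    ],
    answer_index=0,
)

if __name__ == "__main__":
    variant = shuffled(q, random.Random(0))
    print(render(variant))
    print("Correct:", answer_letter(variant))
```

With this layout, the same question can be rendered under several different seeds and the model's accuracy compared across orderings. A large spread would suggest the score depends on option position rather than on knowledge.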

I would like to share more background on the multiple-choice questions. Early experiments revealed that the multiple-choice format was suboptimal for the crypto benchmark. Although high-capability models did perform better on these questions, the performance gaps were significantly smaller than the actual differences in model capabilities. This suggests that simple multiple-choice questions are not particularly effective at distinguishing the true capabilities of various models.

Nevertheless, multiple-choice questions were instrumental in conducting numerous initial experiments quickly, substantially advancing the benchmark project. After careful consideration, the decision was made to exclude multiple-choice results from the final Leaderboard. Instead, the focus has shifted to more complex, task-based questions and the development of new real-world challenges for future iterations.

That said, we believe multiple-choice questions still hold value for various applications, even though they have been deprecated in the Leaderboard. Updating the format as suggested is therefore meaningful work, and we appreciate your efforts in proposing these improvements.