jolestar opened 1 month ago
Thank you for the proposal! We will proceed with updating the multiple-choice format based on your suggestions in future iterations.
However, I would like to share that in our early experiments, we found the multiple-choice format was not ideal for crypto benchmarks. While high-capability models did perform better on these questions, the observed differences were much smaller than the actual performance gap between models, which indicates that simple multiple-choice questions are not particularly effective at differentiating the true capabilities of different models.

That said, multiple-choice questions helped us run many initial experiments quickly, which made a significant contribution to advancing the benchmark project. After careful consideration, though, we decided to remove the multiple-choice results from the final Leaderboard, opting instead for more complex task-based questions, as well as new real-world challenges in the future. The real-world challenge tasks will require the use of agent frameworks; we are currently testing an MVP agent setup while also discussing potential integration opportunities with third-party agent teams.
Nonetheless, as you mentioned, we believe multiple-choice questions still hold value for various applications, even though they have been removed from the Leaderboard. Updating the format as suggested is therefore meaningful work, and we appreciate your efforts in proposing these improvements.
Motivation
Refactor the dataset format to make it easier to contribute to and reuse.
001.json
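For illustration only, a refactored entry in a file such as 001.json could look like the sketch below. Every field name here is an assumption for the sake of discussion, not the project's actual schema:

```json
{
  "id": "001",
  "type": "multiple-choice",
  "question": "<question text>",
  "choices": {
    "A": "<option A>",
    "B": "<option B>",
    "C": "<option C>",
    "D": "<option D>"
  },
  "answer": "A",
  "tags": ["<topic tag>"]
}
```

Keeping one self-describing JSON object per file would let contributors add or review questions independently, and downstream tools could reuse entries by `id` without parsing a monolithic dataset.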