xxcg322 / CryptoBench


[proposal] Refactor the dataset format #10

Open jolestar opened 1 week ago

jolestar commented 1 week ago

Motivation

Refactor the dataset format to make it easier to contribute to and reuse.

  1. Extract each question's options into a separate "options" array.
  2. Store one question per file.

001.json

{
    "id": "001",
    "categories": [
      "Bitcoin",
      "Blockchain Fundamental",
      "Knowledge",
      "Beginner"
    ],
    "question": "Which of the following is NOT one of the main components of the Bitcoin system?",
    "options": [
      "Users with wallets containing keys",
      "Transactions propagated across the network",
      "Miners producing the consensus blockchain",
      "Central banks regulating the currency"
    ],
    "answer": 3
}
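
To illustrate how a one-file-per-question layout could be consumed, here is a minimal Python sketch that loads every JSON file in a directory and checks it against the fields shown in the example. It is only a sketch under assumptions: the function names `validate_question` and `load_questions` are hypothetical, and "answer" is assumed to be a zero-based index into "options" (which matches the example, where 3 points at the fourth option).

```python
import json
from pathlib import Path

# Field names and types taken from the 001.json example in this proposal.
REQUIRED_FIELDS = {
    "id": str,
    "categories": list,
    "question": str,
    "options": list,
    "answer": int,
}


def validate_question(record: dict) -> list[str]:
    """Return a list of problems found in one question record (empty if valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # Assumption: "answer" is a zero-based index into "options".
    if not problems and not 0 <= record["answer"] < len(record["options"]):
        problems.append("answer index out of range")
    return problems


def load_questions(directory: str) -> list[dict]:
    """Load and validate every *.json file in a directory, sorted by file name."""
    questions = []
    for path in sorted(Path(directory).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        problems = validate_question(record)
        if problems:
            raise ValueError(f"{path.name}: {'; '.join(problems)}")
        questions.append(record)
    return questions
```

A check like this could run in CI so that each contributed question file is validated on its own before it is merged.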

xxcg322 commented 1 week ago

Thank you for the proposal! We will update the multiple-choice format based on your suggestions in future iterations.

However, I would like to share that in our early experiments we found the multiple-choice format was not ideal for a crypto benchmark. While high-capability models did perform better on these questions, the observed score differences were much smaller than the actual performance gap between the models. This indicates that simple multiple-choice questions are not particularly effective at differentiating the true capabilities of different models.

That said, multiple-choice questions let us run many initial experiments quickly, which contributed significantly to advancing the benchmark project. After careful consideration, though, we decided to remove the multiple-choice results from the final Leaderboard and to rely instead on more complex task-based questions, with new real-world challenges to follow. The real-world challenge tasks will require agent frameworks; we are currently testing an MVP agent setup and discussing potential integration opportunities with third-party agent teams.

Nonetheless, as you mentioned, we believe multiple-choice questions still hold value for various applications, even if they've been deprecated in the Leaderboard. Therefore, updating the format as suggested is meaningful work, and we appreciate your efforts in proposing these improvements.