web-arena-x / webarena

Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
https://webarena.dev
Apache License 2.0
647 stars · 94 forks

Use of exact_match in evaluation Harness #104

Closed deepak-akkil closed 3 months ago

deepak-akkil commented 5 months ago

It seems many of the evaluations consist of exact string matches. For example, task 0 asks "What is the top-1 best-selling product in 2022?" and the evaluation is a string match with exact_match as the condition:

    "eval_types": [
      "string_match"
    ],
    "reference_answers": {
      "exact_match": "Quest Lumaflex\u2122 Band"
    }

All variations of this answer, such as "The top selling product in 2022 is Quest Lumaflex™ Band", "Quest Lumaflex™ Band is the best selling product in 2022", or "Quest Lumaflex™ Band was sold the most in 2022", should be perfectly reasonable answers.

Is there a specific reason why exact_match is used here instead of fuzzy_match?

shuyanzhou commented 4 months ago

Thank you very much, this is a good point. The main reason we use exact_match is to avoid false positives (e.g., an answer that lists Quest Lumaflex™ Band alongside product_2, product_3, ...). Empirically, we found that many models, including GPT-4, were able to produce the exact product name without additional information, since they could follow the answer format given in the in-context examples.

It is possible to use must_include as long as we make sure no additional product is included in the answer. We also try to avoid fuzzy_match whenever possible, since we don't have full control over the APIs.
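
The trade-off between the two deterministic modes can be sketched as follows. This is an illustrative sketch, not WebArena's actual evaluator code: the function names and normalization are assumptions, and fuzzy_match is omitted because it calls out to an external LLM API.

```python
# Illustrative sketch of the two deterministic string-match modes discussed
# above. Function names and lowercasing/whitespace normalization are
# assumptions, not WebArena's actual evaluator API.

def exact_match(answer: str, reference: str) -> bool:
    """Pass only when the normalized answer equals the reference verbatim."""
    return answer.strip().lower() == reference.strip().lower()

def must_include(answer: str, reference: str) -> bool:
    """Pass when the reference appears anywhere inside the answer."""
    return reference.strip().lower() in answer.strip().lower()

reference = "Quest Lumaflex\u2122 Band"
verbose = "The top selling product in 2022 is Quest Lumaflex\u2122 Band"

print(exact_match(verbose, reference))   # the extra words fail exact_match
print(must_include(verbose, reference))  # the substring check accepts the paraphrase
```

Note that must_include also illustrates the false-positive risk raised above: an answer that enumerates every product would still contain the reference substring and be scored as correct.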