web-arena-x / webarena

Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
https://webarena.dev
Apache License 2.0
647 stars · 94 forks

Use of exact_match in evaluation Harness #104

Closed deepak-akkil closed 3 months ago

deepak-akkil commented 5 months ago

It seems many of the evaluations consist of exact string matches. For example, task 0 asks "What is the top-1 best-selling product in 2022?" and the evaluation is a string match with exact_match as the condition:

    "eval_types": [
      "string_match"
    ],
    "reference_answers": {
      "exact_match": "Quest Lumaflex\u2122 Band"
    }

All variations of this answer, such as "The top selling product in 2022 is Quest Lumaflex™ Band", "Quest Lumaflex™ Band is the best selling product in 2022", or "Quest Lumaflex™ Band was sold the most in 2022", should be perfectly reasonable answers.

Is there a specific reason why exact_match is used here instead of fuzzy_match?

shuyanzhou commented 4 months ago

Thank you very much, this is a good point. The main reason we use exact_match is to avoid false positives (e.g., an answer that lists Quest Lumaflex™ Band alongside product_2, product_3, ...). Empirically, we found that many models, including GPT-4, were able to produce the exact product name without additional information, since they could follow the answer format given in the in-context examples.

It is possible to use must_include as long as we make sure no additional product is included in the answer. We also try to avoid fuzzy_match whenever possible, since we don't have full control over the APIs.
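
The trade-off between the two deterministic modes can be sketched as follows. This is an illustrative sketch, not WebArena's actual evaluator code: the function names and normalization are assumptions, and fuzzy_match is omitted because it calls out to an external LLM API.

```python
# Illustrative sketch of the two deterministic string-match modes discussed
# above. Function names and lowercasing/whitespace normalization are
# assumptions, not WebArena's actual evaluator API.

def exact_match(answer: str, reference: str) -> bool:
    """Pass only when the normalized answer equals the reference verbatim."""
    return answer.strip().lower() == reference.strip().lower()

def must_include(answer: str, reference: str) -> bool:
    """Pass when the reference appears anywhere inside the answer."""
    return reference.strip().lower() in answer.strip().lower()

reference = "Quest Lumaflex\u2122 Band"
verbose = "The top selling product in 2022 is Quest Lumaflex\u2122 Band"

print(exact_match(verbose, reference))   # the extra words fail exact_match
print(must_include(verbose, reference))  # the substring check accepts the paraphrase
```

Note that must_include also illustrates the false-positive risk raised above: an answer that enumerates every product would still contain the reference substring and be scored as correct.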