web-arena-x / webarena

Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
https://webarena.dev
Apache License 2.0

fuzzy match gives the wrong answer in eval #139

Open cheng-tan opened 1 month ago

cheng-tan commented 1 month ago

For task 361, our agent gave this answer:

Order number 170 is Canceled, order number 189 is Pending

The evaluator uses fuzzy match for this task and scored our answer as wrong, even though it agrees with the reference answers in the task config:

        "eval_types": [
            "string_match"
        ],
        "reference_answers": {
            "fuzzy_match": [
                "170: cancelled",
                "189: pending"
            ]
        },
        "reference_url": "",
        "program_html": [],
        "string_note": "",
        "reference_answer_raw_annotation": "170: cancelled, 189: pending"
    },
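To make the mismatch concrete, here is a minimal Python sketch that compares the quoted answer with the two reference strings. It does not reproduce WebArena's evaluator, whose fuzzy-match logic is not shown in this issue. The helpers `naive_match`, `extract_statuses`, and `tolerant_match`, and the 0.8 similarity threshold, are hypothetical and exist only for illustration. The sketch shows that a strict substring check rejects the answer because of the phrasing and the "Canceled" / "cancelled" spelling variant, while a comparison keyed on order number with a small spelling tolerance accepts it.

```python
import re
from difflib import SequenceMatcher

# Reference answers copied from the task config above.
REFERENCES = ["170: cancelled", "189: pending"]

# The agent's answer as quoted in this issue.
PREDICTION = "Order number 170 is Canceled, order number 189 is Pending"


def naive_match(reference, prediction):
    """Hypothetical strict baseline: case-insensitive substring check."""
    return reference.lower() in prediction.lower()


def extract_statuses(text):
    """Map each order number to the status word that follows ':' or 'is'."""
    pairs = re.findall(r"(\d+)\s*(?::|is)\s*([A-Za-z]+)", text)
    return {order: status.lower() for order, status in pairs}


def tolerant_match(reference, prediction, threshold=0.8):
    """Hypothetical tolerant check: same order number, near-identical status.

    The threshold is an arbitrary illustrative value, not a WebArena setting.
    """
    predicted = extract_statuses(prediction)
    for order, status in extract_statuses(reference).items():
        if order not in predicted:
            return False
        ratio = SequenceMatcher(None, status, predicted[order]).ratio()
        if ratio < threshold:
            return False
    return True


if __name__ == "__main__":
    for ref in REFERENCES:
        print(ref, naive_match(ref, PREDICTION), tolerant_match(ref, PREDICTION))
```

Under these assumptions, both references fail the strict check and pass the tolerant one, which is consistent with the answer being semantically correct but phrased and spelled differently from the annotation.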