web-arena-x / webarena

Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
https://webarena.dev
Apache License 2.0

Overuse of exact matches in eval. #134

Open afourney opened 5 months ago

afourney commented 5 months ago

I would like to re-open issue #104

There's an overuse of exact matches in the eval harness. For example, consider task 649:

 "intent": "Post in history subreddit about what could diffusion model help the correpong field.",
...
          "required_contents": {
            "must_include": [
              "diffusion model",
              "help"
            ]
          }

Our agent created the following post, but failed the test because it used the word "benefit" rather than the word "help":

[screenshot of the agent's post]

This is overly rigid, and it is not an outlier. We should move to LLM matching, or make the intent more explicit.
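For illustration, the failing check presumably boils down to a case-insensitive substring test, something like this sketch (the names are mine, not the harness's actual code):

    # Hypothetical sketch of a must_include-style check; not the actual
    # WebArena harness code.
    def must_include_pass(required_phrases, post_text):
        text = post_text.lower()
        return all(phrase.lower() in text for phrase in required_phrases)

    # A post that says "benefit" instead of "help" satisfies the intent
    # but fails the check:
    must_include_pass(
        ["diffusion model", "help"],
        "Diffusion models could greatly benefit the history field ...",
    )  # -> False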

afourney commented 5 months ago

@deepak-akkil

afourney commented 5 months ago

Here's another example. Task 643.

    "intent": "Post a notice on a virtual meetup for racing cars enthusiasts on Oct 21st in the nyc subreddit",

    "required_contents": {
            "must_include": [
              "racing cars",
              "Oct 21st",
              "virtual meetup"
            ]
          }

Our agent posted:

[screenshot of the agent's post]

I believe this is failing because the post used "Racing car" (singular) rather than "racing cars" (plural), and possibly also "October 21st" rather than "Oct 21st", though I've not traced the failure to confirm.
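One cheap mitigation, short of a full LLM judge, would be to normalize both sides before matching. A rough sketch (the plural-stripping heuristic and the token-length cutoff are arbitrary, and date variants like "Oct" vs "October" would still need their own handling):

    import re

    # Naive normalization: lowercase, tokenize, strip trailing plural "s"
    # from longer tokens. Illustrative only; a lemmatizer would be more robust.
    def normalize(text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return " ".join(t.rstrip("s") if len(t) > 3 else t for t in tokens)

    def soft_include(required, text):
        return normalize(required) in normalize(text)

    soft_include("racing cars", "A virtual meetup for Racing car enthusiasts!")  # -> True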

shuyanzhou commented 5 months ago

Hi @afourney, this is a great point! We attempted to minimize the number of LLM-based evaluations to keep the evaluation straightforward. However, due to different prompts and implementations, agents can behave differently yet all be correct. I am re-examining the annotations to see to what extent and how we want to update the verifiers. We also want to make sure that comparisons across works are fair.

afourney commented 5 months ago

Agreed that changing the scoring function now would diverge from runs already completed. Perhaps some versioning of the benchmark would be key here. But I'm pretty convinced that addressing such issues will improve the quality and utility of WebArena.

gagb commented 5 months ago

Same issue with task 27: the string match requires "0", but our agent answered "zero" and was marked as a fail.
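A targeted fix for this class of failure would be to canonicalize number words before comparison; a minimal sketch (the mapping is deliberately tiny and illustrative):

    # Map number words to digits before comparing; incomplete, illustrative mapping.
    NUM_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3"}

    def canonicalize(answer):
        return " ".join(NUM_WORDS.get(tok, tok) for tok in answer.lower().split())

    canonicalize("zero") == canonicalize("0")  # -> True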

gagb commented 5 months ago

> Hi @afourney, this is a great point! We attempted to minimize the number of LLM-based evaluations to keep the evaluation straightforward. However, due to different prompts and implementations, agents can behave differently yet all be correct. I am re-examining the annotations to see to what extent and how we want to update the verifiers. We also want to make sure that comparisons across works are fair.

I think we need to at least (vastly) expand the patterns used for exact match, or get rid of exact match and use only LLMs. The cost of LLM evaluation seems to be lower than what agents typically spend to solve the task, so it doesn't seem like a big deal. If noise in evaluation is a worry, it isn't clear which is noisier: exact match (because of low recall) or LLM-based eval (because of stochasticity in generations).
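For concreteness, an LLM judge for must_include-style checks could be as small as the sketch below (the prompt, model name, and use of the OpenAI v1 client are my assumptions, not the harness's implementation; pinning temperature to 0 also reduces the stochasticity concern):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    # Hypothetical LLM judge: asks whether the post conveys the required
    # content, accepting paraphrases like "benefit" for "help".
    def llm_satisfies(intent, required_content, post_text):
        prompt = (
            f"Task intent: {intent}\n"
            f"Required content (paraphrases are acceptable): {required_content}\n"
            f"Agent's post: {post_text}\n"
            "Does the post convey the required content? Answer yes or no."
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")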

peterychang commented 4 months ago

Task 266 has the same issue

    "intent": "What's the closest national park to the largest city in Maine?",
    "eval": {
      "eval_types": [
        "string_match"
      ],
      "reference_answers": {
        "exact_match": "Acadia National Park"
      },

Our agent responded with "Acadia National Park is the closest national park to Portland, Maine." and failed the exact-match check.
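For answers like this, relaxing exact_match to a containment check would already fix the false negative; a small sketch (illustrative, not the harness's code):

    # Treat the reference as matched if it appears anywhere in the response.
    def contains_match(reference, response):
        return reference.lower() in response.lower()

    contains_match(
        "Acadia National Park",
        "Acadia National Park is the closest national park to Portland, Maine.",
    )  # -> True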

shuyanzhou commented 4 months ago

Thanks for the excellent catches. We are working on releasing the new evaluation, which will address such false negative judgments.