web-arena-x / webarena

Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
https://webarena.dev
Apache License 2.0

My agents keep trying to visit the real Reddit, and complain when they can't. #119

Status: Closed (afourney closed this 1 month ago)

afourney commented 3 months ago

We've recently started to onboard WebArena for evaluating AutoGen, and have encountered a persistent issue: GPT-4-based agents keep trying to visit the real Reddit website and complain when they can't. See the attached screenshot for an example; this occurs in a majority of the Reddit-like tasks. I figure it's because "reddit" or "subreddit" appears throughout the benchmark, in both the data (e.g., forums like AskReddit) and the questions (e.g., "Post in the most appropriate subreddit and ask for recommendations for must-have product in my life products within a budget of $30").

We are working to resolve this through prompting, but I wonder if anyone else has encountered this issue, and how it has been addressed in the past?

[Attached screenshot]

shuyanzhou commented 3 months ago

This is interesting. One solution I can think of is to rewrite the URLs in both the predicted actions and the observations, so that the local site appears to the model as the real Reddit. For example, you can replace goto(www.reddit.com) with goto(<reddit_server>.com). We did something similar when constructing the prompt.
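
A minimal sketch of this URL rewriting in Python, for illustration only. The names `LOCAL_FORUM`, `to_local`, and `to_real` and the address `http://localhost:9999` are assumptions made up for this example, not part of the WebArena codebase; substitute the address of your own self-hosted forum.

```python
import re

# Assumption: address of the self-hosted Reddit-like site (hypothetical value).
LOCAL_FORUM = "http://localhost:9999"
REAL_FORUM = "https://www.reddit.com"

# Matches reddit.com with an optional scheme and optional "www." prefix.
# The lookbehind and trailing \b skip look-alikes such as "notreddit.com",
# "old.reddit.com", or "reddit.community".
_REAL_PATTERN = re.compile(
    r"(?<![\w.-])(?:https?://)?(?:www\.)?reddit\.com\b",
    re.IGNORECASE,
)


def to_local(action: str) -> str:
    """Rewrite real-Reddit URLs in a predicted action to point at the local server."""
    return _REAL_PATTERN.sub(LOCAL_FORUM, action)


def to_real(observation: str) -> str:
    """Rewrite local-server URLs in an observation so the model sees the real domain."""
    return observation.replace(LOCAL_FORUM, REAL_FORUM)


if __name__ == "__main__":
    print(to_local("goto(www.reddit.com)"))
    print(to_real("Current URL: http://localhost:9999/forums"))
```

Applying `to_real` to every observation before it reaches the model, and `to_local` to every action before it reaches the browser, keeps the substitution invisible to the agent. Other subdomains and link shorteners are not handled here.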

afourney commented 3 months ago

Thanks. Yes, I can fudge the URLs. I'll report back, but I think a longer term approach for a future version would be to find and replace Reddit in the dataset with something else.

shuyanzhou commented 1 month ago

> but I think a longer term approach for a future version would be to find and replace Reddit in the dataset with something else.

Good point! Our original thought was that keeping the sites as realistic as possible would better elicit the model's knowledge of how to execute tasks. But that may not be necessary.