web-arena-x / visualwebarena

VisualWebArena is a benchmark for multimodal agents.
https://jykoh.com/vwa
MIT License

Classifieds is impossible to evaluate for vision-based CC agents #70

Open liaopeiyuan opened 2 weeks ago

liaopeiyuan commented 2 weeks ago

Playwright is unable to render the option list of a <select> element, because the dropdown is drawn by an OS-native widget. As a result, any vision-based computer-control agent (e.g. Claude 3.5 Sonnet) cannot interact with the Category element, since its options never appear in the screenshot.

[Screenshot: 2024-11-02 at 8:28:36 PM]

I also attached a short recording of a Chromium session to illustrate my point.

https://github.com/user-attachments/assets/be7ae5cc-39a2-4821-b973-2ac0947c44b2

Might this be a limitation of the benchmark? It looks like a limitation of the evaluation harness rather than of the computer-control agent. One possible fix is to update the site's source code so that every Category selection is rendered by the browser itself.
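One way such a modification could look, sketched from the harness side rather than as a change to the Classifieds source: giving a <select> a `size` greater than 1 makes the browser draw its options inline as a list box inside the page, so they show up in screenshots. The names `EXPAND_SELECTS_JS` and `expand_selects` are hypothetical and not part of VisualWebArena, and the row cap of 10 is an arbitrary assumption. Note that this changes what the page looks like, so it may affect comparability with existing results.

```python
# Sketch only: force every <select> to render as an in-page list box.
# A `size` above 1 makes the browser draw the options inline instead of
# in a popup. Names below are hypothetical, not part of VisualWebArena.

EXPAND_SELECTS_JS = """
() => {
  const selects = document.querySelectorAll('select');
  for (const el of selects) {
    // At least 2 rows (list-box mode), at most 10 (arbitrary cap).
    el.size = Math.min(Math.max(el.options.length, 2), 10);
  }
  return selects.length;
}
""".strip()


def expand_selects(page):
    """Run the snippet on a page and return how many <select> elements it touched.

    `page` is assumed to expose `evaluate(js)`, as Playwright's Page does.
    """
    return page.evaluate(EXPAND_SELECTS_JS)
```

With Playwright this would be called as `expand_selects(page)` after each navigation and before the screenshot is taken.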

Alternatively, I may be missing some tricks to allow a headful browser to render it, but such techniques are not known to me.

Any guidance is appreciated!

kohjingyu commented 2 weeks ago

Do you have a script to replicate this easily? When we get our SoM agents to click on a dropdown, it seems to be captured in the screenshot outputs:

[screenshot]

so it is not an issue that we've faced so far. In the screenshot above there are some problems with the SoM not covering the items, but that's more of a limitation of our SoM implementation specifically.

liaopeiyuan commented 2 weeks ago

We are using the AMI ami-080f6d73cfce497a1 with @shuyanzhou's 2.0 branch, and we run the tests locally on a Mac.

We ran run_classifieds_som.sh with rendering and could not see the dropdown rendered in the SoM screenshots. We also manually dumped screenshots within the code and were unable to get it to display.
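A minimal standalone repro of the manual screenshot dump might look like the sketch below. It does not use the Classifieds site or the VisualWebArena harness: the HTML page, the option names, and the function `capture_select_screenshot` are made up for illustration, and only standard Playwright sync-API calls are used. Whether the open dropdown appears in the saved image may depend on the platform and on headless versus headful mode.

```python
# Sketch only: open a <select> on a stand-in page and save a screenshot.
# The page content and names here are hypothetical.

REPRO_HTML = """<!doctype html>
<html><body>
  <label for="category">Category</label>
  <select id="category">
    <option>Antiques</option>
    <option>Bikes</option>
    <option>Electronics</option>
  </select>
</body></html>"""


def capture_select_screenshot(out_path="select_open.png", headless=True):
    """Click the <select> and save whatever Playwright captures to out_path."""
    # Imported lazily so the module loads even where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page()
        page.set_content(REPRO_HTML)
        page.click("#category")      # ask the browser to open the dropdown
        page.wait_for_timeout(500)   # give any popup time to appear
        page.screenshot(path=out_path)
        browser.close()
    return out_path


if __name__ == "__main__":
    # Run headful to compare the visible window against the saved image.
    print(capture_select_screenshot(headless=False))
```

Running it once with `headless=True` and once with `headless=False`, on both the Mac client and a Linux client, would show whether the difference is platform-specific.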

It does seem that "Sort By" is indeed a <select> HTML element as well. I wonder if there are compatibility issues with certain client versions of Playwright on different platforms that make it unable to render the selects. Would you be running the evaluation on a Linux/Windows client? Is the browser headless or headful?

We currently have the server up on a public IP. We also tested a third-party service (https://www.browserbase.com/) and could not get the dropdown to render there either. What would be a good email address to reach you at? If you're interested, we can send you the IP address.