web-arena-x / visualwebarena

VisualWebArena is a benchmark for multimodal agents.
https://jykoh.com/vwa
MIT License

Need for resetting in run som scripts #60

Open sanjari-orb opened 2 months ago

sanjari-orb commented 2 months ago

Hi, I am looking at the scripts run_reddit_som.sh, run_shopping_som.sh, and run_classifieds_som.sh. IIUC, they all split the example indices into batches, and the Docker environment gets reset between batches.

https://github.com/web-arena-x/visualwebarena/blob/b56b6d821e0b0f926fb940a7efe7d3f1246eab36/scripts/run_reddit_som.sh#L21

However, based on the raw config JSON files, I think more than one example with require_reset: True can occur in any given batch (e.g., https://github.com/web-arena-x/visualwebarena/blob/main/config_files/vwa/test_classifieds.raw.json).

If that is the case, what is the point of resetting and how are we ensuring correctness of the run scripts?
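To make the concern concrete, here is a minimal sketch of how one could count the require_reset examples that land in each batch. It assumes the raw config parses to a list of dicts carrying a boolean `require_reset` field, and that batches are consecutive index ranges of a fixed size. The helper name `resets_per_batch`, the batch size, and the toy data are hypothetical and not taken from the repo.

```python
from collections import Counter


def resets_per_batch(examples, batch_size):
    """Count examples flagged require_reset in each consecutive batch of indices."""
    counts = Counter()
    for idx, example in enumerate(examples):
        # Assumption: a missing flag is treated as "no reset required".
        if example.get("require_reset", False):
            counts[idx // batch_size] += 1
    return dict(counts)


# Toy stand-in for the parsed contents of a raw config file (hypothetical data).
examples = [
    {"task_id": 0, "require_reset": False},
    {"task_id": 1, "require_reset": True},
    {"task_id": 2, "require_reset": True},
    {"task_id": 3, "require_reset": False},
    {"task_id": 4, "require_reset": True},
    {"task_id": 5, "require_reset": False},
]

# Maps batch number to the count of require_reset examples in that batch.
print(resets_per_batch(examples, batch_size=3))
```

Any batch with a count above 1 contains several state-changing examples that share a single reset.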

kohjingyu commented 2 months ago

For Classifieds we implemented a per-example reset, but at present there's no way to do it fast enough for shopping/reddit (a full reset takes about 2 minutes). Resetting the environment after every batch is a compromise between resetting only at the end of each run (as WebArena recommends, which can sometimes cause intermediate examples to not work properly, e.g., if the cart is full by example 50) and resetting after every single example (which takes too long).
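As a rough illustration of the trade-off, the sketch below compares the reset overhead of the three policies. Only the figure of about 2 minutes per full reset comes from this thread; the example count, the batch size, and the helper name `reset_count` are hypothetical.

```python
import math

RESET_MINUTES = 2  # "about 2 mins for a full reset", per the comment above


def reset_count(num_examples, policy, batch_size=None):
    """Number of environment resets each policy performs over one run."""
    if policy == "per_run":
        return 1
    if policy == "per_batch":
        if not batch_size:
            raise ValueError("per_batch needs a positive batch_size")
        return math.ceil(num_examples / batch_size)
    if policy == "per_example":
        return num_examples
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical run: 200 examples split into batches of 40.
for policy in ("per_run", "per_batch", "per_example"):
    n = reset_count(200, policy, batch_size=40)
    print(f"{policy}: {n} resets, ~{n * RESET_MINUTES} min of reset time")
```

Under these assumed numbers, batching keeps the overhead close to the per-run policy while bounding how much state can pile up between resets.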

Hope that helps!

sanjari-orb commented 2 months ago

I am not sure I follow. Why are we resetting when require_reset=False instead of True? (https://github.com/web-arena-x/visualwebarena/blob/b56b6d821e0b0f926fb940a7efe7d3f1246eab36/browser_env/envs.py#L155)

Is there some measure of how much variance in the benchmark can accumulate in this batched reset setting? Does the batch size you selected ensure that all the examples will always work (as opposed to not resetting at all)?