Open sanjari-orb opened 2 months ago
For Classifieds we implemented a per-example reset, but at present there's no way to do it fast enough for shopping/reddit (a full reset takes about 2 minutes). Resetting the environment after every batch is a compromise between resetting only at the end of each run (as WebArena recommends, which can leave intermediate examples not working properly, e.g., if the cart is full by example 50) and resetting after every single example (which takes too long).
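Roughly, the batched reset works like the sketch below. This is only an illustration, not the actual run scripts: `reset_env` and `run_examples` are hypothetical callbacks standing in for the Docker reset and the evaluation run, and the batch size is whatever the script picks.

```python
from typing import Callable, Iterator


def batched_ranges(start: int, end: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield half-open (lo, hi) index ranges that cover [start, end)."""
    for lo in range(start, end, batch_size):
        yield lo, min(lo + batch_size, end)


def run_with_batched_resets(
    num_examples: int,
    batch_size: int,
    reset_env: Callable[[], None],
    run_examples: Callable[[int, int], None],
) -> int:
    """Reset once before each batch, run the batch, and return the reset count."""
    resets = 0
    for lo, hi in batched_ranges(0, num_examples, batch_size):
        reset_env()  # hypothetical: the slow full environment reset
        resets += 1
        run_examples(lo, hi)  # hypothetical: evaluate examples lo..hi-1
    return resets
```

With N examples and batch size B this costs ceil(N / B) resets instead of N, at the price of state carrying over between examples inside a batch.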
Hope that helps!
I am not sure I follow: why are we resetting when require_reset=False instead of True? (https://github.com/web-arena-x/visualwebarena/blob/b56b6d821e0b0f926fb940a7efe7d3f1246eab36/browser_env/envs.py#L155)
Is there some measure of how much variance in the benchmark can accumulate in this batched reset setting? Does the batch size you selected ensure that all the examples will always work (as opposed to not resetting at all)?
Hi, I am looking at the scripts `run_reddit_som.sh`, `run_shopping_som.sh`, and `run_classifieds_som.sh`. IIUC, they all involve creating batches of indices, and the docker gets reset between each of these batches: https://github.com/web-arena-x/visualwebarena/blob/b56b6d821e0b0f926fb940a7efe7d3f1246eab36/scripts/run_reddit_som.sh#L21
However, I think more than one example with `require_reset: True` can occur in every batch, based on the raw config JSON files (e.g., https://github.com/web-arena-x/visualwebarena/blob/main/config_files/vwa/test_classifieds.raw.json). If that is the case, what is the point of resetting, and how are we ensuring correctness of the run scripts?
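For reference, this is roughly how one could count such examples per batch. It is a sketch under assumptions I have not verified against the repo: that the raw config file is a JSON list of objects, that each object carries a boolean `require_reset` field, and that batches are formed by list position. The helper names are mine.

```python
import json
from collections import Counter


def load_configs(path: str) -> list[dict]:
    """Load the raw config file (assumed to be a JSON list of task objects)."""
    with open(path) as f:
        return json.load(f)


def reset_counts_per_batch(configs: list[dict], batch_size: int) -> Counter:
    """Map batch number -> number of examples in it with require_reset set."""
    counts: Counter = Counter()
    for position, cfg in enumerate(configs):
        # Assumed field name; a missing key is treated as False.
        if cfg.get("require_reset", False):
            counts[position // batch_size] += 1
    return counts
```

Any batch with a count above 1 contains more than one `require_reset` example that shares a single reset.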