web-arena-x / webarena

Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
https://webarena.dev
Apache License 2.0
708 stars, 110 forks

Could I do multi-thread evaluation? #161

Open Hodge931 opened 2 months ago

Hodge931 commented 2 months ago

To speed up the evaluation, I would like to evaluate, say, 64 examples in parallel using multiple threads. Would this affect the correctness of the evaluation? Thanks a lot!

shuyanzhou commented 2 months ago

That may affect the results. We deliberately designed the order of the examples so that earlier examples won't affect later ones.

This is the script for 4 parallel runs. You can also reset the environment more frequently to avoid the inter-example influence.

Hodge931 commented 2 months ago

Thanks a lot for the reply!

  1. As I understand it, once the environment is reset, the evaluation of each example is correct. So I could set up two AWS instances and evaluate, say, examples 1-406 on instance 1 and examples 407-812 on instance 2. Would such an evaluation be correct?
  2. Errors sometimes occur partway through. For example, if the evaluation of the 10th example crashes, could I just continue with the 11th example without re-evaluating the first 10 and without resetting the environments?

Your kind suggestions are highly appreciated!
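
For illustration, the split described in point 1 can be sketched as below. This is only a minimal sketch: `split_contiguous` is a hypothetical helper, not part of the WebArena code, and the 1-812 numbering is taken from the question as written. Each chunk stays contiguous so that the original example order is preserved within an instance; whether the resulting evaluation is correct still depends on each instance starting from a freshly reset environment.

```python
from typing import List, Tuple


def split_contiguous(first_id: int, last_id: int, n_workers: int) -> List[Tuple[int, int]]:
    """Split the inclusive ID range [first_id, last_id] into n_workers contiguous chunks.

    Hypothetical helper for illustration; it only computes ranges and does not
    launch or reset anything.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    total = last_id - first_id + 1
    if total < n_workers:
        raise ValueError("more workers than examples")
    base, extra = divmod(total, n_workers)
    chunks = []
    start = first_id
    for worker in range(n_workers):
        # The first `extra` workers take one additional example each.
        size = base + (1 if worker < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


if __name__ == "__main__":
    for instance, (start, end) in enumerate(split_contiguous(1, 812, 2), start=1):
        print(f"instance {instance}: examples {start}-{end}")
```

With two workers this reproduces the 1-406 / 407-812 split from the question.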

dryingpaint commented 2 months ago

Hello! Do you mind elaborating on how later tasks depend on earlier tasks? Is there any way to launch a separate site for each new task we're evaluating, so that we can run multiple agents at the same time? How often should the environment be reset? Thanks for your help :)

leoozy commented 1 month ago

Hello, do you have any advice on how to set up multiple Docker containers for the same website? For example, we could set up 10 shopping websites on different ports so that we can evaluate in parallel. Thank you!
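
As a rough sketch of the bookkeeping this idea would need, the snippet below builds one base URL per replica by offsetting a starting port. Everything here is an assumption for illustration: `replica_urls` is a hypothetical helper, and the host and port values are placeholders. It does not start any containers, and nothing in this thread confirms that running several copies of the same site this way is supported; each replica would still need its own container and its own reset.

```python
from typing import List


def replica_urls(host: str, base_port: int, n_replicas: int) -> List[str]:
    """Return one base URL per replica, using consecutive ports from base_port.

    Hypothetical helper: it only formats URLs. Starting the containers and
    pointing each agent at its URL is left to the caller.
    """
    if n_replicas < 1:
        raise ValueError("n_replicas must be at least 1")
    if not (0 < base_port and base_port + n_replicas - 1 <= 65535):
        raise ValueError("port range is outside 1-65535")
    return [f"http://{host}:{base_port + i}" for i in range(n_replicas)]


if __name__ == "__main__":
    # Placeholder host and port; substitute whatever your deployment uses.
    for worker, url in enumerate(replica_urls("localhost", 7770, 10)):
        print(f"worker {worker} -> {url}")
```

Each worker would then evaluate its own subset of examples against its own URL, so that no two agents modify the same site state at once.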