Thank you very much for your work @zorazrw !!!
Honestly, very impressive!
I see that you're using auto-eval, and that it uses gpt-3.5.
My questions are:
Why did you use auto-eval instead of taking the reward from the WebArena env directly? (If it led to better performance, could you please share a bit more info?)
Why did you use gpt-3.5 instead of the more powerful gpt-4 for evals?
In your paper you mention that HTML + AX tree works better than the AX tree alone. In the evaluation where you got the 35.5% success rate, did you use HTML + AXT or AXT only?
Thank you again, very relevant and important paper!
The WebArena environment reward can be seen as disclosing ground-truth feedback, whereas real-world webpages don't give agents that feedback automatically. We want to experiment in the most realistic setting, without ground-truth rewards, so the agent needs to use other means, e.g., model-based evaluation, to measure the correctness of past trajectories. We did explore using the ground-truth reward, and a side observation is that the model-based reward is more helpful than the ground-truth reward, because the strict evaluation design can produce many false negatives.
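For anyone curious what model-based evaluation looks like in practice, here is a minimal sketch (the function and prompt names are illustrative, not the repo's actual code): the agent's past trajectory and the task instruction are formatted into a prompt, a judge model (e.g., gpt-3.5) answers whether the task succeeded, and the answer is mapped to a binary reward. The model call is passed in as a callable so it can be stubbed.

```python
from typing import Callable, List

def build_judge_prompt(task: str, trajectory: List[str]) -> str:
    """Format the task instruction and the action log for the judge model."""
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(trajectory))
    return (
        f"Task: {task}\n"
        f"Trajectory:\n{steps}\n"
        "Did the agent complete the task? Answer SUCCESS or FAILURE."
    )

def parse_verdict(judge_output: str) -> bool:
    """Map the judge's free-form answer to a binary reward."""
    return judge_output.strip().upper().startswith("SUCCESS")

def evaluate(task: str, trajectory: List[str],
             judge: Callable[[str], str]) -> bool:
    """Score a past trajectory with a judge model instead of a gt reward.

    `judge` would wrap a chat-completion call to e.g. gpt-3.5; any
    callable mapping prompt -> reply works, which makes testing easy.
    """
    return parse_verdict(judge(build_judge_prompt(task, trajectory)))

# Toy stub standing in for the chat model:
stub = lambda prompt: "SUCCESS" if "checkout" in prompt else "FAILURE"
print(evaluate("complete checkout", ["click 'Checkout'", "confirm order"], stub))
```

The upside over the strict WebArena checker is that a judge model can credit trajectories that reach the goal via an unanticipated path, which is exactly where the exact-match evaluation produces false negatives.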
gpt-3.5 just happens to be the default option. In our experiments, we use the same model for generation, evaluation, and induction.