zorazrw / agent-workflow-memory

AWM: Agent Workflow Memory
Apache License 2.0

Evaluation question #4

Open Serega6678 opened 1 month ago

Serega6678 commented 1 month ago

Thank you very much for your work @zorazrw !!! Honestly, very impressive!

I see that you're using auto-eval, and that it uses gpt-3.5. My questions are:

  1. Why did you use auto-eval instead of taking the reward from the WebArena env directly? (If it led to better performance, could you please give a bit more info?)
  2. Why did you use gpt-3.5 instead of the more powerful gpt-4 for evals?
  3. In your paper you mention that using HTML + AX tree is better than the AX tree only. In your evaluation (where you got a 35.5 success rate), did you use HTML + AXT or only AXT?

Thank you again, very relevant and important paper!

zorazrw commented 1 month ago

Hi! To your questions:

  1. The WebArena environment reward can be seen as disclosing ground-truth feedback, whereas real-world webpages don't give agents that feedback automatically. We want to experiment in the most realistic setting, without ground-truth rewards, so the agent needs to use other means, e.g., model-based evaluation, to measure the correctness of past trajectories. We did explore using the ground-truth reward, and a side observation is that the model-based reward is actually more helpful than the ground-truth reward, because WebArena's strict evaluation design can produce many false negatives.
  2. gpt-3.5 just happens to be the default option. In our experiments, we use the same model for generation, evaluation, and induction.
  3. All our experiments use the AXT only.
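For readers wondering what "model-based evaluation" looks like in practice, here is a minimal, hypothetical sketch. The function names and prompt wording are illustrative, not from the AWM codebase: the agent's trajectory is formatted into a prompt for a judge model (gpt-3.5 in the setup above), and the judge's verdict decides whether the trajectory counts as correct, in place of the environment's ground-truth reward.

```python
def build_eval_prompt(task: str, trajectory: list[str]) -> str:
    """Format the task and the agent's past actions into a prompt
    for a judge model (hypothetical helper, for illustration only)."""
    steps = "\n".join(f"{i + 1}. {action}" for i, action in enumerate(trajectory))
    return (
        f"Task: {task}\n"
        f"Agent actions:\n{steps}\n"
        "Did the agent complete the task? "
        "Answer 'Status: success' or 'Status: failure'."
    )


def parse_judgment(response_text: str) -> bool:
    """Count the trajectory as correct only if the judge says success."""
    return "status: success" in response_text.lower()


# Usage sketch: send `build_eval_prompt(task, actions)` to the judge model
# (e.g., gpt-3.5 via the chat API), then keep the trajectory for workflow
# induction only when parse_judgment(reply) is True.
```

The model call itself is deliberately left out here; the point is only that correctness is judged from the trajectory text, so no ground-truth signal from the environment is needed.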