Thank you very much for your work @zorazrw !!!
Honestly, very impressive!
I see that you're using auto-eval, and that it uses gpt-3.5.
My questions are:
Why did you use auto-eval instead of taking the reward from the WebArena env directly? (If it led to better performance, could you please share a bit more info?)
Why did you use gpt-3.5 instead of the more powerful gpt-4 for evals?
In your paper you mention that HTML + AX tree works better than the AX tree alone. In the evaluation where you got the 35.5% success rate, did you use HTML + AXT or AXT only?
Thank you again, very relevant and important paper!
The WebArena environment reward can be seen as disclosing ground-truth feedback, whereas real-world webpages don't give agents that feedback automatically. We want to experiment in the most realistic setting, without ground-truth rewards, so the agent needs to use other means, e.g., model-based evaluation, to measure the correctness of past trajectories. We did explore using the ground-truth reward, and a side observation is that the model-based reward is more helpful than the ground-truth reward, because the strict evaluation design can produce many false negatives.
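For anyone curious what model-based evaluation looks like in practice, here is a minimal sketch (the function and prompt names are illustrative, not the repo's actual code): the agent's past trajectory and the task instruction are formatted into a prompt, a judge model (e.g., gpt-3.5) answers whether the task succeeded, and the answer is mapped to a binary reward. The model call is passed in as a callable so it can be stubbed.

```python
from typing import Callable, List

def build_judge_prompt(task: str, trajectory: List[str]) -> str:
    """Format the task instruction and the action log for the judge model."""
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(trajectory))
    return (
        f"Task: {task}\n"
        f"Trajectory:\n{steps}\n"
        "Did the agent complete the task? Answer SUCCESS or FAILURE."
    )

def parse_verdict(judge_output: str) -> bool:
    """Map the judge's free-form answer to a binary reward."""
    return judge_output.strip().upper().startswith("SUCCESS")

def evaluate(task: str, trajectory: List[str],
             judge: Callable[[str], str]) -> bool:
    """Score a past trajectory with a judge model instead of a gt reward.

    `judge` would wrap a chat-completion call to e.g. gpt-3.5; any
    callable mapping prompt -> reply works, which makes testing easy.
    """
    return parse_verdict(judge(build_judge_prompt(task, trajectory)))

# Toy stub standing in for the chat model:
stub = lambda prompt: "SUCCESS" if "checkout" in prompt else "FAILURE"
print(evaluate("complete checkout", ["click 'Checkout'", "confirm order"], stub))
```

The upside over the strict WebArena checker is that a judge model can credit trajectories that reach the goal via an unanticipated path, which is exactly where the exact-match evaluation produces false negatives.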
gpt-3.5 just happens to be the default option. In our experiments, we use the same model for generation, evaluation, and induction.