web-arena-x / webarena

Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
https://webarena.dev
Apache License 2.0

Indeterminism in shopping-admin best-seller report leads to very flaky eval #100

Open moladeyal opened 5 months ago

moladeyal commented 5 months ago

There's indeterminism in the shopping-admin website, in the best sellers report. To reproduce, visit this bestsellers report URL and click "show report" several times. Each click shows a different, seemingly random set of products.

This means that for the tasks with ids 0-6 (e.g. "What is the top-1 best-selling product type in Quarter 1 2022"), there's a very high chance of getting a different answer on each run, which will fail the current eval more often than not.

A potential fix is to change the Magento configuration to allow more than 5 rows per aggregate in the bestsellers report, so that all items are shown. See this link.

Could you please advise? Thanks, Eyal

shuyanzhou commented 5 months ago

Thank you for pointing out this potential issue! We are looking into this!

shuyanzhou commented 3 months ago

Sorry for the late response. I failed to reproduce the randomness on my end. Can you give a more concrete example, such as the values you entered in the three fields and the output you got each time? Thank you!

janekzimoch commented 3 months ago

I tried recreating this as well, and I think the output is deterministic, so I don't think there is any bug on the DB side.

I think what @moladeyal saw was that whenever the "Order quantity" is the same for multiple items, their relative ordering is random, which is not a bug.
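To illustrate the tie behaviour described above, here is a minimal Python sketch. It is not how the report is actually implemented; the row data and the helper names `top_n_stable` and `tied_for_first` are hypothetical. Only the two quantities of 3 come from this thread. It shows that sorting by quantity alone leaves tied rows in an arbitrary order, while adding the name as a secondary key makes the order fixed.

```python
# Hypothetical rows: (product name, order quantity). The two quantities of 3
# are taken from this thread; the third row is made up for illustration.
rows = [
    ("Quest Lumaflex™ Band", 3),
    ("Dash Digital Watch", 3),
    ("Example Product", 1),
]


def top_n_stable(rows, n):
    """Sort by quantity (descending), then by name, so ties have a fixed order."""
    return sorted(rows, key=lambda r: (-r[1], r[0]))[:n]


def tied_for_first(rows):
    """Return every product whose quantity equals the maximum quantity."""
    best = max(quantity for _, quantity in rows)
    return {name for name, quantity in rows if quantity == best}


# The result no longer depends on the order in which the rows arrive.
print(top_n_stable(rows, 1))
print(top_n_stable(list(reversed(rows)), 1))
print(tied_for_first(rows))
```

With a tie-break like this the "top-1" row is stable, but `tied_for_first` shows that more than one product still shares the top quantity.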

However, this example made me notice another bug/weakness in the evaluation data. "0.json" says that the correct answer is "Quest Lumaflex\u2122 Band", while there seem to be three correct answers: 1) Dash Digital Watch (order quantity == 3 in Q1 2022), 2) Quest Lumaflex™ Band (order quantity == 3 in Q1 2022), or 3) both, as their order quantities are the same.
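One way the eval could tolerate such ties is to accept any of the tied products. The sketch below is only an assumption about what a tie-aware check might look like, not WebArena's actual evaluator; `normalize`, `is_correct`, and the `acceptable` list are hypothetical names introduced for illustration.

```python
import json

# The escaped form stored in the JSON file decodes to the same string
# that the report displays.
reference = json.loads('"Quest Lumaflex\\u2122 Band"')
assert reference == "Quest Lumaflex™ Band"

# Hypothetical list of all answers tied for first place in Q1 2022.
acceptable = [reference, "Dash Digital Watch"]


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.casefold().split())


def is_correct(prediction, acceptable):
    """Accept a prediction that mentions at least one of the tied answers."""
    pred = normalize(prediction)
    return any(normalize(answer) in pred for answer in acceptable)


print(is_correct("Dash Digital Watch", acceptable))
print(is_correct("Quest Lumaflex™ Band and Dash Digital Watch", acceptable))
print(is_correct("Some other product", acceptable))
```

This accepts either product alone or an answer naming both, which covers the three cases listed above. A stricter variant could additionally reject predictions that name products outside the tied set.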