web-arena-x / visualwebarena

VisualWebArena is a benchmark for multimodal agents.
https://jykoh.com/vwa
MIT License

Errors in annotation #41

Closed · leoozy closed this 5 months ago

leoozy commented 5 months ago

I found some errors in the annotations. In classifieds_10:

sites: ['classifieds']
task_id: 10
require_login: True
storage_state: ./.auth/classifieds_state.json
start_url: http://localhost:9980
geolocation: None
intent_template: What is the {{attribute}} of {{item}}?
intent: What is the seat height in inches of the smaller piece of furniture on this page?
image: None
instantiation_dict: {'attribute': 'seat height in inches', 'item': 'the smaller piece of furniture on this page'}
require_reset: False
eval: {'eval_types': ['string_match'], 'reference_answers': {'exact_match': '21'}, 'reference_url': 'http://localhost:9980/index.php?page=item&id=43887', 'program_html': [], 'string_note': '', 'reference_answer_raw_annotation': ''}
reasoning_difficulty: easy
visual_difficulty: easy
overall_difficulty: easy
comments:
intent_template_id: 5

The agent's output is "21 inches", which I think is correct, but the reference answer is an exact_match on "21", so it is marked wrong.
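For context, the failure mode is that exact_match rejects any response that appends units. A minimal sketch of the difference between the two matching modes (illustrative helpers named after the config keys, not the repo's actual evaluator code):

```python
# Illustrative sketch of the two string-match modes named in the eval config.
# These helpers mirror the "exact_match" and "must_include" keys in the task
# JSON; they are not the repo's actual evaluator implementation.

def exact_match(prediction: str, reference: str) -> bool:
    # "21 inches" vs "21" fails: the whole strings must be identical.
    return prediction.strip().lower() == reference.strip().lower()

def must_include(prediction: str, reference: str) -> bool:
    # "21 inches" vs "21" passes: the reference only has to appear somewhere.
    return reference.strip().lower() in prediction.strip().lower()

assert not exact_match("21 inches", "21")
assert must_include("21 inches", "21")
```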

In classifieds 142, the agent navigates to the wrong item in the GPT-4V trace, but it is evaluated as correct.

leoozy commented 5 months ago

I found many cases similar to classifieds 10. Maybe 10 inches can also be seen as a correct answer.

leoozy commented 5 months ago

In classifieds 12, the image is: [image attachment]. The task is: What is the color of the most recently listed motorcycle? GPT-4V's answer is "red and black", while the label is "red". I think GPT-4V is correct.

leoozy commented 5 months ago

In classifieds 21, the output URL is http://127.0.0.1:9980/index.php?page=item&id=33164, while the label is http://localhost:9980/index.php?page=item&id=33164. I think they are exactly the same thing.

leoozy commented 5 months ago

sites: ['classifieds']
task_id: 30
require_login: True
storage_state: ./.auth/classifieds_state.json
start_url: http://localhost:9980
geolocation: None
intent_template: Add a comment on the {{item}} with the title "{{title}}" and text "{{comment}}".
intent: Add a comment on the most expensive black couch with the title "Interesting Couch" and text "Is the price negotiable?".
image: None
instantiation_dict: {'item': 'most expensive black couch', 'title': 'Interesting Couch', 'comment': 'Is the price negotiable?'}
require_reset: True
eval: {'eval_types': ['program_html'], 'reference_answers': None, 'reference_url': '', 'program_html': [{'url': 'http://localhost:9980/index.php?page=item&id=44542', 'locator': "func:get_query_text(page, '.comments_list h3')", 'required_contents': {'must_include': ['Interesting Couch by Blake Sullivan']}}, {'url': 'http://localhost:9980/index.php?page=item&id=44542', 'locator': "func:get_query_text(page, '.comments_list')", 'required_contents': {'must_include': ['Is the price negotiable?']}}]}
reasoning_difficulty: hard
visual_difficulty: easy
overall_difficulty: medium
comments: http://localhost:9980/index.php?page=item&id=44542
intent_template_id: 11

The agent is asked to type the title "Interesting Couch", but the evaluation checks for "Interesting Couch by Blake Sullivan".

kohjingyu commented 5 months ago

Thanks for pointing these out! I've updated the exact_match cases to use must_include instead to avoid these false negatives in https://github.com/web-arena-x/visualwebarena/commit/c3b7b7d1bce29d58ff89f87e6f4b7a1bbced8843. For the other two:

In classifieds 21, the output URL is http://127.0.0.1:9980/index.php?page=item&id=33164, while the label is http://localhost:9980/index.php?page=item&id=33164. I think they are exactly the same thing.

This is an interesting error, but I'm not sure we can fix it, because the evaluation uses the CLASSIFIEDS environment variable, which can be set to anything (not just localhost). Unfortunately this might be a drawback of the evaluation pipeline at the moment, though I think you can work around it by setting the environment variable to 127.0.0.1 instead of localhost.
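For anyone hitting this in the meantime, the workaround is to set the variable to its IP form before running the harness. A sketch (the exact value format is an assumption based on the start_url in the task configs above; check the setup docs for the real format):

```python
import os

# Workaround sketch: point CLASSIFIEDS at the IP form so the reference URLs
# and the URLs the agent reports agree on the host. The value format is an
# assumption based on the start_url (http://localhost:9980) in the configs.
os.environ["CLASSIFIEDS"] = "http://127.0.0.1:9980"
```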

The agent is asked to type the title "Interesting Couch", but the evaluation checks for "Interesting Couch by Blake Sullivan".

This is because the actual element contains "by AUTHOR_NAME" (you can check this by inspecting the .comments_list h3 element on that page). So this is correct.
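As a rough illustration of what that program_html check does: the locator string func:get_query_text(page, '.comments_list h3') pulls the text of the matched elements, and must_include then looks for the full "TITLE by AUTHOR" string in it. A sketch under the assumption that get_query_text concatenates the text of all elements matching the selector (inferred from the locator string, not copied from the repo):

```python
# Sketch of the program_html check for classifieds task 30. The helper below
# is an assumption inferred from the locator string in the task config; the
# repo's real get_query_text may differ.
from playwright.sync_api import sync_playwright

def get_query_text(page, selector: str) -> str:
    return " ".join(el.text_content() or "" for el in page.query_selector_all(selector))

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:9980/index.php?page=item&id=44542")
    heading = get_query_text(page, ".comments_list h3")
    # The rendered <h3> is "TITLE by AUTHOR", so must_include is checked
    # against "Interesting Couch by Blake Sullivan", not just the title.
    print("Interesting Couch by Blake Sullivan" in heading)
    browser.close()
```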

leoozy commented 5 months ago

I only checked 7 cases, so I don't know whether there are similar errors in the other cases.

leoozy commented 5 months ago

This is because the actual element contains "by AUTHOR_NAME" (you can check this by inspecting the .comments_list h3 element on that page). So this is correct.

Thanks! It is correct.

leoozy commented 5 months ago

Maybe we can add a rule to normalize localhost -> 127.0.0.1?
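Such a rule could be as simple as normalizing the hostname before comparing URLs. A minimal sketch of the idea (an illustration only, not the code from the fix that was eventually committed):

```python
# Sketch of the suggested rule: treat localhost and 127.0.0.1 as the same
# host before comparing the agent's answer URL with the reference URL.
from urllib.parse import urlparse, urlunparse

def normalize_host(url: str) -> str:
    parts = urlparse(url)
    host = parts.hostname or ""
    if host == "localhost":
        host = "127.0.0.1"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunparse(parts._replace(netloc=netloc))

assert normalize_host("http://localhost:9980/index.php?page=item&id=33164") == \
       normalize_host("http://127.0.0.1:9980/index.php?page=item&id=33164")
```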

leoozy commented 5 months ago

In classifieds_142_gpt4v_som, the item the agent navigates to is wrong, but the location of the wrong item is exactly the same as the label.

kohjingyu commented 5 months ago

Yeah, there are a few false positives (like in classifieds task 142), but this is a limitation of the benchmark. I'd say leaning towards false positives rather than false negatives is good, since the performance of current agents is still so poor.

leoozy commented 5 months ago

Thanks!

kohjingyu commented 5 months ago

Maybe we can add a rule to normalize localhost -> 127.0.0.1?

Thanks for your suggestion, we've added this in https://github.com/web-arena-x/visualwebarena/commit/a8a1648287c1d2b730e3645e016e2810b5a67195!