Closed: leoozy closed this issue 5 months ago
I found many similar cases to classifieds 10. Maybe "21 inches" can also be seen as a correct answer.
In classifieds 12, the task is: "What is the color of most recently listed motorcycle?" GPT-4V's answer is "red and black"; the label is "red". I think GPT-4V is correct.
In classifieds 21, the output URL is http://127.0.0.1:9980/index.php?page=item&id=33164 and the label is http://localhost:9980/index.php?page=item&id=33164. I think they are exactly the same thing.
In classifieds 30:

sites: ['classifieds']
task_id: 30
require_login: True
storage_state: ./.auth/classifieds_state.json
start_url: http://localhost:9980
geolocation: None
intent_template: Add a comment on the {{item}} with the title "{{title}}" and text "{{comment}}".
intent: Add a comment on the most expensive black couch with the title "Interesting Couch" and text "Is the price negotiable?".
image: None
instantiation_dict: {'item': 'most expensive black couch', 'title': 'Interesting Couch', 'comment': 'Is the price negotiable?'}
require_reset: True
eval: {'eval_types': ['program_html'], 'reference_answers': None, 'reference_url': '', 'program_html': [{'url': 'http://localhost:9980/index.php?page=item&id=44542', 'locator': "func:get_query_text(page, '.comments_list h3')", 'required_contents': {'must_include': ['Interesting Couch by Blake Sullivan']}}, {'url': 'http://localhost:9980/index.php?page=item&id=44542', 'locator': "func:get_query_text(page, '.comments_list')", 'required_contents': {'must_include': ['Is the price negotiable?']}}]}
reasoning_difficulty: hard
visual_difficulty: easy
overall_difficulty: medium
comments: http://localhost:9980/index.php?page=item&id=44542
intent_template_id: 11
The agent is asked to type the title as "Interesting Couch", but the evaluation requires "Interesting Couch by Blake Sullivan".
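For context, each program_html entry above is evaluated by navigating to its url, running the locator, and checking that every must_include string appears. Here is a minimal Playwright sketch of that idea (get_query_text here is a stand-in for the helper named in the locator; the repo's actual evaluator may differ):

```python
from playwright.sync_api import sync_playwright

def get_query_text(page, selector: str) -> str:
    # Stand-in: concatenated inner text of all elements matching the selector.
    return " ".join(el.inner_text() for el in page.query_selector_all(selector))

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:9980/index.php?page=item&id=44542")
    # The comment heading includes the author, e.g. "Interesting Couch by Blake Sullivan",
    # which is why the reference string is not just the typed title.
    text = get_query_text(page, ".comments_list h3")
    passed = all(s in text for s in ["Interesting Couch by Blake Sullivan"])
    print(passed)
    browser.close()
```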
Thanks for pointing these out! I've updated the exact_match cases to use must_include instead to avoid these false negatives in https://github.com/web-arena-x/visualwebarena/commit/c3b7b7d1bce29d58ff89f87e6f4b7a1bbced8843.
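The change roughly amounts to the following (a minimal sketch of the string-matching idea, not the exact evaluator code in the repo):

```python
def clean_answer(answer: str) -> str:
    # Normalize before comparing: strip surrounding whitespace/quotes, lowercase.
    return answer.strip().strip('"').strip("'").lower()

def exact_match(ref: str, pred: str) -> bool:
    # Fails on "21 inches" vs. reference "21" -> false negative.
    return clean_answer(pred) == clean_answer(ref)

def must_include(ref: str, pred: str) -> bool:
    # Passes as long as the reference appears somewhere in the prediction,
    # so "21 inches" matches reference "21".
    return clean_answer(ref) in clean_answer(pred)
```

For the other two: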
> In classifieds 21, the output URL is http://127.0.0.1:9980/index.php?page=item&id=33164 and the label is http://localhost:9980/index.php?page=item&id=33164. I think they are exactly the same thing.
This is an interesting error, but I'm not sure we can fix it, because it uses the CLASSIFIEDS environment variable, which can be set to anything (not just localhost). Unfortunately this might be a drawback of the evaluation pipeline at the moment, though I think it's possible to work around it by setting the environment variable to 127.0.0.1 instead of localhost.
> The agent is asked to type the title as "Interesting Couch", but the evaluation requires "Interesting Couch by Blake Sullivan".
This is because the actual element contains "by AUTHOR_NAME" (you can check this by inspecting the .comments_list h3 element on that page). So this is correct.
I only checked 7 cases and don't know whether there are similar errors in the other cases.
> This is because the actual element contains "by AUTHOR_NAME" (you can check this by inspecting the .comments_list h3 element on that page). So this is correct.
Thanks! It is correct.
Maybe we can add a rule to map localhost to 127.0.0.1?
In classifieds_142_gpt4v_som, the item the agent navigates to is wrong, but the URL of the wrong item exactly matches the label.
Yeah, there are a few false positives (like in classifieds task 142), but this is a limitation of the benchmark. I'd probably say that leaning towards false positives rather than false negatives is good, since the performance of current agents is so low.
Thanks!
> Maybe we can add a rule to map localhost to 127.0.0.1?
Thanks for your suggestion, we've added this in https://github.com/web-arena-x/visualwebarena/commit/a8a1648287c1d2b730e3645e016e2810b5a67195!
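The rule amounts to normalizing the hostname before comparing URLs; here is a sketch of the idea (assuming URLs are parsed rather than compared as raw strings; the actual implementation in the commit may differ):

```python
from urllib.parse import urlparse, urlunparse

def normalize_host(url: str) -> str:
    # Treat localhost and 127.0.0.1 as the same host when matching URLs.
    parts = urlparse(url)
    if parts.hostname == "localhost":
        parts = parts._replace(netloc=parts.netloc.replace("localhost", "127.0.0.1", 1))
    return urlunparse(parts)

assert (normalize_host("http://localhost:9980/index.php?page=item&id=33164")
        == normalize_host("http://127.0.0.1:9980/index.php?page=item&id=33164"))
```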
I found some errors in annotation. In classifieds_10:

sites: ['classifieds']
task_id: 10
require_login: True
storage_state: ./.auth/classifieds_state.json
start_url: http://localhost:9980
geolocation: None
intent_template: What is the {{attribute}} of {{item}}?
intent: What is the seat height in inches of the smaller piece of furniture on this page?
image: None
instantiation_dict: {'attribute': 'seat height in inches', 'item': 'the smaller piece of furniture on this page'}
require_reset: False
eval: {'eval_types': ['string_match'], 'reference_answers': {'exact_match': '21'}, 'reference_url': 'http://localhost:9980/index.php?page=item&id=43887', 'program_html': [], 'string_note': '', 'reference_answer_raw_annotation': ''}
reasoning_difficulty: easy
visual_difficulty: easy
overall_difficulty: easy
comments:
intent_template_id: 5
The output is "21 inches", which I think is correct, but it fails the exact_match against "21".
In classifieds 142, the agent navigated to the wrong item in the GPT-4V trace, but it is evaluated as correct.