web-arena-x / visualwebarena

VisualWebArena is a benchmark for multimodal agents.
https://jykoh.com/vwa
MIT License

Blank screenshot when running GPT-4V + SOM #19

Closed: ltzheng closed this issue 5 months ago

ltzheng commented 5 months ago

The screenshot comes out blank when I run the GPT-4V + SoM agent with the following flags:

python run.py \
  --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
  --test_start_idx 0 \
  --test_end_idx 1 \
  --result_dir <your_result_dir> \
  --test_config_base_dir=config_files/test_shopping \
  --model gpt-4-vision-preview \
  --action_set_tag som \
  --observation_type image_som

Here is part of render_0.html:

[image: excerpt of render_0.html]

The GPT response also shows that the image sent was empty.

kohjingyu commented 5 months ago

Did you set the appropriate ENV variables and run scripts/generate_test_data.py (see instructions)? I'd double-check what the 0.json config file looks like, and whether you can open its URL in your own web browser.
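The checks suggested above can be roughly scripted. This is only a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes that each generated config stores its page address under a "start_url" key and that an unprocessed config still contains placeholders of the form __SHOPPING__. Adjust the key and the variable names to whatever your setup actually uses.

```python
import os
import re
import urllib.request

# Assumption: un-generated configs still contain placeholders such as
# "__SHOPPING__" instead of a real base URL.
PLACEHOLDER = re.compile(r"__[A-Z_]+__")


def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def check_config(config, url_key="start_url"):
    """Return a list of problems found in one loaded config dict.

    `url_key` is an assumption about the config schema.
    """
    url = config.get(url_key)
    if not url:
        return [f"missing or empty {url_key!r}"]
    if PLACEHOLDER.search(url):
        return [f"unresolved placeholder in {url_key!r}: {url}"]
    if not url.startswith(("http://", "https://")):
        return [f"{url_key!r} is not an http(s) URL: {url}"]
    return []


def is_reachable(url, timeout=10.0):
    """Return True if a GET on `url` succeeds, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; a malformed
        # URL raises ValueError.
        return False
```

To use it, load config_files/test_shopping/0.json with json.load, pass the resulting dict to check_config, and call is_reachable on its URL only if no problems are reported. If the page is unreachable from the machine running the agent, a blank screenshot would not be surprising.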