Visual Dialog for Spotting the Differences between Pairs of Similar Images

Directory Structure

|-- bottom-up-attention-vqa
|-- checkpoints
        |-- pretrained
                |-- bert-base-uncased
                |-- gpt2
                |-- model_LXRT.pth
        |-- ...
|-- data
        |-- 0206
                |-- spot_diff_train.json
                |-- ...
        |-- img_feat_3ee94.h5
|-- dataloader
        |-- guesser_dataloader.py
        |-- loader_utils.py
        |-- qgen_dataloader.py
|-- lxmert
        |-- ...
|-- model
        |-- guesser.py
        |-- qgen.py
|-- scripts
|-- stat_tools
|-- ...



Set up the environment by running pip install -r requirements.txt.

Pre-Trained Model

  1. BERT
  2. GPT-2
  3. LXMERT: can be downloaded from https://github.com/airsplay/lxmert.

Place the pre-trained models in checkpoints/pretrained.
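
To confirm the pre-trained models are in place before training, a small check along the following lines may help. It is only a sketch: the expected entry names are taken from the directory structure above, and the helper name missing_pretrained is hypothetical, not part of this repository.

```python
from pathlib import Path

# Entry names taken from the directory structure shown in this README.
EXPECTED = ("bert-base-uncased", "gpt2", "model_LXRT.pth")


def missing_pretrained(root="checkpoints/pretrained", expected=EXPECTED):
    """Return the expected entries that are absent under `root`.

    Hypothetical helper for illustration; not part of the repository.
    """
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_pretrained()
    if missing:
        print("Missing from checkpoints/pretrained:", ", ".join(missing))
    else:
        print("All pre-trained models found.")
```

Run it from the repository root so that the relative path checkpoints/pretrained resolves correctly.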

SpotDiff Dataset

  1. SpotDiff dialogues: three JSON files, i.e., spot_diff_train.json, spot_diff_val.json, and spot_diff_test.json. You can download these files from Baidu Netdisk.
  2. SpotDiff images
    • You can download the original images from my Baidu Netdisk.
    • Because the images are large, I compressed them into four files. Download all four files to your local machine, then merge and decompress them.
    • Because the original image collection is very large, you may use only a subset of it.
  3. Image features: extracted with bottom-up top-down attention (BUTD). The extracted features can be downloaded here. We extracted the BUTD features by running the bottom-up-attention.pytorch code.
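
After downloading the dialogue files, it can be useful to verify that each one parses and to see how many top-level entries it holds. The sketch below makes no assumption about the schema of the JSON files. It does assume that the val and test files sit next to spot_diff_train.json in data/0206, which the directory structure above only shows for the train file. The helper name summarize_json is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return (top-level type name, number of top-level entries) for a JSON file.

    Hypothetical helper for illustration; it does not assume any schema.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


if __name__ == "__main__":
    # Assumption: all three splits live in data/0206, like spot_diff_train.json.
    data_dir = Path("data/0206")
    for split in ("train", "val", "test"):
        path = data_dir / f"spot_diff_{split}.json"
        if path.exists():
            print(path.name, *summarize_json(path))
        else:
            print(path.name, "not found")
```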


require to modify and in the following scripts.


GPT and LXMERT-based VQG model

sh scripts/train_<vqg_model_type>_vqg.sh


BUTD and LXMERT-based VQA model

sh scripts/train_<vqa_model_type>_vqa.sh


sh scripts/train_guesser.sh


sh scripts/self_play_<vqg_model_type>_<vqa_model_type>.sh
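
The model-type placeholders in the script names above depend on which scripts are present in the scripts directory. Rather than guessing the values, one option is to list them from the file names. This is a minimal sketch that assumes the training scripts follow the train_<type>_vqg.sh and train_<type>_vqa.sh naming shown above. The helper name available_types is hypothetical.

```python
from pathlib import Path


def available_types(scripts_dir, prefix, suffix):
    """Return the sorted <type> parts of files named <prefix><type><suffix>.

    Hypothetical helper for illustration; not part of the repository.
    """
    found = []
    for path in Path(scripts_dir).glob(f"{prefix}*{suffix}"):
        name = path.name
        # Strip the fixed prefix and suffix to leave only the model type.
        found.append(name[len(prefix):len(name) - len(suffix)])
    return sorted(found)


if __name__ == "__main__":
    print("VQG model types:", available_types("scripts", "train_", "_vqg.sh"))
    print("VQA model types:", available_types("scripts", "train_", "_vqa.sh"))
```

Each value printed for the VQG and VQA lists can then be substituted into the corresponding training and self-play script names.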