
Please refer to https://github.com/yekeren/Cap2Det/issues/18 for details regarding the pre-trained classification model and the TF variable matching issue. (This issue should be resolved by checking out the forked cap2det branch in "install-env.sh".)

Cap2Det

Introduction

Implementation of our ICCV 2019 paper "Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection".

TL;DR

Here are the simplest commands for preparing the data and training the models; if this is all you need, you can skip the rest of this document.

sh dataset-tools/download_and_preprocess_voc.sh "raw-data-voc"
sh dataset-tools/download_and_preprocess_mscoco.sh "raw-data-coco"
sh dataset-tools/download_and_preprocess_flickr30k.sh "raw-data-flickr30k/"

# Train a text classifier.
sh train_text.sh "coco17_text"

# Train a Cap2Det model.
sh train_cap2det.sh "coco17_extend_match"

# Train a WSOD model.
sh train_wsod.sh "voc07_groundtruth"

The sections below explain each of these steps in detail.

Installation

We use Python 3.6.4 and TensorFlow 1.10.1. More details regarding the required packages can be found in requirements.txt.

To install the packages with the default settings, use pip:

pip install -r "requirements.txt"

Then run the environment setup script, which also checks out the forked cap2det branch mentioned above:

sh install-env.sh

Preparing data

We provide scripts to preprocess several datasets: Pascal VOC 2007/2012, MSCOCO 2017, Flickr30K, and Image Ads.

Preparing each of these datasets involves three steps: downloading the raw data, extracting Selective Search proposals, and converting the annotations into TFRecord files (for the caption datasets, we additionally build a GloVe-initialized vocabulary).

Pascal VOC

The Pascal VOC datasets are used for two purposes: (1) training and evaluating WSOD models; (2) evaluating Cap2Det models trained on the caption datasets. Note that the VOC datasets do not provide caption annotations.

For the first goal (WSOD), we tested our models on both VOC2007 and VOC2012: we train on the 5,011 and 11,540 trainval images respectively, and evaluate on the 4,952 and 10,991 test images.

For the second goal (Cap2Det), we train models on MSCOCO or Flickr30K, then evaluate on the 4,952 VOC2007 test images.
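
First, extract the Selective Search proposals: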

python "dataset-tools/create_pascal_selective_search_data.py" \
  --logtostderr \
  --data_dir="${DATA_DIR}" \
  --year="${YEAR}" \
  --set="${SET}" \
  --output_dir="${OUTPUT_DIR}"
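
Next, generate the TFRecord files, which encode the images, the ground-truth annotations, and the extracted proposals: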

python "dataset-tools/create_pascal_tf_record.py" \
  --logtostderr \
  --data_dir="${DATA_DIR}" \
  --year="${YEAR}" \
  --set="${SET}" \
  --output_path="${OUTPUT_PATH}" \
  --label_map_path="${LABEL_MAP_PATH}" \
  --proposal_data_path="${PROPOSAL_DATA_PATH}" \
  --ignore_difficult_instances

Putting it all together, one can simply run the following all-in-one command, which creates a new raw-data-voc directory and generates the files in it.

sh dataset-tools/download_and_preprocess_voc.sh "raw-data-voc"

MSCOCO 2017

We use the 591,435 captions paired with the 118,287 train2017 images to train our Cap2Det model. Evaluation is performed on either the MSCOCO test images or the 4,952 VOC2007 test images.
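
First, extract the Selective Search proposals for the train, val, and test splits: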

python "dataset-tools/create_coco_selective_search_data.py" \
  --logtostderr \
  --train_image_file="${TRAIN_IMAGE_FILE}" \
  --val_image_file="${VAL_IMAGE_FILE}" \
  --test_image_file="${TEST_IMAGE_FILE}" \
  --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
  --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
  --testdev_annotations_file="${TESTDEV_ANNOTATIONS_FILE}" \
  --output_dir="${OUTPUT_DIR}"
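
Next, generate the TFRecord files, encoding the images, captions, annotations, and proposals: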

python "dataset-tools/create_coco_tf_record.py" \
  --logtostderr \
  --train_image_file="${TRAIN_IMAGE_FILE}" \
  --val_image_file="${VAL_IMAGE_FILE}" \
  --test_image_file="${TEST_IMAGE_FILE}" \
  --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
  --train_caption_annotations_file="${TRAIN_CAPTION_ANNOTATIONS_FILE}" \
  --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
  --val_caption_annotations_file="${VAL_CAPTION_ANNOTATIONS_FILE}" \
  --testdev_annotations_file="${TESTDEV_ANNOTATIONS_FILE}" \
  --proposal_data="${PROPOSAL_DATA}" \
  --output_dir="${OUTPUT_DIR}"
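
Finally, build the vocabulary and initialize the word embeddings from GloVe: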

python "dataset-tools/create_coco_vocab.py" \
  --logtostderr \
  --train_caption_annotations_file="${TRAIN_CAPTION_ANNOTATIONS_FILE}" \
  --glove_file="${GLOVE_FILE}" \
  --output_vocabulary_file="${OUTPUT_VOCABULARY_FILE}" \
  --output_vocabulary_word_embedding_file="${OUTPUT_VOCABULARY_WORD_EMBEDDING_FILE}" \
  --min_word_freq=${MIN_WORD_FREQ}
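
Conceptually, this last step counts word frequencies over the training captions, keeps the words occurring at least min_word_freq times, and pairs them with their GloVe vectors. Below is a rough sketch of that logic; the file formats follow the command above, but this is an illustration, not the actual script.

# A rough sketch of the vocabulary construction; illustration only,
# not the repository's create_coco_vocab.py.
import collections
import json

import numpy as np

def build_vocab(caption_annotations_file, glove_file, min_word_freq):
  # Count word frequencies over all training captions.
  with open(caption_annotations_file) as f:
    annotations = json.load(f)['annotations']
  counts = collections.Counter(
      word for anno in annotations
      for word in anno['caption'].lower().split())

  # Keep the frequent words that have a GloVe vector, and collect
  # their embeddings in the same order as the vocabulary.
  vocab, vectors = [], []
  with open(glove_file) as f:
    for line in f:
      word, *values = line.rstrip().split(' ')
      if counts[word] >= min_word_freq:
        vocab.append(word)
        vectors.append(np.asarray(values, dtype=np.float32))
  return vocab, np.stack(vectors)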

Putting it all together, one can simply run the following all-in-one command, which creates a new raw-data-coco directory and generates the files in it.

sh dataset-tools/download_and_preprocess_mscoco.sh "raw-data-coco/"

Flickr30K

We also trained a Cap2Det model on the Flickr30K dataset, which contains 31,783 images and 158,915 descriptive captions.
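
As before, first extract the Selective Search proposals: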

python "dataset-tools/create_flickr30k_selective_search_data.py" \
  --logtostderr \
  --image_tar_file=${IMAGE_TAR_FILE} \
  --output_dir=${OUTPUT_DIR}
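
Next, generate the TFRecord files: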

python "create_flickr30k_tf_record.py" \
  --logtostderr \
  --image_tar_file="${IMAGE_TAR_FILE}" \
  --proposal_data_path="${PROPOSAL_DATA_PATH}" \
  --annotation_path="${ANNOTATION_PATH}" \
  --output_path="${OUTPUT_PATH}"
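
Finally, build the vocabulary and the word embeddings: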

python "create_flickr30k_vocab.py" \
  --logtostderr \
  --annotation_path="${ANNOTATION_PATH}" \
  --glove_file="${GLOVE_FILE}" \
  --output_vocabulary_file="${OUTPUT_VOCABULARY_FILE}" \
  --output_vocabulary_word_embedding_file="${OUTPUT_VOCABULARY_WORD_EMBEDDING_FILE}"

Putting it all together, one can simply run the following all-in-one command, which creates a new raw-data-flickr30k directory and generates the files in it.

sh dataset-tools/download_and_preprocess_flickr30k.sh "raw-data-flickr30k/"

Training

Pre-training of a text model

The following command launches a process to train the text classifier, which is a three-layer perceptron model.

sh train_text.sh "coco17_text"
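
Conceptually, the classifier maps a pooled caption representation to per-class scores through stacked dense layers. Here is a minimal sketch of a three-layer perceptron in TF 1.x; the pooling strategy, hidden sizes, and function name are assumptions for illustration, not the repository's actual model code.

# A minimal sketch of a three-layer perceptron text classifier (TF 1.x).
# The mean-pooling and hidden sizes are illustrative assumptions.
import tensorflow as tf

def text_classifier(word_embeddings, num_classes, hidden_units=400):
  """word_embeddings: float tensor of shape [batch, max_len, embed_dim]."""
  # Average the word embeddings to get a fixed-size caption feature.
  features = tf.reduce_mean(word_embeddings, axis=1)
  hidden1 = tf.layers.dense(features, hidden_units, activation=tf.nn.relu)
  hidden2 = tf.layers.dense(hidden1, hidden_units, activation=tf.nn.relu)
  logits = tf.layers.dense(hidden2, num_classes)  # One logit per class.
  return logits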

Cap2Det training

The difference between the Weakly Supervised Object Detection (WSOD) and Caption-to-Detection (Cap2Det) models lies in how the labels are extracted.

We define an abstract LabelExtractor class to control the behavior of label extractors. The following table shows which config files reproduce the methods in the paper; a toy sketch of the exact-match strategy follows the table.

| Name | Alternative method in the paper | Config files |
|------|---------------------------------|--------------|
| GroundtruthExtractor | GT-Label (WSOD) | coco17_groundtruth, voc07_groundtruth |
| ExactMatchExtractor | ExactMatch (EM) | coco17_exact_match |
| ExtendMatchExtractor | EM+ExtendVocab | coco17_extend_match |
| WordVectorMatchExtractor | EM+GloVePseudo, EM+LearnedGloVe | coco17_word_vector_match |
| TextClassifierMatchExtractor | EM+TextClsf | coco17_text_classifier_match |
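
To make the distinction concrete, here is a toy sketch of the ExactMatch idea: a caption yields a pseudo label for a class whenever one of the class's names (or, for EM+ExtendVocab, one of its synonyms) appears among the caption tokens. The class lists below are hypothetical; this is an illustration, not the repository's LabelExtractor implementation.

# Toy ExactMatch-style label extraction. Illustration only; not the
# repository's LabelExtractor implementation.
import re

# Hypothetical synonym lists; the real extractors use the 80 MSCOCO or
# 20 VOC class names, optionally extended with synonyms.
CLASS_SYNONYMS = {
    'person': ['person', 'man', 'woman', 'people'],
    'bicycle': ['bicycle', 'bike'],
    'car': ['car'],
}

def extract_labels(caption):
  """Returns the classes whose names appear among the caption's tokens."""
  tokens = set(re.findall(r"[a-z]+", caption.lower()))
  return {cls for cls, names in CLASS_SYNONYMS.items()
          if tokens & set(names)}

print(extract_labels('A man rides a bike down the street.'))
# e.g. {'person', 'bicycle'}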

The command to launch the training process is

sh train_cap2det.sh "[CONFIG_NAME]"

# or

sh train_wsod.sh "[CONFIG_NAME]"

where [CONFIG_NAME] is one of the file names in the configs directory.

More interested in WSOD?

If you are more interested in the WSOD task, the following new config (which uses 2,000 proposals and a batch size of 1) improves the VOC07 result reported in the paper by roughly 2% mAP (48.5 to 50.7). One can also refer to this config file to improve Cap2Det performance. The numbers below are per-class AP (%) on the VOC 2007 test set.

| Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mean |
|--------|------|------|------|------|--------|-----|-----|-----|-------|-----|-------|-----|-------|-------|--------|-------|-------|------|-------|----|------|
| In-the-paper | 68.7 | 49.7 | 53.3 | 27.6 | 14.1 | 64.3 | 58.1 | 76.0 | 23.6 | 59.8 | 50.7 | 57.4 | 48.1 | 63.0 | 15.5 | 18.4 | 49.7 | 55.0 | 48.4 | 67.8 | 48.5 |
| New-config | 64.9 | 55.4 | 59.5 | 25.0 | 22.6 | 71.9 | 71.3 | 61.7 | 26.7 | 54.8 | 52.5 | 50.3 | 45.4 | 63.1 | 22.8 | 26.4 | 45.4 | 61.0 | 66.3 | 66.2 | 50.7 |

Our paper

If you find this repository useful, please cite our paper:

@InProceedings{Ye_2019_ICCV,
  author = {Ye, Keren and Zhang, Mingda and Kovashka, Adriana and Li, Wei and Qin, Danfeng and Berent, Jesse},
  title = {Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = {October},
  year = {2019}
}

Also, please take a look at our latest work, which extracts scene graphs from captions:

@InProceedings{Ye_2021_CVPR,
  author = {Ye, Keren and Kovashka, Adriana},
  title = {Linguistic Structures as Weak Supervision for Visual Scene Graph Generation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2021}
}