Code for our AAAI 2021 paper *Confidence-aware Non-repetitive Multimodal Transformers for TextCaps* ([PDF](https://arxiv.org/abs/2012.03662)).
Our implementation is based on the Pythia framework (now called mmf) and built upon M4C-Captioner. Please refer to Pythia's documentation for details on installation requirements.
```bash
# install pythia based on requirements.txt
python setup.py build develop
```
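For a first-time setup, the whole sequence might look like the sketch below; the repository URL and the explicit `pip install` step are assumptions for illustration rather than instructions from this README:

```bash
# clone the repository (URL is an assumption for illustration)
git clone https://github.com/wzk1015/CNMT.git
cd CNMT

# install the Python dependencies, then build pythia in develop mode
pip install -r requirements.txt
python setup.py build develop
```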
The following open-source data for the TextCaps dataset comes from M4C-Captioner's GitHub repository. Please download the files from the links below and extract them under the `data` directory.
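For example, a downloaded feature archive can be extracted in place as sketched below; the archive name is an assumption for illustration, so substitute the actual files you downloaded:

```bash
mkdir -p data
# extract a downloaded archive under data/ (archive name is illustrative)
tar -xzf m4c_textvqa_ocr_en_frcn_features.tar.gz -C data/
```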
Our imdb files include new OCR tokens and recognition confidence extracted with pretrained OCR systems (CRAFT, ABCNet and four-stage STR). The three imdb files should be downloaded from the links below and put under `data/imdb/`.
| file name | download link |
|---|---|
| imdb_train.npy | Google Drive / Baidu Netdisk (password: sxbk) |
| imdb_val_filtered_by_image_id.npy | Google Drive / Baidu Netdisk (password: i6pf) |
| imdb_test_filtered_by_image_id.npy | Google Drive / Baidu Netdisk (password: uxew) |
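After downloading, put the three files in place; the download location below is an assumption for illustration:

```bash
mkdir -p data/imdb
# move the downloaded imdb files into place (~/Downloads is illustrative)
mv ~/Downloads/imdb_train.npy \
   ~/Downloads/imdb_val_filtered_by_image_id.npy \
   ~/Downloads/imdb_test_filtered_by_image_id.npy \
   data/imdb/
```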
Finally, your `data` directory structure should look like this:
```
data
|-detectron
|---...
|-m4c_textvqa_ocr_en_frcn_features
|---...
|-open_images
|---...
|-vocab_textcap_threshold_10.txt #already provided
|-imdb
|---imdb_train.npy
|---imdb_val_filtered_by_image_id.npy
|---imdb_test_filtered_by_image_id.npy
```
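As a quick, optional sanity check that the imdb files are readable (assuming numpy is installed from the requirements):

```bash
# load one imdb file and print its entry count; the files store pickled
# objects, so allow_pickle=True is required on recent numpy versions
python -c "import numpy as np; d = np.load('data/imdb/imdb_train.npy', allow_pickle=True); print(len(d))"
```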
We also provide a pretrained CNMT model:

| download link | description | val set CIDEr | test set CIDEr |
|---|---|---|---|
| [Google Drive](https://drive.google.com/file/d/1VfdvR12fPKNJnljjzSZ9lMIPw1Foa4WF/view?usp=sharing) / Baidu Netdisk (password: c4be) | CNMT best | 101.6 | 93.0 |
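If you plan to run evaluation with this checkpoint, place it where the eval scripts expect it (the path comes from the evaluation section below; the downloaded filename is an assumption):

```bash
mkdir -p save/cnmt/m4c_textcaps_cnmt
# move the downloaded checkpoint into place (filename is illustrative)
mv ~/Downloads/best.ckpt save/cnmt/m4c_textcaps_cnmt/best.ckpt
```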
We provide an example script for training on the TextCaps dataset for 12000 iterations, evaluating every 500 iterations:

```bash
./train.sh
```

This may take approximately 13 hours, depending on your GPU devices. Please refer to our paper for implementation details.
First-time training will download the fasttext model. You may also download it manually and put it under `pythia/.vector_cache/`.
During training, the log file can be found under `save/cnmt/m4c_textcaps_cnmt/logs/`. You may also run training in the background and check the log file for training status.
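For example (the exact log filename is an assumption; check the `logs/` directory for the actual name):

```bash
# launch training in the background, then follow the log
nohup ./train.sh > train_stdout.log 2>&1 &
tail -f save/cnmt/m4c_textcaps_cnmt/logs/*.log
```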
Assume the checkpoint of the trained model is saved at `save/cnmt/m4c_textcaps_cnmt/best.ckpt` (otherwise, modify the `resume_file` parameter in the shell scripts).
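For reference, the evaluation command inside the scripts should follow Pythia's standard `tools/run.py` interface; the sketch below is an assumption (the model name, config path, and exact flags may differ from the actual `eval_val.sh`), so treat it only as a guide to where `resume_file` appears:

```bash
# hypothetical sketch of an eval command; check eval_val.sh for the real flags
python tools/run.py --tasks captioning --datasets m4c_textcaps --model cnmt \
    --config configs/captioning/m4c_textcaps/cnmt.yml \
    --save_dir save/eval \
    --run_type val --evalai_inference 1 \
    --resume_file save/cnmt/m4c_textcaps_cnmt/best.ckpt
```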
Run the following scripts to generate the prediction JSON files:

```bash
# evaluate on validation set
./eval_val.sh

# evaluate on test set
./eval_test.sh
```
The prediction JSON files will be saved under `save/eval/m4c_textcaps_cnmt/reports/`. You can submit a JSON file to the TextCaps EvalAI server for results.
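Before submitting, you can spot-check the newest report; the glob below assumes the reports are written as `.json` files in that directory:

```bash
# print the first few lines of the most recent prediction file
REPORT=$(ls -t save/eval/m4c_textcaps_cnmt/reports/*.json | head -n 1)
python -m json.tool "$REPORT" | head -n 20
```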
If you find our work helpful, please consider citing:

```
@article{wang2020confidenceaware,
  title={Confidence-aware Non-repetitive Multimodal Transformers for TextCaps},
  author={Wang, Zhaokai and Bao, Renda and Wu, Qi and Liu, Si},
  year={2020},
  journal={arXiv preprint arXiv:2012.03662},
}
```