Code for our AAAI 2021 paper *Confidence-aware Non-repetitive Multimodal Transformers for TextCaps* ([PDF](https://arxiv.org/abs/2012.03662)).
Our implementation is based on the Pythia framework (now called mmf) and built upon M4C-Captioner. Please refer to Pythia's documentation for details on installation requirements.
```bash
# install pythia based on requirements.txt
python setup.py build develop
```
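For a first-time setup, the whole sequence might look like the sketch below; the repository URL and the explicit `pip install` step are assumptions for illustration rather than instructions from this README:

```bash
# clone the repository (URL is an assumption for illustration)
git clone https://github.com/wzk1015/CNMT.git
cd CNMT

# install the Python dependencies, then build pythia in develop mode
pip install -r requirements.txt
python setup.py build develop
```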
The following open-source data for the TextCaps dataset comes from M4C-Captioner's GitHub repository. Please download the files from the links below and extract them under the `data` directory.
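For example, a downloaded feature archive can be extracted in place as sketched below; the archive name is an assumption for illustration, so substitute the actual files you downloaded:

```bash
mkdir -p data
# extract a downloaded archive under data/ (archive name is illustrative)
tar -xzf m4c_textvqa_ocr_en_frcn_features.tar.gz -C data/
```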
Our imdb files include new OCR tokens and recognition confidence extracted with pretrained OCR systems (CRAFT, ABCNet and four-stage STR). The three imdb files should be downloaded from the links below and put under `data/imdb/`.
| file name | download link |
|---|---|
| imdb_train.npy | Google Drive / Baidu Netdisk (password: sxbk) |
| imdb_val_filtered_by_image_id.npy | Google Drive / Baidu Netdisk (password: i6pf) |
| imdb_test_filtered_by_image_id.npy | Google Drive / Baidu Netdisk (password: uxew) |
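After downloading, put the three files in place; the download location below is an assumption for illustration:

```bash
mkdir -p data/imdb
# move the downloaded imdb files into place (~/Downloads is illustrative)
mv ~/Downloads/imdb_train.npy \
   ~/Downloads/imdb_val_filtered_by_image_id.npy \
   ~/Downloads/imdb_test_filtered_by_image_id.npy \
   data/imdb/
```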
Finally, your `data` directory structure should look like this:
```
data
|-detectron
|---...
|-m4c_textvqa_ocr_en_frcn_features
|---...
|-open_images
|---...
|-vocab_textcap_threshold_10.txt #already provided
|-imdb
|---imdb_train.npy
|---imdb_val_filtered_by_image_id.npy
|---imdb_test_filtered_by_image_id.npy
```
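As a quick, optional sanity check that the imdb files are readable (assuming numpy is installed from the requirements):

```bash
# load one imdb file and print its entry count; the files store pickled
# objects, so allow_pickle=True is required on recent numpy versions
python -c "import numpy as np; d = np.load('data/imdb/imdb_train.npy', allow_pickle=True); print(len(d))"
```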
We also provide a pretrained CNMT model:

| download link | description | val set CIDEr | test set CIDEr |
|---|---|---|---|
| [Google Drive](https://drive.google.com/file/d/1VfdvR12fPKNJnljjzSZ9lMIPw1Foa4WF/view?usp=sharing) / Baidu Netdisk (password: c4be) | CNMT best | 101.6 | 93.0 |
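If you plan to run evaluation with this checkpoint, place it where the eval scripts expect it (the path comes from the evaluation section below; the downloaded filename is an assumption):

```bash
mkdir -p save/cnmt/m4c_textcaps_cnmt
# move the downloaded checkpoint into place (filename is illustrative)
mv ~/Downloads/best.ckpt save/cnmt/m4c_textcaps_cnmt/best.ckpt
```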
We provide an example script for training on the TextCaps dataset for 12000 iterations, evaluating every 500 iterations:

```bash
./train.sh
```

This may take approximately 13 hours, depending on your GPU devices. Please refer to our paper for implementation details.
First-time training will download the fasttext model. You may also download it manually and put it under `pythia/.vector_cache/`.
During training, the log file can be found under `save/cnmt/m4c_textcaps_cnmt/logs/`. You may also run training in the background and check the log file for training status.
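For example (the exact log filename is an assumption; check the `logs/` directory for the actual name):

```bash
# launch training in the background, then follow the log
nohup ./train.sh > train_stdout.log 2>&1 &
tail -f save/cnmt/m4c_textcaps_cnmt/logs/*.log
```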
Assume the checkpoint of the trained model is saved at `save/cnmt/m4c_textcaps_cnmt/best.ckpt` (otherwise, modify the `resume_file` parameter in the shell scripts).
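For reference, the evaluation command inside the scripts should follow Pythia's standard `tools/run.py` interface; the sketch below is an assumption (the model name, config path, and exact flags may differ from the actual `eval_val.sh`), so treat it only as a guide to where `resume_file` appears:

```bash
# hypothetical sketch of an eval command; check eval_val.sh for the real flags
python tools/run.py --tasks captioning --datasets m4c_textcaps --model cnmt \
    --config configs/captioning/m4c_textcaps/cnmt.yml \
    --save_dir save/eval \
    --run_type val --evalai_inference 1 \
    --resume_file save/cnmt/m4c_textcaps_cnmt/best.ckpt
```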
Run the following scripts to generate the prediction JSON files:

```bash
# evaluate on validation set
./eval_val.sh

# evaluate on test set
./eval_test.sh
```
The prediction JSON files will be saved under `save/eval/m4c_textcaps_cnmt/reports/`. You can submit a JSON file to the TextCaps EvalAI server for results.
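Before submitting, you can spot-check the newest report; the glob below assumes the reports are written as `.json` files in that directory:

```bash
# print the first few lines of the most recent prediction file
REPORT=$(ls -t save/eval/m4c_textcaps_cnmt/reports/*.json | head -n 1)
python -m json.tool "$REPORT" | head -n 20
```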
If you find our work helpful, please consider citing:

```
@article{wang2020confidenceaware,
  title={Confidence-aware Non-repetitive Multimodal Transformers for TextCaps},
  author={Wang, Zhaokai and Bao, Renda and Wu, Qi and Liu, Si},
  year={2020},
  journal={arXiv preprint arXiv:2012.03662},
}
```