PyTroch implementation of our ACL'23 paper:
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
Bang Yang, Fenglin Liu, Xian Wu, Yaowei Wang, Xu Sun, and Yuexian Zou
[2023-10-22] Update links for downloading raw images and videos in README_DATA.md.
[2023-08-28] We release the code and data.
We run the code based on Python
3.8.8, torch
1.13.1, and cuda
11.7. Please change the version of torch and cuda according to your hardwares.
git clone https://github.com/yangbang18/MultiCapCLIP.git
cd MultiCapCLIP
conda create -n zerovc python==3.8.8
conda activate zerovc
# Install a proper version of torch, e.g.:
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117 -f https://download.pytorch.org/whl/cu117/torch_stable.html
pip install -r requirement.txt
Please prepare the data following README_DATA.md.
All experiments can be conducted with scripts/pipe.sh.
train_dataset="msrvtt"
method="baseline"
tasks="adapt adapt_zeroshot"
val_datasets="coco msrvtt"
bash scripts/pipe.sh $train_dataset $method "$tasks" "$val_datasets"
In the above command, we make the baseline model auto-encode on MSR-VTT's training captions first (adapt
), and then evaluate on MS-COCO (out-of-domain) and MSR-VTT (in-donmain) (adapt_zeroshot
).
Key Arguments:
train_dataset
supports one of [coco
, msrvtt
, vatex
, flickr30k
].method
is defined in configs/methods.yaml.tasks
can be combinations of:
finetune
: fully-supervised training, where a model will be trained on 100% training captions.finetune_fewshot
: fully-supervised training, where a model will be trained on 0.01% (if applicable), 0.1%, 1%, 10% training captions.adapt
: text-only autoencoding. This task is only responsible for training.adapt_zeroshot
: evaluate the model that has done adapt
on a specific dataset.adapt_fewshot
: use the model that has done adapt
as a starting point, train the model on 0.01% (if applicable), 0.1%, 1%, 10% training captions. This task is equivalent to semi-supervised learning in the paper.val_datasets
can be combinations of [coco
, msrvtt
, vatex
, flickr30k
].X
)For the msrvtt
and vatex
datasets, X
denotes Chinese (zh
). For the flickr30k
dataset, X
can be German (de
) and French (fr
).
train_dataset="flickr30k"
method="baseline#de-de"
tasks="adapt adapt_zeroshot"
val_datasets="flickr30k"
bash scripts/pipe.sh $train_dataset $method "$tasks" "$val_datasets"
Different from the monolingual command, we append a postfix #A-B
to the method:
#
activates the multilingual mode, where we use bert-base-multilingual-cased
's vocab rather than that of bert-base-uncased
to embed tokens. See configs for more details.A
denotes which language(s) to be generated during training. For example, when we train model on English-German pairs, we can set A
to de
(the model only uses German texts as targets) or en,de
(the model uses both English and German texts as targets).B
denotes which language to be generated during evaluation. For example, #A-de
means we require generating German texts during evaluation.You can run the following command to gather results, where mean metric scores with their standard deviation across a number of runs are reported.
python show.py --root output/finetune --csv_path results/ --csv_fn finetune.csv
python show.py --root output/adapt --csv_path results/ --csv_fn adapt.csv
T-SNE Visualization: See notebooks/zero_shot_tsne.ipynb for an example.
Please [★star] this repo and [cite] the following papers if you feel our code or data useful to your research:
@inproceedings{yang-etal-2023-multicapclip,
title = "{M}ulti{C}ap{CLIP}: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning",
author = "Yang, Bang and Liu, Fenglin and Wu, Xian and Wang, Yaowei and Sun, Xu and Zou, Yuexian",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2023",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.664",
doi = "10.18653/v1/2023.acl-long.664",
pages = "11908--11922",
}
@ARTICLE{Yang2024ZeroNLG,
author={Yang, Bang and Liu, Fenglin and Zou, Yuexian and Wu, Xian and Wang, Yaowei and Clifton, David A.},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation},
year={2024},
volume={46},
number={8},
pages={5712-5724},
doi={10.1109/TPAMI.2024.3371376}
}