```bash
alias=$(whoami | cut -d'.' -f2)
docker run -it --rm --runtime=nvidia --ipc=host --privileged \
    -v /home/${alias}:/home/${alias} louis2889184/vilt:torch-1.10.2 bash
```
Please run everything inside the /src folder.

```bash
pip install -r requirements.txt
pip install -e .
```
To perform sharded training, we also need to install the following package:

```bash
pip install fairscale==0.4.0
```
The pre-training data consists of four image-captioning datasets: Conceptual Captions, SBU Captions, COCO, and Visual Genome (VG). We convert the data to Apache Arrow format following DATA.md.
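For orientation, the conversion described in DATA.md stores raw image bytes and their captions row by row in an Arrow file. Below is a minimal sketch of that kind of conversion; the column names and output file name are illustrative assumptions, so follow DATA.md for the exact schema the dataloaders expect.

```python
# Minimal sketch of writing caption data to an Arrow file.
# NOTE: column names and file name are illustrative assumptions;
# see DATA.md for the exact schema the dataloaders expect.
import pandas as pd
import pyarrow as pa

# Each row stores the raw image bytes plus the list of its captions.
rows = [
    [b"<raw jpeg bytes>", ["a dog on the grass"]],
    [b"<raw jpeg bytes>", ["two people walking"]],
]
table = pa.Table.from_pandas(pd.DataFrame(rows, columns=["image", "caption"]))

with pa.OSFile("coco_caption_train.arrow", "wb") as sink:
    with pa.RecordBatchFileWriter(sink, table.schema) as writer:
        writer.write_table(table)
```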
In the code, some parameter names differ from those used in the paper: `ufo` corresponds to the modality-agnostic model, and `all_moe` to the modality-specific model.
Methods | Checkpoints
---|---
Modality-Specific Model | VL pre-trained / VQA fine-tuned / COCO fine-tuned
Modality-Agnostic Model | VL pre-trained / VQA fine-tuned / COCO fine-tuned
Our merged model | VL pre-trained / VQA fine-tuned / COCO fine-tuned
The three training pipelines are as follows (a sketch contrasting the two layer designs appears after the list):

- Modality-agnostic model: modality-agnostic pre-training -> modality-agnostic fine-tuning
- Modality-specific model: modality-specific pre-training -> modality-specific fine-tuning
- Merged model: seed (modality-agnostic) pre-training -> modality-specific pre-training -> merge, then modality-agnostic fine-tuning
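To make the distinction concrete, here is a minimal sketch of the two block designs, assuming attention is shared across modalities and only the feed-forward layers are duplicated per modality (a common mixture-of-modality-experts layout). The class and argument names are illustrative, not the repo's actual modules.

```python
# Minimal sketch (illustrative, not the repo's actual module) contrasting the
# two designs: a modality-agnostic block shares one FFN for every token,
# while a modality-specific block routes tokens to per-modality FFN experts.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int, modality_specific: bool):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        n_experts = 2 if modality_specific else 1  # vision / language experts
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        # Attention is always shared across modalities.
        x = x + self.attn(x, x, x, need_weights=False)[0]
        ffn = self.ffns[modality % len(self.ffns)]  # always expert 0 if agnostic
        return x + ffn(x)
```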
I use this BEiT weight as the pre-trained weight, but you can use this one instead; performance might be better.
Note that the seed pre-training in the paper uses only 100k steps, so please change the step setting accordingly.
```bash
# run pre-training
log_dir=[The directory to store checkpoints and logs]
load_path=[Pre-trained weight (end with .ckpt)]

python run.py with num_gpus=8 num_nodes=6 task_mlm_itm_ifm_square_randaug_base_vl \
    exp_name=ma_200k_vlpt \
    whole_word_masking=True step200k per_gpu_batchsize=22 batch_size=1056 \
    log_dir=${log_dir} load_path=${load_path} use_beit_weight=True \
    num_workers=16 use_sharded_training=True vl_mlm_prob=0.25 ufo
```
The pre-trained weight can be either the BEiT weight or the weight obtained from the seed pre-training.
```bash
# run pre-training
log_dir=[The directory to store checkpoints and logs]
load_path=[Pre-trained weight (end with .ckpt)]

python run.py with num_gpus=8 num_nodes=6 task_mlm_itm_ifm_square_randaug_base_vl \
    exp_name=ms_200k_vlpt \
    whole_word_masking=True step200k per_gpu_batchsize=22 batch_size=1056 \
    log_dir=${log_dir} load_path=${load_path} use_beit_weight=True \
    num_workers=16 use_sharded_training=True all_moe vl_mlm_prob=0.25 \
    use_vision_weights_for_other_modalities=True
```

`use_sharded_training=True`: use sharded training (requires the fairscale package installed above).
```bash
# run fine-tuning
data_dir=[Path to COCO dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[Pre-trained weight (end with .ckpt)]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=ma_coco_finetuning \
    per_gpu_batchsize=20 batch_size=640 learning_rate=6.25e-6 \
    load_path=${load_path} log_dir=${log_dir} ufo
```
```bash
# run fine-tuning
data_dir=[Path to COCO dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[Pre-trained weight (end with .ckpt)]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=ms_coco_finetuning \
    per_gpu_batchsize=20 batch_size=640 learning_rate=6.25e-6 \
    load_path=${load_path} log_dir=${log_dir} all_moe
```
Remember to use the modality-specific pre-trained weight; the code will merge the model and then run fine-tuning.
```bash
# run interpolation
data_dir=[Path to COCO dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[Pre-trained weight (end with .ckpt)]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=inter0.5_ms_coco_finetuning \
    per_gpu_batchsize=20 batch_size=640 learning_rate=6.25e-6 \
    load_path=${load_path} log_dir=${log_dir} ufo merge_weights=True merge_ratio=0.5
```
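Conceptually, `merge_weights=True merge_ratio=0.5` interpolates the modality-specific experts into a single modality-agnostic weight. A minimal sketch of per-tensor interpolation follows; the function name, and which expert gets weighted by `merge_ratio`, are illustrative assumptions rather than the repo's actual API.

```python
# Minimal sketch of interpolation merging (illustrative, not the repo's exact code).
# With merge_ratio=0.5 this reduces to a plain average of the two experts.
import torch

def interpolate_experts(sd_vision: dict, sd_language: dict,
                        merge_ratio: float = 0.5) -> dict:
    """Per-tensor convex combination of two expert state dicts.

    Which expert is weighted by merge_ratio is an assumption here.
    """
    return {name: merge_ratio * sd_vision[name]
                  + (1.0 - merge_ratio) * sd_language[name]
            for name in sd_vision}
```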
```bash
# run modality arithmetic
# The `central weight` serves as the origin for computing the modality vectors;
# in our case it is the seed pre-training weight. `load_path` is the weight
# after VL pre-training.
data_dir=[Path to COCO dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[VL pre-trained weight (end with .ckpt)]
central_weight=[The weight obtained from seed pre-training]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=arithmetic0.75_ms_coco_finetuning \
    per_gpu_batchsize=20 batch_size=640 learning_rate=6.25e-6 \
    load_path=${load_path} log_dir=${log_dir} central_weight=${central_weight} ufo \
    sum_task_vectors=True sum_lambda=0.75
```
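In task-arithmetic terms, each modality vector is the difference between an expert weight and the central (seed pre-training) weight, and merging adds the `sum_lambda`-scaled sum of these vectors back to the origin. A minimal per-tensor sketch with illustrative names:

```python
# Minimal sketch of modality arithmetic (illustrative, not the repo's exact code).
# Each modality vector is (expert weight - central weight); the merged weight is
# the central weight plus the lambda-scaled sum of the modality vectors.
import torch

def modality_arithmetic(w_central: torch.Tensor,
                        expert_weights: list[torch.Tensor],
                        sum_lambda: float = 0.75) -> torch.Tensor:
    vectors = [w - w_central for w in expert_weights]  # modality vectors
    return w_central + sum_lambda * sum(vectors)
```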
```bash
# run RegMean
data_dir=[Path to COCO dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[VL modality-specific pre-trained weight (end with .ckpt)]
gram_matrices=[Name for the pre-extracted Gram matrices (from cache_gram_matrices.py), e.g. gram_matrices]

# Compute the Gram matrices
python cache_gram_matrices.py with data_root=${data_dir} num_gpus=1 num_nodes=1 \
    task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=coco_ma_gram_matrices \
    per_gpu_batchsize=160 batch_size=160 image_size=224 load_path=${load_path} \
    log_dir=${log_dir}/ all_moe representation_name=${gram_matrices} get_recall_metric=False

# Merge with RegMean and run fine-tuning
python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=regmean_ms_coco_finetuning \
    per_gpu_batchsize=20 batch_size=640 learning_rate=6.25e-6 \
    load_path=${load_path} log_dir=${log_dir} ufo regmean=True scaling_for_non_diag=1.0 \
    gram_matrices=${gram_matrices}.pth
```
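For reference, RegMean merges each linear layer by solving a least-squares problem weighted by the Gram matrices of the layer's inputs, and `scaling_for_non_diag` shrinks the off-diagonal Gram entries (1.0 leaves them unchanged). A minimal per-layer sketch, with illustrative names and an assumed (d_in, d_out) weight layout:

```python
# Minimal sketch of RegMean merging for one linear layer (illustrative, not the
# repo's exact code). G_i = X_i^T X_i is the Gram matrix of expert i's inputs
# (as cached by cache_gram_matrices.py); the merged weight is
# (sum_i G_i)^-1 (sum_i G_i W_i).
import torch

def regmean_merge(weights: list[torch.Tensor],   # each (d_in, d_out), assumed layout
                  grams: list[torch.Tensor],     # each (d_in, d_in)
                  scaling_for_non_diag: float = 1.0) -> torch.Tensor:
    scaled = []
    for g in grams:
        # Scale off-diagonal entries by scaling_for_non_diag, keep the diagonal;
        # with 1.0 the Gram matrix is left unchanged.
        s = scaling_for_non_diag * g
        s += (1.0 - scaling_for_non_diag) * torch.diag(torch.diagonal(g))
        scaled.append(s)
    lhs = sum(scaled)                                   # sum_i G_i
    rhs = sum(g @ w for g, w in zip(scaled, weights))   # sum_i G_i W_i
    return torch.linalg.solve(lhs, rhs)                 # lhs^-1 @ rhs
```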
```bash
# run fine-tuning
data_dir=[Path to VQA dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[VL modality-agnostic pre-trained weight (end with .ckpt)]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_vqa_square_randaug_base_image384_ufo \
    exp_name=ma_vqa_finetuning per_gpu_batchsize=4 batch_size=128 image_size=480 learning_rate=3e-5 \
    load_path=${load_path} log_dir=${log_dir} drop_rate=0.15 max_epoch=10 ufo
```
```bash
# run inference
data_dir=[Path to VQA dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[VQA fine-tuned weight (end with .ckpt)]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=1 task_finetune_vqa_square_randaug_base_image384_ufo \
    exp_name=test \
    per_gpu_batchsize=32 batch_size=256 image_size=480 load_path=${load_path} \
    log_dir=${log_dir} ufo test_only=True
```
For NLVR2, please change the task in the previous scripts to `task_finetune_nlvr2_square_randaug_base_image384`, update the batch size to `per_gpu_batchsize=8 batch_size=128`, and update the GPU usage to `num_gpus=8 num_nodes=2`.

For Flickr30k, please change the task in the previous scripts to `task_finetune_irtr_f30k_square_randaug_base_image384`, update the batch size and learning rate to `per_gpu_batchsize=8 batch_size=128 learning_rate=6.25e-7`, and update the GPU usage to `num_gpus=8 num_nodes=2`.
Please consider citing our work if you use this code in your projects.
```bibtex
@article{Sung2023AnEmpiricalSO,
  title   = {An Empirical Study of Multimodal Model Merging},
  author  = {Yi-Lin Sung and Linjie Li and Kevin Lin and Zhe Gan and Mohit Bansal and Lijuan Wang},
  journal = {Empirical Methods in Natural Language Processing (Findings)},
  year    = {2023},
}
```