ylsung / vl-merging

PyTorch codes for the paper "An Empirical Study of Multimodal Model Merging"

VL Model Merging


Updates

Environment

alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} louis2889184/vilt:torch-1.10.2 bash

Please run everything inside the /src folder.

pip install -r requirements.txt
pip install -e .

To perform sharded training, we need to install the following package.

pip install fairscale==0.4.0

Pre-training Data

The pre-training data consists of four image-captioning datasets: Conceptual Captions, SBU Captions, COCO, and Visual Genome (VG). We convert the data to Apache Arrow format following DATA.md.
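
The conversion itself amounts to writing one Arrow table per split. Below is a minimal sketch of that step; the actual column names and file names follow DATA.md, so the ones used here are only illustrative.

import pandas as pd
import pyarrow as pa

# Illustrative schema only; the real columns and file names are defined in DATA.md.
rows = [{
    "image": open("path/to/image.jpg", "rb").read(),  # raw image bytes
    "caption": ["a caption for the image"],           # list of captions per image
    "image_id": "coco_0001",
    "split": "train",
}]
table = pa.Table.from_pandas(pd.DataFrame(rows))

with pa.OSFile("coco_caption_karpathy_train.arrow", "wb") as sink:
    with pa.RecordBatchFileWriter(sink, table.schema) as writer:
        writer.write_table(table)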

Terminology

In the code, some parameter names are not exactly the same as those used in the paper.

Checkpoints

Methods                 | Checkpoints
Modality-Specific Model | VL pre-trained / VQA fine-tuned / COCO fine-tuned
Modality-Agnostic Model | VL pre-trained / VQA fine-tuned / COCO fine-tuned
Our merged model        | VL pre-trained / VQA fine-tuned / COCO fine-tuned

Evaluation Pipeline

Modality-agnostic pre-training -> Modality-agnostic fine-tuning

Modality-specific pre-training -> Modality-specific fine-tuning

Seed (modality-agnostic) pre-training -> Modality-specific pre-training -> Merge and do modality-agnostic fine-tuning

Pre-training

I use this BEiT weight as the pre-trained weight, but you can use this one instead, and the performance might be better.

Seed (modality-agnostic) pre-training

Note that the seed pre-training in the paper uses only 100k steps, so please change the number of steps accordingly.

# run pre-training

log_dir=[The directory to store checkpoints and logs]
load_path=[Pre-trained weight (end with .ckpt)]

python run.py with num_gpus=8 num_nodes=6 task_mlm_itm_ifm_square_randaug_base_vl \
    exp_name=ma_200k_vlpt \
    whole_word_masking=True step200k per_gpu_batchsize=22 batch_size=1056 \
    log_dir=${log_dir} load_path=${load_path} use_beit_weight=True \
    num_workers=16 use_sharded_training=True vl_mlm_prob=0.25 ufo

Modality-specific pre-training

The pre-trained weight can be either the BEiT weight or the weight trained from the seed pre-training.

# run pre-training

log_dir=[The directory to store checkpoints and logs]
load_path=[Pre-trained weight (end with .ckpt)]

python run.py with num_gpus=8 num_nodes=6 task_mlm_itm_ifm_square_randaug_base_vl \
    exp_name=ms_200k_vlpt \
    whole_word_masking=True step200k per_gpu_batchsize=22 batch_size=1056 \
    log_dir=${log_dir} load_path=${load_path} use_beit_weight=True \
    num_workers=16 use_sharded_training=True all_moe vl_mlm_prob=0.25 use_vision_weights_for_other_modalities=True
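
For intuition, all_moe gives each transformer block separate vision, language, and vision-language copies of its weights, and use_vision_weights_for_other_modalities=True initializes the extra copies from the vision weights. A rough state-dict-level sketch of that idea follows; the key naming here is hypothetical and not the repo's actual layout.

def init_other_modalities_from_vision(state_dict):
    # Hypothetical key layout: the expert copies are tagged ".v.", ".l.", and ".vl.".
    new_state = dict(state_dict)
    for name, tensor in state_dict.items():
        if ".v." in name:
            new_state[name.replace(".v.", ".l.")] = tensor.clone()
            new_state[name.replace(".v.", ".vl.")] = tensor.clone()
    return new_state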

Fine-tuning on COCO

# run fine-tuning (modality-agnostic)

data_dir=[Path to COCO dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[Pre-trained weight (end with .ckpt)]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=ma_coco_finetuning \
    per_gpu_batchsize=20 batch_size=640 learning_rate=6.25e-6 \
    load_path=${load_path} log_dir=${log_dir} ufo

# run fine-tuning (modality-specific)

data_dir=[Path to COCO dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[Pre-trained weight (end with .ckpt)]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=ms_coco_finetuning \
    per_gpu_batchsize=20 batch_size=640 learning_rate=6.25e-6 \
    load_path=${load_path} log_dir=${log_dir} all_moe

Remember to use a modality-specific pre-trained weight; the code will merge the model and then run the fine-tuning. Commands for interpolation, modality arithmetic, and RegMean merging are given below.

# run interpolation

data_dir=[Path to COCO dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[Pre-trained weight (end with .ckpt)]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=inter0.5_ms_coco_finetuning \
    per_gpu_batchsize=20 batch_size=640 learning_rate=6.25e-6 \
    load_path=${load_path} log_dir=${log_dir} ufo merge_weights=True merge_ratio=0.5
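
For intuition, merge_weights=True with merge_ratio=0.5 corresponds to plain weight interpolation: the two modality-specific copies of each module are collapsed into a single modality-agnostic module by a convex combination of their weights. A minimal sketch, assuming merge_ratio weights the vision copy:

def interpolate_experts(vision_weights, language_weights, merge_ratio=0.5):
    # Both dicts map the same parameter names to tensors of identical shape;
    # the result replaces the two modality-specific copies with one shared module.
    return {name: merge_ratio * vision_weights[name]
                  + (1.0 - merge_ratio) * language_weights[name]
            for name in vision_weights}
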
# run modality arithmetic

# The `central_weight` serves as the origin for computing the modality vectors; in our case, it is the seed pre-training weight. `load_path` is the weight after VL pre-training.

data_dir=[Path to COCO dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[VL Pre-trained weight (end with .ckpt)]
central_weight=[The weight got from seed pre-training]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=arithmetic0.75_ms_coco_finetuning \
    per_gpu_batchsize=20 batch_size=640 learning_rate=6.25e-6 \
    load_path=${load_path} log_dir=${log_dir} central_weight=${central_weight} ufo sum_task_vectors=True sum_lambda=0.75
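
For intuition, this is task (modality) arithmetic: each modality vector is the difference between a modality-specific weight and the central (seed pre-training) weight, and the merged model adds the scaled sum of these vectors back onto the central weight. A minimal state-dict sketch, assuming all entries are floating-point parameters:

def merge_by_modality_arithmetic(central, modality_specific, sum_lambda=0.75):
    # central: state dict from seed pre-training;
    # modality_specific: list of state dicts pre-trained starting from the central weight.
    merged = {}
    for name, w0 in central.items():
        modality_vectors = [sd[name] - w0 for sd in modality_specific]
        merged[name] = w0 + sum_lambda * sum(modality_vectors)
    return merged
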
# run RegMean

data_dir=[Path to COCO dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[VL modality-specific pre-trained weight (end with .ckpt)]
gram_matrices=[Name of the pre-extracted Gram matrices (produced by cache_gram_matrices.py), e.g., gram_matrices]

# Compute the gram matrices
python cache_gram_matrices.py with data_root=${data_dir} num_gpus=1 num_nodes=1 \
    task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=coco_ma_gram_matrices \
    per_gpu_batchsize=160 batch_size=160 image_size=224 load_path=${load_path} \
    log_dir=${log_dir}/ all_moe representation_name=${gram_matrices} get_recall_metric=False

# Merge with RegMean and fine-tune
python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_irtr_coco_square_randaug_base_image384 \
    exp_name=regmean_ms_coco_finetuning \
    per_gpu_batchsize=20 batch_size=640 learning_rate=6.25e-6 \
    load_path=${load_path} log_dir=${log_dir} ufo regmean=True scaling_for_non_diag=1.0 \
    gram_matrices=${gram_matrices}.pth
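
For intuition, RegMean merges each linear layer in closed form using the cached Gram matrices G_i = X_i^T X_i of that layer's inputs: W_merged = (sum_i G_i)^{-1} * sum_i (G_i W_i), with the off-diagonal entries of each G_i optionally down-weighted by scaling_for_non_diag. A minimal sketch for a single nn.Linear weight (shape [out_dim, in_dim]) follows; the repo's actual implementation may differ in details.

import torch

def regmean_merge(weights, grams, scaling_for_non_diag=1.0):
    # weights: list of [out_dim, in_dim] linear weights to merge
    # grams:   list of [in_dim, in_dim] Gram matrices of the corresponding layer inputs
    sum_g, sum_gw = None, None
    for w, g in zip(weights, grams):
        diag = torch.diag(torch.diag(g))
        g = scaling_for_non_diag * (g - diag) + diag   # shrink off-diagonal entries
        gw = g @ w.t()                                 # [in_dim, out_dim]
        sum_g = g if sum_g is None else sum_g + g
        sum_gw = gw if sum_gw is None else sum_gw + gw
    merged = torch.linalg.solve(sum_g, sum_gw)         # (sum G)^{-1} (sum G W^T)
    return merged.t()                                  # back to [out_dim, in_dim]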

Fine-tuning on VQA (480x480 images)

# run fine-tuning

data_dir=[Path to VQA dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[VL modality-agnostic pre-trained weight (end with .ckpt)]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=4 task_finetune_vqa_square_randaug_base_image384_ufo \
    exp_name=ma_vqa_finetuning per_gpu_batchsize=4 batch_size=128 image_size=480 learning_rate=3e-5 \
    load_path=${load_path} log_dir=${log_dir} drop_rate=0.15 max_epoch=10 ufo

Inference on VQA (480x480 images)

# run inference

data_dir=[Path to VQA dataset]
log_dir=[The directory to store checkpoints and logs]
load_path=[VQA fine-tuned weight (end with .ckpt)]

python run.py with data_root=${data_dir} num_gpus=8 num_nodes=1 task_finetune_vqa_square_randaug_base_image384_ufo \
    exp_name=test \
    per_gpu_batchsize=32 batch_size=256 image_size=480 load_path=${load_path} \
    log_dir=${log_dir} ufo test_only=True

Fine-tuning on NLVR2 and Flickr30k

Reference

Please consider citing our work if you use this code in your projects.

@article{Sung2023AnEmpiricalSO,
  title={An Empirical Study of Multimodal Model Merging},
  author={Yi-Lin Sung and Linjie Li and Kevin Lin and Zhe Gan and Mohit Bansal and Lijuan Wang},
  journal={Empirical Methods in Natural Language Processing (Findings)},
  year={2023},
}