zengyan-97 / X2-VLM

All-In-One VLM: Image + Video + Transfer to Other Languages / Domains (TPAMI 2023)
BSD 3-Clause "New" or "Revised" License

X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

X2-VLM, with its modular architecture, performs the best at both base and large scale on image-text and video-text tasks, making a good trade-off between performance and model scale. We also show that the modular design makes X2-VLM highly transferable, so that it can be used in other languages or domains. For example, by simply replacing the text encoder with XLM-R, X2-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training.

X2-VLM (large, 593M params)

Features

Please read the code for more details.

Requirements

Pretrain

# X2-VLM pretrain
python3 run.py --task "pretrain_DIY" --dist "all" --config "configs/pretrain/x2vlm_base_4m.yaml"  --output_dir "output/tmp"

# CCLM multilingual multimodal pretrain 
python3 run.py --task "pretrain_DIY" --dist "all" --config "configs/pretrain/multilingual_cclm_x2vlm_base.yaml" --checkpoint "path/to/x2vlm_base_1b.th"  --output_dir "output/tmp"

See run.py and configs/pretrain for more details.

Data

All datasets we used are publicly available. Please prepare the pre-training data yourself. Read dataset/pretrain_dataset.py (specifically ImageTextJsonDataset and RegionTextJsonDataset) to see what format is required.

The processed COCO & VG annotations can be downloaded here.

Checkpoints

Please make sure all parameters are loaded correctly.
X2VLM-base (4M)
X2VLM-large (4M)
X2VLM-base (1B)
CCLM-X2VLM-base
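
One way to check that all parameters are loaded correctly is to compare the parameter names the model expects with the names stored in the checkpoint. The sketch below is not part of this repository: the helper compare_state_dict_keys and the parameter names in the toy example are hypothetical, and it only compares names, not tensor shapes.

```python
def compare_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names as sorted lists."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    # Parameters the model defines but the checkpoint does not provide.
    missing = sorted(model_keys - checkpoint_keys)
    # Parameters stored in the checkpoint that the model never uses.
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Toy example with hypothetical parameter names (not the real X2-VLM names).
model_keys = ["text_encoder.weight", "vision_encoder.weight", "head.bias"]
checkpoint_keys = ["text_encoder.weight", "vision_encoder.weight", "old_head.bias"]

missing, unexpected = compare_state_dict_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

With PyTorch, the two key lists would typically come from model.state_dict().keys() and from the dictionary returned by torch.load(path, map_location="cpu"). Whether the weights in these checkpoints sit at the top level or under a nested key is an assumption to verify against the loading code in this repository. Both lists should be empty, or contain only names you expect to differ, such as a task-specific head.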

Finetune

Data

All datasets are publicly available. Some datasets can be downloaded here.

Checkpoints, Configs and Logs

We have released all of the code. However, for now we provide only some of the fine-tuned checkpoints (with their training configs and logs).
vqa-base
vqa-large
refcoco-bbox-large
It takes time for us to retrieve our previous training logs. If you need more, please submit a GitHub issue and we will get back to you.
coco-retrieval-base-rerun
coco-retrieval-large-rerun

Examples

# train

python3 run.py --task "vqa" --dist "all" --config "configs/finetune/vqa2_large.yaml" --checkpoint "x2vlm_ckpts_2release/x2vlm_large_1b.th"  --output_dir "output/tmp"

python3 run.py --task "refcoco_bbox" --dist "all" --config "configs/finetune/refcoco_grounding_large.yaml" --checkpoint "x2vlm_ckpts_2release/x2vlm_large_1b.th"  --output_dir "output/tmp"

python3 run.py --task "coco_captioning_mlm" --dist "all" --config "configs/finetune/coco_captioning_large.yaml" --checkpoint "x2vlm_ckpts_2release/x2vlm_large_1b.th"  --output_dir "output/tmp"

All training code is released. Specify "--task" and "--config" to fine-tune on other tasks. See run.py for details.
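
When fine-tuning on several tasks in a row, it can help to generate the commands from a list of (task, config) pairs instead of editing them by hand. The following is a minimal sketch, not part of this repository: build_command and JOBS are hypothetical names, and only the flags, task names and paths that appear in the examples above are used.

```python
import shlex


def build_command(task, config, checkpoint=None, output_dir="output/tmp", dist="all"):
    """Assemble a run.py invocation as an argument list."""
    cmd = ["python3", "run.py", "--task", task, "--dist", dist, "--config", config]
    if checkpoint is not None:
        # Start from an existing checkpoint when one is given.
        cmd += ["--checkpoint", checkpoint]
    cmd += ["--output_dir", output_dir]
    return cmd


# Checkpoint path and (task, config) pairs copied from the examples above.
CHECKPOINT = "x2vlm_ckpts_2release/x2vlm_large_1b.th"
JOBS = [
    ("vqa", "configs/finetune/vqa2_large.yaml"),
    ("refcoco_bbox", "configs/finetune/refcoco_grounding_large.yaml"),
    ("coco_captioning_mlm", "configs/finetune/coco_captioning_large.yaml"),
]

commands = [build_command(task, config, checkpoint=CHECKPOINT) for task, config in JOBS]
for cmd in commands:
    # Print a shell-safe command line; nothing is launched here.
    print(shlex.join(cmd))
```

To launch a job rather than print it, each argument list could be passed to subprocess.run(cmd, check=True). In practice you would probably also give every task its own output directory so that runs do not overwrite each other.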

Citation

If you find this repository useful, please consider giving it a ⭐ or citing:

@article{zeng2022x,
  title={X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks},
  author={Zeng, Yan and Zhang, Xinsong and Li, Hang and Wang, Jiawei and Zhang, Jipeng and Zhou, Wangchunshu},
  journal={arXiv preprint arXiv:2211.12402},
  year={2022}
}

@article{zeng2022cross,
  title={Cross-view language modeling: Towards unified cross-lingual cross-modal pre-training},
  author={Zeng, Yan and Zhou, Wangchunshu and Luo, Ao and Zhang, Xinsong},
  journal={arXiv preprint arXiv:2206.00621},
  year={2022}
}

@article{xvlm,
  title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
  author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
  journal={arXiv preprint arXiv:2111.08276},
  year={2021}
}

Contact

For issues using this code, please submit a GitHub issue.