Download checkpoints for stage1, stage2, and the final model.
mkdir kosmosg_checkpoints
cd kosmosg_checkpoints
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvVmlULUwtMTQtc2QucHQ/c3Y9MjAyMy0wMS0wMyZzdD0yMDI0LTA0LTEwVDEzJTNBMTElM0E0NFomc2U9MjA1MC0wNC0xMVQxMyUzQTExJTNBMDBaJnNyPWMmc3A9ciZzaWc9NGNYSklqVlJaSElCV3FIalBnRG4lMkYwMW9jenBEV1hpcG1QQ1VrM1o4dmJRJTNE" | base64 --decode)
wget -O ViT-L-14-sd.pt $DLINK
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvY2hlY2twb2ludF9zdGFnZTEucHQ/c3Y9MjAyMy0wMS0wMyZzdD0yMDI0LTA0LTEwVDEzJTNBMTElM0E0NFomc2U9MjA1MC0wNC0xMVQxMyUzQTExJTNBMDBaJnNyPWMmc3A9ciZzaWc9NGNYSklqVlJaSElCV3FIalBnRG4lMkYwMW9jenBEV1hpcG1QQ1VrM1o4dmJRJTNE" | base64 --decode)
wget -O checkpoint_stage1.pt $DLINK
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvY2hlY2twb2ludF9zdGFnZTIucHQ/c3Y9MjAyMy0wMS0wMyZzdD0yMDI0LTA0LTEwVDEzJTNBMTElM0E0NFomc2U9MjA1MC0wNC0xMVQxMyUzQTExJTNBMDBaJnNyPWMmc3A9ciZzaWc9NGNYSklqVlJaSElCV3FIalBnRG4lMkYwMW9jenBEV1hpcG1QQ1VrM1o4dmJRJTNE" | base64 --decode)
wget -O checkpoint_stage2.pt $DLINK
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvY2hlY2twb2ludF9maW5hbC5wdD9zdj0yMDIzLTAxLTAzJnN0PTIwMjQtMDQtMTBUMTMlM0ExMSUzQTQ0WiZzZT0yMDUwLTA0LTExVDEzJTNBMTElM0EwMFomc3I9YyZzcD1yJnNpZz00Y1hKSWpWUlpISUJXcUhqUGdEbiUyRjAxb2N6cERXWGlwbVBDVWszWjh2YlElM0Q=" | base64 --decode)
wget -O checkpoint_final.pt $DLINK
You can use our built Docker Image
docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ xichenpan/kosmosg:v1 /bin/bash
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/
You can also start with NVIDIA Official Docker Image, and install all dependencies manually.
docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash
apt-get install -y libsm6 libxext6 libxrender-dev
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
bash vl_setup.sh
Make sure you have Pytorch 1.13.0 and nvcc 11.x installed.
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
bash vl_setup.sh
If you would like to host a local Gradio demo, run the following command after setup:
bash runapp.sh
Be sure to adjust the guidance scale if you find the default one leads to over-saturated images.
Refer to this guide to prepare the dataset.
After preparing the data, run the following command to train the model. Be sure to change the directories in the script to your own. For the image decoder aligning stage:
bash runalign.sh
For the instruction tuning stage:
bash runtrain.sh
Download and unzip the COCO (2014) val set:
mkdir coco
cd coco
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip val2014.zip
Specify the cfg in sample_kosmosg_coco.py
and run the script to evaluate:
bash runeval_coco.sh
Download DreamBench:
mkdir dreambench
cd dreambench
git clone https://github.com/google/dreambooth.git
We keep only one image for each entity as described in our paper.
bash scripts/remove_dreambench_multiimg.sh /path/to/dreambench/dreambooth/dataset
Specify the cfg in sample_kosmosg_dreambench.py
and run the script to evaluate:
bash runeval_dreambench.sh
If you find this repository useful, please consider citing our work:
@article{kosmos-g,
title={{Kosmos-G}: Generating Images in Context with Multimodal Large Language Models},
author={Xichen Pan and Li Dong and Shaohan Huang and Zhiliang Peng and Wenhu Chen and Furu Wei},
journal={ArXiv},
year={2023},
volume={abs/2310.02992}
}
Kosmos-G is purely a research project. Currently, we have no plans to incorporate Kosmos-G into a product or expand access to the public. We will also put Microsoft AI principles into practice when further developing the models.
In our research paper, we account for the ethical concerns associated with text-to-image research. To mitigate issues associated with training data, we have implemented a rigorous filtering process to purge our training data of inappropriate content, such as explicit imagery and offensive language, to minimize the likelihood of generating inappropriate content.
This repository is built using torchscale, fairseq, openclip. We thank the authors of Nerfies that kindly open sourced the template of the project page.
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Microsoft Open Source Code of Conduct
For help or issues using models, please submit a GitHub issue.