This project aims to develop a series of strong, open-source fundamental image recognition models.
Recognize Anything Plus Model (RAM++) [Paper]
RAM++ is the next generation of RAM, which can recognize any category with high accuracy, including both predefined common categories and diverse open-set categories.
Recognize Anything Model (RAM) [Paper][Demo]
RAM is an image tagging model, which can recognize any common category with high accuracy.
RAM is accepted at CVPR 2024 Multimodal Foundation Models Workshop.
Tag2Text (ICLR 2024) [Paper] [Demo]
Tag2Text is a vision-language model guided by tagging, which can support tagging and comprehensive captioning simultaneously.
Tag2Text is accepted at ICLR 2024! See you in Vienna!
RAM++ outperforms existing SOTA fundamental image recognition models on common tag categories, uncommon tag categories, and human-object interaction phrases.
Comparison of zero-shot image recognition performance.
We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) and developed a strong visual semantic analysis pipeline in the Grounded-SAM project.
(Green indicates fully supervised learning; all other colors indicate zero-shot performance.)
RAM++ demonstrates a significant improvement in open-set category recognition.
(Green indicates fully supervised learning; blue indicates zero-shot performance.)
Tag2Text generates more comprehensive captions with tagging guidance.
Tag2Text provides tags as additional visible alignment indicators.
These annotation files come from Tag2Text and RAM. Tag2Text automatically extracts image tags from image-text pairs. RAM further augments both tags and texts via an automatic data engine.
Dataset | Size | Images | Texts | Tags |
---|---|---|---|---|
COCO | 168 MB | 113K | 680K | 3.2M |
VG | 55 MB | 100K | 923K | 2.7M |
SBU | 234 MB | 849K | 1.7M | 7.6M |
CC3M | 766 MB | 2.8M | 5.6M | 28.2M |
CC3M-val | 3.5 MB | 12K | 26K | 132K |
CC12M will be released in the next update.
These tag description files were generated for RAM++ by calling the GPT API. You can also customize the tag categories with generate_tag_des_llm.py.
Tag Descriptions | Tag List |
---|---|
RAM Tag List | 4,585 |
OpenImages Uncommon | 200 |
Note: you need to create a 'pretrained' folder and download these checkpoints into it.
No. | Name | Backbone | Data | Illustration | Checkpoint |
---|---|---|---|---|---|
1 | RAM++ (14M) | Swin-Large | COCO, VG, SBU, CC3M, CC3M-val, CC12M | Provides strong image tagging ability for any category. | Download link |
2 | RAM (14M) | Swin-Large | COCO, VG, SBU, CC3M, CC3M-val, CC12M | Provides strong image tagging ability for common categories. | Download link |
3 | Tag2Text (14M) | Swin-Base | COCO, VG, SBU, CC3M, CC3M-val, CC12M | Supports comprehensive captioning and tagging. | Download link |
conda create -n recognize-anything python=3.8 -y
conda activate recognize-anything
Install recognize-anything as a package:
pip install git+https://github.com/xinyu1205/recognize-anything.git
Or, for development, install from source in editable mode:
git clone https://github.com/xinyu1205/recognize-anything.git
cd recognize-anything
pip install -e .
Then the RAM++, RAM, and Tag2Text models can be imported in other projects:
from ram.models import ram_plus, ram, tag2text
Get the English and Chinese outputs of the images:
python inference_ram_plus.py --image images/demo/demo1.jpg --pretrained pretrained/ram_plus_swin_large_14m.pth
The output will look like the following:
Image Tags: armchair | blanket | lamp | carpet | couch | dog | gray | green | hassock | home | lay | living room | picture frame | pillow | plant | room | wall lamp | sit | wood floor
图像标签: 扶手椅 | 毯子/覆盖层 | 灯 | 地毯 | 沙发 | 狗 | 灰色 | 绿色 | 坐垫/搁脚凳/草丛 | 家/住宅 | 躺 | 客厅 | 相框 | 枕头 | 植物 | 房间 | 壁灯 | 坐/放置/坐落 | 木地板
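If you want to post-process the printed tags in another script, the pipe-separated lines above can be split into a list. The following is a minimal sketch that assumes the output keeps the "label: tag | tag | ..." shape shown in the sample; parse_tag_line is a hypothetical helper and not part of the recognize-anything package.

```python
from typing import List


def parse_tag_line(line: str) -> List[str]:
    """Split a 'label: tag | tag | ...' line into a list of tag strings."""
    # Everything before the first colon is the label ("Image Tags" or "图像标签").
    _, _, tags_part = line.partition(":")
    # Tags are separated by '|'; surrounding whitespace is not part of the tag.
    return [tag.strip() for tag in tags_part.split("|") if tag.strip()]


english = parse_tag_line("Image Tags: armchair | blanket | lamp | living room")
chinese = parse_tag_line("图像标签: 扶手椅 | 毯子/覆盖层 | 灯 | 客厅")
print(english)
print(chinese)
```

Chinese tags that list alternatives with "/" (for example 毯子/覆盖层) are kept as a single entry, since the sample output treats them as one tag.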
We have released the LLM tag descriptions of OpenImages-Uncommon categories in openimages_rare_200_llm_tag_descriptions.
python inference_ram_plus_openset.py --image images/openset_example.jpg \
--pretrained pretrained/ram_plus_swin_large_14m.pth \
--llm_tag_des datasets/openimages_rare_200/openimages_rare_200_llm_tag_descriptions.json
The output will look like the following:
Image Tags: Close-up | Compact car | Go-kart | Horse racing | Sport utility vehicle | Touring car
To modify the categories, call the GPT API to generate the corresponding tag descriptions:
python generate_tag_des_llm.py \
--openai_api_key 'your openai api key' \
--output_file_path datasets/openimages_rare_200/openimages_rare_200_llm_tag_descriptions.json
We release two datasets: OpenImages-common (214 common tag classes) and OpenImages-rare (200 uncommon tag classes). Copy or symlink the test images of OpenImages v6 to datasets/openimages_common_214/imgs/ and datasets/openimages_rare_200/imgs.
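The symlink step can be scripted. Below is a minimal sketch using only the Python standard library; link_images is a hypothetical helper, the list of image extensions is an assumption, and the demo runs on throwaway directories rather than a real OpenImages download.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")  # assumed image extensions


def link_images(src_dir, dst_dir):
    """Symlink each image file in src_dir into dst_dir; return how many were linked."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    linked = 0
    for image in sorted(src.iterdir()):
        if not image.is_file() or image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        target = dst / image.name
        # Skip names that already exist so the helper can be re-run safely.
        if target.exists() or target.is_symlink():
            continue
        target.symlink_to(image.resolve())
        linked += 1
    return linked


# Demo on temporary directories. In real use, pass your OpenImages v6 test
# folder and datasets/openimages_common_214/imgs/ (or the rare_200 folder).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "openimages_test"  # hypothetical source folder
    source.mkdir()
    (source / "0001.jpg").write_bytes(b"")
    (source / "notes.txt").write_text("not an image")
    target_dir = Path(tmp) / "datasets" / "openimages_common_214" / "imgs"
    print("linked:", link_images(source, target_dir))
```

Symlinks avoid duplicating the test images on disk when both datasets point at the same OpenImages folder.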
To evaluate RAM++ on OpenImages-common:
python batch_inference.py \
--model-type ram_plus \
--checkpoint pretrained/ram_plus_swin_large_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/ram_plus
To evaluate RAM++ open-set capability on OpenImages-rare:
python batch_inference.py \
--model-type ram_plus \
--checkpoint pretrained/ram_plus_swin_large_14m.pth \
--open-set \
--dataset openimages_rare_200 \
--output-dir outputs/ram_plus_openset
To evaluate RAM on OpenImages-common:
python batch_inference.py \
--model-type ram \
--checkpoint pretrained/ram_swin_large_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/ram
To evaluate RAM open-set capability on OpenImages-rare:
python batch_inference.py \
--model-type ram \
--checkpoint pretrained/ram_swin_large_14m.pth \
--open-set \
--dataset openimages_rare_200 \
--output-dir outputs/ram_openset
To evaluate Tag2Text on OpenImages-common:
python batch_inference.py \
--model-type tag2text \
--checkpoint pretrained/tag2text_swin_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/tag2text
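The evaluation commands above differ only in model type, checkpoint, dataset, output directory, and the open-set switch, so they can be generated from a small helper when scripting several runs. This is a sketch only: build_eval_command is a hypothetical function, and it uses just the flags that appear in this README.

```python
import shlex
from typing import List, Optional


def build_eval_command(model_type: str, checkpoint: str, dataset: str,
                       output_dir: str, open_set: bool = False,
                       threshold: Optional[float] = None) -> List[str]:
    """Assemble the argv list for one batch_inference.py run."""
    cmd = [
        "python", "batch_inference.py",
        "--model-type", model_type,
        "--checkpoint", checkpoint,
        "--dataset", dataset,
        "--output-dir", output_dir,
    ]
    if open_set:
        cmd.append("--open-set")
    if threshold is not None:
        cmd.append(f"--threshold={threshold}")
    return cmd


cmd = build_eval_command(
    "ram_plus",
    "pretrained/ram_plus_swin_large_14m.pth",
    "openimages_rare_200",
    "outputs/ram_plus_openset",
    open_set=True,
)
# Print the command for inspection; pass cmd to subprocess.run(cmd, check=True)
# from the repository root to actually launch it.
print(shlex.join(cmd))
```

Keeping the command as a list (rather than one string) avoids shell quoting problems if a path contains spaces.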
Please refer to batch_inference.py for more options. To get the P/R in Table 3 of the RAM paper, pass --threshold=0.86 for RAM and --threshold=0.68 for Tag2Text.
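To see why the threshold matters for precision and recall, here is a small sketch on toy data. It assumes each tag receives a score in [0, 1] and is predicted when the score reaches the threshold, and it uses micro-averaged precision/recall; the scores, tags, and averaging choice are illustrative assumptions and may differ from what batch_inference.py and the paper actually compute.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(scores: List[Dict[str, float]], truths: List[Set[str]],
                     threshold: float) -> Tuple[float, float]:
    """Micro-averaged precision and recall after thresholding per-tag scores."""
    tp = fp = fn = 0
    for image_scores, truth in zip(scores, truths):
        predicted = {tag for tag, score in image_scores.items() if score >= threshold}
        tp += len(predicted & truth)   # predicted and correct
        fp += len(predicted - truth)   # predicted but wrong
        fn += len(truth - predicted)   # correct but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy per-image tag scores and ground-truth tags (made up for illustration).
scores = [
    {"dog": 0.95, "couch": 0.80, "cat": 0.70},
    {"car": 0.90, "road": 0.60, "dog": 0.72},
]
truths = [{"dog", "couch"}, {"car", "road"}]

for t in (0.5, 0.85):
    p, r = precision_recall(scores, truths, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.85 lifts precision from about 0.67 to 1.0 while recall drops from 1.0 to 0.5, which is the trade-off the --threshold option controls.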
To run batch inference on custom images, set up your own dataset following the layout of the two given datasets.
Download the RAM training datasets, where each json file contains a list. Each item in the list is a dictionary with three key-value pairs: {'image_path': path_of_image, 'caption': text_of_image, 'union_label_id': image tags for tagging, including parsed tags and pseudo tags}.
In ram/configs/pretrain.yaml, set 'train_file' to the paths of the json files.
Prepare the pretrained Swin-Transformer, and set 'ckpt' in ram/configs/swin.
Download the RAM++ frozen tag embedding file "ram_plus_tag_embedding_class_4585_des_51.pth", and place it at "ram/data/frozen_tag_embedding/ram_plus_tag_embedding_class_4585_des_51.pth".
Pre-train the model using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
--model-type ram_plus \
--config ram/configs/pretrain.yaml \
--output-dir outputs/ram_plus
Fine-tune the pre-trained model:
python -m torch.distributed.run --nproc_per_node=8 finetune.py \
--model-type ram_plus \
--config ram/configs/finetune.yaml \
--checkpoint outputs/ram_plus/checkpoint_04.pth \
--output-dir outputs/ram_plus_ft
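Before launching training on your own data, it can help to check that an annotation file matches the format described in the dataset step: a list of dictionaries with the keys image_path, caption, and union_label_id. The sketch below is a hypothetical validator, not part of the repository; the example record and the assumption that union_label_id is a list of integer tag ids are made up for illustration.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("image_path", "caption", "union_label_id")


def validate_annotations(items: Any) -> List[Dict[str, Any]]:
    """Check that items is a list of dicts carrying the three required keys."""
    if not isinstance(items, list):
        raise ValueError("top-level JSON value must be a list")
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not a dictionary")
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"item {index} is missing keys: {missing}")
    return items


# Hypothetical record; real files come from the RAM training datasets.
example = [
    {
        "image_path": "images/example_0001.jpg",
        "caption": "a dog lying on a couch",
        "union_label_id": [12, 345],  # assumed to be integer tag ids
    }
]

# Round-trip through JSON text, as if the list had been read from a file.
loaded = validate_annotations(json.loads(json.dumps(example)))
print(len(loaded), "record(s) OK")
```

For a real file, replace the round-trip with json.load on the opened annotation file listed under 'train_file' in ram/configs/pretrain.yaml.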
If you find our work to be useful for your research, please consider citing.
@article{huang2023open,
title={Open-Set Image Tagging with Multi-Grained Text Supervision},
author={Huang, Xinyu and Huang, Yi-Jie and Zhang, Youcai and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Xie, Yanchun and Li, Yaqian and Zhang, Lei},
journal={arXiv e-prints},
pages={arXiv--2310},
year={2023}
}
@article{zhang2023recognize,
title={Recognize Anything: A Strong Image Tagging Model},
author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
journal={arXiv preprint arXiv:2306.03514},
year={2023}
}
@article{huang2023tag2text,
title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
journal={arXiv preprint arXiv:2303.05657},
year={2023}
}
This work was done with the help of the amazing BLIP code base. Thanks very much!
We want to thank @Cheng Rui @Shilong Liu @Ren Tianhe for their help in marrying RAM/Tag2Text with Grounded-SAM.
We also want to thank Ask-Anything and Prompt-can-anything for combining RAM/Tag2Text, which greatly expands the application boundaries of RAM/Tag2Text.