Open adbmdp opened 5 months ago
Thanks for your attention. Yes, this is certainly feasible. The model's performance will depend on the quality of your fine-tuning dataset.
Thanks for your reply and your awesome work @xinyu1205 !!
OK let's say I want to train the model with a celebrity dataset.
I have trouble understanding which tag file I need to update with the new tags.
As I understand it:
parse_label_id refers to the tag indices in ram/data/tag_list.txt
union_label_id refers to the tag indices in ram/data/ram_tag_list.txt
But when I look at the COCO dataset, for example, I find entries like this:
{
"image_path":"coco/val2014/COCO_val2014_000000522418.jpg",
"parse_label_id":[
[
4480,
4532,
678
]
],
"caption":[
"there is a woman that is cutting a white cake"
],
"union_label_id":[
4480,
2624,
2051,
678,
2599,
2577,
4532,
1238,
215,
2332,
4439
]
}
For parse_label_id, I should find the IDs in the ram/data/tag_list.txt file, right?
But this file only has 3429 IDs, and I see an ID of 4480!
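One way to check which tag list an ID can belong to is to load each list and look the IDs up. The sketch below assumes one tag per line and that an ID is the zero-based line index into the file, neither of which is confirmed in this thread; load_tag_list and lookup_tags are hypothetical helpers, not part of the RAM code base.

```python
from pathlib import Path


def load_tag_list(path):
    """Read a tag list file, assuming one tag per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def lookup_tags(tag_list, ids):
    """Map each ID to its tag, or to None when the ID is out of range.

    Assumption: an ID is the zero-based line index into the tag list.
    """
    return {i: tag_list[i] if 0 <= i < len(tag_list) else None for i in ids}


if __name__ == "__main__":
    # Stand-in list with 3429 entries, matching the line count reported above.
    short_list = [f"tag_{n}" for n in range(3429)]
    print(lookup_tags(short_list, [4480, 4532, 678]))
```

With the real files, a None result from lookup_tags(load_tag_list("ram/data/tag_list.txt"), [4480]) would show that 4480 cannot be an index into that file, which would suggest the ID comes from a longer list.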
So, to summarize: if I want to modify only the tagging part of RAM++, in which file should I add my tags (maybe just one)? And can my dataset look something like this:
{
"image_path":"datasets/celebrities/CELEB_00001.jpg",
"parse_label_id":[
[
9999
]
],
"caption":[
"Michael Jordan"
],
"union_label_id":[
8888
]
}
parse_label_id refers to the tags parsed from the image caption.
union_label_id refers to the full set of tags for the image.
Therefore, if you only have an image-tag dataset, you just need to set the image tags as union_label_id. And you only need the loss_tag and loss_dis in RAM or RAM++.
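Following the reply above, an image-tag-only entry could be built as sketched here. This assumes that a tag's ID is its zero-based position in the tag list and that the training file is a JSON list of such entries; make_entry and the stand-in tag list in the demo are hypothetical.

```python
import json


def make_entry(image_path, tags, tag_list):
    """Build one annotation entry that carries only union_label_id.

    Assumption: a tag's ID is its zero-based position in tag_list.
    """
    index = {tag: i for i, tag in enumerate(tag_list)}
    missing = [t for t in tags if t not in index]
    if missing:
        raise ValueError(f"tags not in tag list: {missing}")
    return {
        "image_path": image_path,
        "union_label_id": sorted(index[t] for t in tags),
    }


if __name__ == "__main__":
    # Stand-in tag list; the real one would be read from the tag list file.
    tag_list = ["basketball", "sport", "Michael Jordan"]
    entry = make_entry(
        "datasets/celebrities/CELEB_00001.jpg",
        ["Michael Jordan", "basketball"],
        tag_list,
    )
    print(json.dumps([entry], indent=2))
```

Whether the data loader also expects caption or parse_label_id keys is not settled in this thread, so those fields may still need to be present.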
Ok so I just need:
{
"image_path":"datasets/celebrities/CELEB_00001.jpg",
"caption":[
"Michael Jordan"
],
"union_label_id":[
new id from ram/data/ram_tag_list.txt
]
}
"And you only need the loss_tag and loss_dis in RAM or RAM++"
I don't know what you mean here, but I'll try to find out. Do I have to change some code in finetune.py?
Thanks again for taking the time to reply 👍 🥇
It means you need to modify the forward function in ram.py or ram_plus.py. I also strongly recommend that you read the RAM or RAM++ paper before attempting this.
Thanks. I'll do that.
So I'm trying to fine-tune the model on just one tag as a test (on my CPU).
I've added a new tag to recognize-anything/ram/data/ram_tag_list.txt, so there are now 4586 lines in this file.
I've modified the forward function:
def forward(self, image, caption, image_tag, clip_feature, batch_text_embed):
    image_embeds = self.image_proj(self.visual_encoder(image))
    image_atts = torch.ones(image_embeds.size()[:-1],
                            dtype=torch.long).to(image.device)
    ##================= Distillation from CLIP ================##
    image_cls_embeds = image_embeds[:, 0, :]
    image_spatial_embeds = image_embeds[:, 1:, :]
    loss_dis = F.l1_loss(image_cls_embeds, clip_feature)
    ###===========multi tag des reweight==============###
    bs = image_embeds.shape[0]
    des_per_class = int(self.label_embed.shape[0] / self.num_class)
    image_cls_embeds = image_cls_embeds / image_cls_embeds.norm(dim=-1, keepdim=True)
    reweight_scale = self.reweight_scale.exp()
    logits_per_image = (reweight_scale * image_cls_embeds @ self.label_embed.t())
    logits_per_image = logits_per_image.view(bs, -1, des_per_class)
    weight_normalized = F.softmax(logits_per_image, dim=2)
    label_embed_reweight = torch.empty(bs, self.num_class, 512).to(image.device).to(image.dtype)
    for i in range(bs):
        reshaped_value = self.label_embed.view(-1, des_per_class, 512)
        product = weight_normalized[i].unsqueeze(-1) * reshaped_value
        label_embed_reweight[i] = product.sum(dim=1)
    label_embed = torch.nn.functional.relu(self.wordvec_proj(label_embed_reweight))
    ##================= Image Tagging ================##
    tagging_embed = self.tagging_head(
        encoder_embeds=label_embed,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_atts,
        return_dict=False,
        mode='tagging',
    )
    logits = self.fc(tagging_embed[0]).squeeze(-1)
    loss_tag = self.tagging_loss_function(logits, image_tag)
    # Ignore the image-text alignment loss
    loss_alignment = None
    # Return the loss_tag and loss_dis losses
    return loss_tag, loss_dis
Here is my finetune.yaml file:
train_file: [
'outputs/data.json',
]
image_path_root: ""
# size of vit model; base or large
vit: 'swin_l'
vit_grad_ckpt: False
vit_ckpt_layer: 0
image_size: 384
batch_size: 26
# optimizer
weight_decay: 0.05
init_lr: 5e-06
min_lr: 0
max_epoch: 2
warmup_steps: 3000
class_num: 4586
I launch the fine-tuning like this, and it fails with the RuntimeError shown after the command:
python3 finetune.py --model-type ram_plus --config ram/configs/finetune.yaml --checkpoint outputs/ram_plus/ram_plus_swin_large_14m.pth --output-dir outputs/ram_plus_ft --device cpu
RuntimeError: Error(s) in loading state_dict for RAM_plus:
size mismatch for label_embed: copying a param with shape torch.Size([233835, 512]) from checkpoint, the shape in current model is torch.Size([233886, 512]).
I think the error message indicates that there is a size mismatch between the pre-trained model's label_embed layer and the current model's label_embed layer. This is likely due to a difference in the number of tags or classes between the pre-trained model and the current model. But I have no clue how to resolve this.
Thanks!
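The two sizes in the error message can be checked against the forward function above, where des_per_class is label_embed.shape[0] / num_class. Below is a small arithmetic sketch, assuming the checkpoint was trained with 4585 tags (the line count of the tag list before the new tag was added); descriptions_per_class and rows_for are hypothetical helpers.

```python
def descriptions_per_class(label_embed_rows, num_class):
    """Rows of label_embed per tag, mirroring des_per_class in forward()."""
    per_class, remainder = divmod(label_embed_rows, num_class)
    if remainder:
        raise ValueError("rows are not a multiple of num_class")
    return per_class


def rows_for(num_class, per_class):
    """Number of label_embed rows a model with num_class tags would allocate."""
    return num_class * per_class


# Assumption: the released checkpoint covers 4585 tags.
per_class = descriptions_per_class(233835, 4585)
print(per_class)                  # 51
print(rows_for(4586, per_class))  # 233886, the size in the error message
```

If this reading is right, adding one tag makes the model expect 51 more rows in label_embed than the checkpoint provides. How to supply those rows, or how to load the checkpoint while skipping the mismatched tensor, is not answered in this thread.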
Hello,
I would like to fine-tune RAM++ tagging with other datasets. I spent a lot of time trying to understand how it works, but there are still quite a few points that I don't understand :-(
Before I ask my question: would it be possible, for example, to train the model on a dataset of personalities (pictures and names), so that RAM++ can tag them when I call inference_ram_plus.py? A basic example would be a photo of Michael Jordan throwing a ball, for which I would get a list of tags like:
Michael Jordan | basketball | basket | sport ...
Also, would it be possible to train the model with more complex action pictures, for example MMA / UFC (Mixed Martial Arts) pictures? Then, by analyzing an image, RAM++ would be able to give me a list of tags like:
ground fight | strike | top mount position
stand up | uppercut
Thanks!