xinyu1205 / recognize-anything

Open-source and strong foundation image recognition models.
Apache License 2.0

Finetuning question #173

Open adbmdp opened 5 months ago

adbmdp commented 5 months ago


I would like to finetune RAM++ tagging with other datasets. I spent a lot of time trying to understand how it works. But there are still quite a few points that I don't understand :-(

But before I ask my question, would it be possible, for example, to train the model on a dataset with personalities (pictures and name)? So that RAM++ can tag them when I call A basic example would be a photo of Michael Jordan throwing a ball and I get a list of tags like: Michael Jordan | basketball | basket | sport ...

Also, would it be possible to train the model with more complex action pictures? For example MMA / UFC (Mixed Martial Arts) pictures. Then, by analyzing an image, RAM++ would be able to give me a list of tags like: ground fight | strike | top mount position | stand up | uppercut


xinyu1205 commented 5 months ago

Thanks for your attention. Actually, this is certainly feasible. The performance of the model depends on the quality of your finetune dataset.

adbmdp commented 5 months ago

Thanks for your reply and your awesome work @xinyu1205 !!

OK let's say I want to train the model with a celebrity dataset.

I have trouble understanding which tag file I need to update with the new tags. As I understand it:

parse_label_id refers to the tag indices present in ram/data/tag_list.txt
union_label_id refers to the tag indices present in ram/data/ram_tag_list.txt

But when I look at the COCO dataset, for example, I find an entry like this:

    "there is a woman that is cutting a white cake"

For parse_label_id, I should find the id in the ram/data/tag_list.txt file, right? This file only has 3429 IDs, yet I see an id of 4480!

So, to summarize: if I want to modify only the tagging part of RAM++, in which file should I add my tags (maybe just one)? And can my dataset be something like:

    "Michael Jordan"

xinyu1205 commented 5 months ago

parse_label_id refers to the tags parsed from the image caption.
union_label_id refers to the full set of tags for the image.

Therefore, if you only have an image-tag dataset, you just need to set the image tags as union_label_id. And you only need the loss_tag and loss_dis in RAM or RAM++.
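For an image-tag dataset, one annotation record could be assembled as in the sketch below. The key names `image_path`, `caption`, `union_label_id`, and `parse_label_id` are assumed to match what the finetuning data loader reads; the helper `make_entry`, the example path, and the example ids are hypothetical.

```python
import json


def make_entry(image_path, tag_ids, caption=None):
    """Build one annotation record for an image-tag dataset.

    Assumption: the data loader reads the four keys used below.
    Without a caption there is nothing to parse, so parse_label_id
    stays empty and every tag goes into union_label_id.
    """
    return {
        "image_path": image_path,
        "caption": [caption] if caption else [],
        # De-duplicate and sort so the record is stable across runs.
        "union_label_id": sorted(set(tag_ids)),
        "parse_label_id": [],
    }


# Hypothetical image path and tag ids, for illustration only.
entry = make_entry("images/jordan_001.jpg", [4585, 120, 120])
print(json.dumps(entry))
```

Whether the loader accepts an empty caption list is not confirmed in this thread, so it may be necessary to supply a placeholder caption.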

adbmdp commented 5 months ago

Ok so I just need:

    "Michael Jordan"
     new id from ram/data/ram_tag_list.txt

"And you only need the loss_tag and loss_dis in RAM or RAM++"

I don't know what you mean here, but I'll try to find out. Do I have to change some code in

Thanks again for taking the time to reply 👍 🥇

xinyu1205 commented 5 months ago

It means you need to modify the forward function of or And I strongly recommend that you read the RAM or RAM++ paper before completing these tasks.

adbmdp commented 5 months ago

Thanks. I'll do that.

adbmdp commented 4 months ago

So I'm trying to fine-tune the model on just one tag as a test (on my CPU). I've added a new tag to recognize-anything/ram/data/ram_tag_list.txt, so there are now 4586 lines in this file.
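A quick way to find the id of a newly added tag is to index the tag list by line position, as sketched below. This assumes that a tag's id is its zero-based line index in the tag file; `load_tag_index` is a hypothetical helper, and the three sample lines stand in for the real file.

```python
def load_tag_index(lines):
    """Map each tag to its zero-based position in the tag list.

    Assumption: tag ids are zero-based line indices in the tag file.
    """
    tags = [line.strip() for line in lines if line.strip()]
    index = {}
    for i, tag in enumerate(tags):
        if tag in index:
            raise ValueError(f"duplicate tag: {tag!r}")
        index[tag] = i
    return index


# Stand-in for: open("ram/data/ram_tag_list.txt", encoding="utf-8")
sample_lines = ["basketball\n", "sport\n", "Michael Jordan\n"]
tag_index = load_tag_index(sample_lines)
print(tag_index["Michael Jordan"], len(tag_index))
```

On the real file, `len(tag_index)` should equal the `class_num` value in the config (4586 after adding one tag).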

I've modified the forward function:

def forward(self, image, caption, image_tag, clip_feature, batch_text_embed):
        image_embeds = self.image_proj(self.visual_encoder(image))
        image_atts = torch.ones(image_embeds.size()[:-1],
                                dtype=torch.long).to(image.device)

        ##================= Distillation from CLIP ================##
        image_cls_embeds = image_embeds[:, 0, :]
        image_spatial_embeds = image_embeds[:, 1:, :]

        loss_dis = F.l1_loss(image_cls_embeds, clip_feature)

        ###===========multi tag des reweight==============###
        bs = image_embeds.shape[0]

        des_per_class = int(self.label_embed.shape[0] / self.num_class)

        image_cls_embeds = image_cls_embeds / image_cls_embeds.norm(dim=-1, keepdim=True)
        reweight_scale = self.reweight_scale.exp()
        logits_per_image = (reweight_scale * image_cls_embeds @ self.label_embed.t())
        logits_per_image = logits_per_image.view(bs, -1, des_per_class)

        weight_normalized = F.softmax(logits_per_image, dim=2)
        label_embed_reweight = torch.empty(bs, self.num_class, 512).to(image.device).to(image.dtype)

        for i in range(bs):
            reshaped_value = self.label_embed.view(-1, des_per_class, 512)
            product = weight_normalized[i].unsqueeze(-1) * reshaped_value
            label_embed_reweight[i] = product.sum(dim=1)

        label_embed = torch.nn.functional.relu(self.wordvec_proj(label_embed_reweight))

        ##================= Image Tagging ================##

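        tagging_embed = self.tagging_head(
            encoder_embeds=label_embed,
            encoder_hidden_states=image_embeds,
            encoder_attention_mask=image_atts,
            return_dict=False,
            mode='tagging',
        )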

        logits = self.fc(tagging_embed[0]).squeeze(-1)

        loss_tag = self.tagging_loss_function(logits, image_tag)

        # Skip the text-image alignment loss
        loss_alignment = None

        # Return only the loss_tag and loss_dis losses
        return loss_tag, loss_dis

Here is my finetune.yaml file :

train_file: [
image_path_root: ""

# size of vit model; base or large
vit: 'swin_l'
vit_grad_ckpt: False
vit_ckpt_layer: 0

image_size: 384
batch_size: 26

# optimizer
weight_decay: 0.05
init_lr: 5e-06
min_lr: 0
max_epoch: 2
warmup_steps: 3000

class_num: 4586

I launch the fine-tuning like this:

    python3 --model-type ram_plus --config ram/configs/finetune.yaml --checkpoint outputs/ram_plus/ram_plus_swin_large_14m.pth --output-dir outputs/ram_plus_ft --device cpu

And I get this error:

RuntimeError: Error(s) in loading state_dict for RAM_plus:
    size mismatch for label_embed: copying a param with shape torch.Size([233835, 512]) from checkpoint, the shape in current model is torch.Size([233886, 512]).

I think the error message indicates that there is a size mismatch between the pre-trained model's label_embed layer and the current model's label_embed layer. This is likely due to a difference in the number of tags or classes between the pre-trained model and the current model. But I have no clue how to resolve this.
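The two shapes in the error are consistent with a fixed number of description embeddings per tag: 233835 = 4585 × 51 and 233886 = 4586 × 51, so the current model expects 51 extra rows for the one added tag. The sketch below checks that arithmetic and lists mismatched parameters using plain shape tuples. The helper names are hypothetical, and the `reweight_scale` shape is assumed for illustration only.

```python
def rows_per_class(num_rows, num_classes):
    """Number of embedding rows stored for each class."""
    if num_rows % num_classes:
        raise ValueError("rows are not an even multiple of classes")
    return num_rows // num_classes


def shape_mismatches(ckpt_shapes, model_shapes):
    """Return {name: (ckpt_shape, model_shape)} for shared keys that differ."""
    return {
        name: (shape, model_shapes[name])
        for name, shape in ckpt_shapes.items()
        if name in model_shapes and model_shapes[name] != shape
    }


# label_embed shapes are taken from the error message above;
# the reweight_scale shape is an assumption for illustration.
ckpt = {"label_embed": (233835, 512), "reweight_scale": ()}
model = {"label_embed": (233886, 512), "reweight_scale": ()}

per_class = rows_per_class(233835, 4585)
missing = model["label_embed"][0] - ckpt["label_embed"][0]
print(per_class, missing, shape_mismatches(ckpt, model))
```

With PyTorch, the shape dictionaries can be built with `{k: tuple(v.shape) for k, v in state_dict.items()}`. Note that `load_state_dict(strict=False)` still raises on size mismatches, so one possible workaround is to drop the mismatched keys from the checkpoint before loading, or to append rows for the new tag to the checkpoint's `label_embed`. Which option suits RAM++ is not settled in this thread.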
