xinyu1205 / recognize-anything

Open-source and strong foundation image recognition models.
https://recognize-anything.github.io/
Apache License 2.0

How was the CLIP image encoder used to create image embeddings? #94

Closed vinsis closed 9 months ago

vinsis commented 9 months ago

Hi @xinyu1205, thank you for sharing this work. The paper mentions using CLIP to obtain image embeddings:

We also adopt the CLIP image encoder to distill image feature, which further improves the model’s recognition ability for unseen categories via image-text feature alignment.

I cannot find further details about this in the paper or the code. Could you elaborate on:

i) How were the image embeddings calculated using the CLIP ViT?
ii) Did you combine the CLIP ViT embeddings with the SwinL embeddings? If so, how?

xinyu1205 commented 9 months ago

Hi, thanks for your attention. The motivation is that the CLIP image encoder and text encoder are already aligned, so we want to preserve this property through distillation. i) At each step, we directly feed the batch of images into the CLIP ViT to get the CLS image embedding; ii) then we apply an L2 loss between the CLIP CLS image embedding and the Swin global image embedding (a rough sketch is given after the references below). I recommend also referring to the following two papers, which adopt similar ideas.

[1] Open-vocabulary Multi-label Classification via Multi-modal Knowledge Transfer. AAAI 2023.
[2] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. ICLR 2022.
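
For illustration, here is a minimal sketch of the distillation step described above, assuming a CLIP model that exposes encode_image and a Swin backbone whose global embedding has been projected to the CLIP embedding dimension. All names below are illustrative, not the actual RAM training code.

import torch
import torch.nn.functional as F

# Hypothetical names: model_clip is a frozen CLIP model, swin_encoder is the
# RAM/Swin image backbone, proj maps the Swin global embedding to CLIP's dim.
with torch.no_grad():
    clip_cls_embed = model_clip.encode_image(images)    # CLIP CLS image embedding

swin_global_embed = proj(swin_encoder(images))          # Swin global image embedding

# L2 (MSE) distillation loss that pulls the Swin embedding toward CLIP's,
# so the Swin features stay aligned with the CLIP text encoder as well.
loss_dis = F.mse_loss(swin_global_embed, clip_cls_embed)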

vinsis commented 9 months ago

Thank you

jurijsnazarovsambientai commented 5 months ago

I saw in the fine-tuning code that you use model_clip to get the image embedding and compute the distillation loss. However, it looks like you don't wrap the CLIP image embedding computation in torch.no_grad(). Wouldn't the distillation then also update the gradients of the CLIP image encoder, in addition to the RAM image encoder?

xinyu1205 commented 5 months ago

Hi, thanks very much for pointing out this issue! In my original code base, I set requires_grad to False for the CLIP model's parameters, but due to my negligence, this was not included in the open-source code base.

# Freeze all CLIP parameters so distillation does not update the CLIP encoder
for _, param in model_clip.named_parameters():
    param.requires_grad = False

I have updated the code and added the snippet above together with torch.no_grad(). Details are provided in https://github.com/xinyu1205/recognize-anything/commit/dc5402256107b78bc256b8cf0f251a5c89558560. Once again, thank you very much for your reply, which helps refine the RAM code base. Best regards!
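
For reference, a minimal sketch of how the two safeguards interact in the training step, assuming the frozen model_clip from the snippet above; ram_image_embeds is an illustrative name for the Swin global image embedding:

import torch
import torch.nn.functional as F

# requires_grad = False keeps the optimizer from ever updating CLIP's weights;
# torch.no_grad() additionally skips building the autograd graph for the CLIP
# forward pass, saving memory and compute.
with torch.no_grad():
    clip_image_embeds = model_clip.encode_image(images)

loss_dis = F.mse_loss(ram_image_embeds, clip_image_embeds)
loss_dis.backward()   # gradients flow only into the RAM/Swin branch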

jurijsnazarovsambientai commented 5 months ago

No problem, glad my understanding was correct, and thanks for getting back to me so quickly.

By the way, will you be able to share weights for a trained model based on Swin-T, or whichever is the smallest image encoder version you have?