xinyu1205 / recognize-anything

Open-source and strong foundation image recognition models.
https://recognize-anything.github.io/
Apache License 2.0

Image-Text Retrieval with RAM++ #119

Open bkamz27 opened 7 months ago

bkamz27 commented 7 months ago

I've been working with your model for image text retrieval, and I'm encountering some challenges in replicating the results in Table 7 of your paper.

I've tried using image embeddings (from RAM++) and text embeddings (from CLIP ViT-B/16). When I run image-text retrieval on Flickr30k with them, I don't get good results. Essentially, I want to compute the cosine similarity between a caption's text embedding and an image embedding, and then calculate the recall numbers. I tried this approach with BLIP and was able to reproduce the numbers in the table. Although BLIP gets good results, it is slower and does not provide tags like RAM++ does, so I would really like to make RAM++ work.
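For context, here is a minimal sketch of the evaluation I have in mind, assuming precomputed, L2-normalized image and text embeddings (recall_at_k and txt2img are my own illustrative names, not from the repo):

    import torch

    def recall_at_k(image_embeds, text_embeds, txt2img, k=(1, 5, 10)):
        # image_embeds: (N_img, D), text_embeds: (N_txt, D), both L2-normalized
        # txt2img[i] = index of the ground-truth image for caption i
        sims = text_embeds @ image_embeds.t()         # cosine similarities (N_txt, N_img)
        ranks = sims.argsort(dim=1, descending=True)  # image indices, best match first
        gt = torch.as_tensor(txt2img).unsqueeze(1)    # (N_txt, 1)
        hits = ranks.eq(gt)                           # True where the ground-truth image sits
        return {f"R@{n}": hits[:, :n].any(dim=1).float().mean().item() for n in k}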

Could you provide some guidance on how you performed your image retrieval tasks?

I appreciate the work you've put into this project and any guidance you can provide. Thank you.

xinyu1205 commented 7 months ago

Hi, thanks for your attention. You need to let the image embeddings interact with the text embeddings in the tagging_head, just as in the tagging process. Please feel free to share any progress or errors.

bkamz27 commented 7 months ago

Thank you for your reply. I have tried to understand the tagging_head process in more detail. Here is what I have:

  1. Image embeddings using RAM++: image_embeds = model.image_proj(model.visual_encoder(image))

  2. Text embeddings using the build_text_embed() function, which creates CLIP embeddings: batch_text_embed = build_text_embed(model_clip, prompt)

  3. tagging_head takes the image and text embeddings and outputs an “alignment_embedding”. One question about the mode: should it stay as "tagging" for image-text alignment?

    # bs = batch size; image_atts = attention mask over image tokens (defined as in the repo)
    bs = image_embeds.shape[0]
    image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)
    # project the CLIP text embeddings into the tagging space and tile them per image
    batch_text_embed = torch.nn.functional.relu(
        model.wordvec_proj(batch_text_embed.to(model.label_embed.dtype)))
    batch_text_embed = batch_text_embed.unsqueeze(0).repeat(bs, 1, 1)
    # text queries cross-attend over the image embeddings, as in the tagging process
    alignment_embedding = model.tagging_head(
            encoder_embeds=batch_text_embed,
            encoder_hidden_states=image_embeds,
            encoder_attention_mask=image_atts,
            return_dict=False,
            mode='tagging',
        )
    # one logit per (image, text) pair
    alignment_logits = model.fc(alignment_embedding[0]).squeeze(-1)

    To do image retrieval, we need some kind of similarity score to rank the images by. I assume we should be able to obtain it from the “alignment_embedding” (or the alignment_logits above). Is that correct? Could you provide some guidance or an example of how to implement this for retrieval tasks? I have sketched my current plan below.
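For concreteness, here is a hypothetical sketch of what I am attempting: scoring a batch of candidate images against a single caption's CLIP embedding, reusing the names from the snippet above (score_images_for_caption is my own name, and treating the alignment logit as a ranking score is my assumption):

    import torch

    @torch.no_grad()
    def score_images_for_caption(model, image_embeds, image_atts, caption_embed):
        # image_embeds / image_atts: precomputed as in step 1; caption_embed: (1, clip_dim)
        text_embed = torch.nn.functional.relu(
            model.wordvec_proj(caption_embed.to(model.label_embed.dtype)))
        text_embed = text_embed.unsqueeze(0).repeat(image_embeds.size(0), 1, 1)
        alignment_embedding = model.tagging_head(
            encoder_embeds=text_embed,
            encoder_hidden_states=image_embeds,
            encoder_attention_mask=image_atts,
            return_dict=False,
            mode='tagging',
        )
        # (bs,) logits -- my assumption: higher logit = better image-caption match
        return model.fc(alignment_embedding[0]).squeeze(-1).squeeze(-1)

I would then sort the images by these scores for each caption and compute the recall numbers from that ranking.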

xinyu1205 commented 7 months ago

Hi, you can refer to the code below. Please feel free to share any progress or errors.

[screenshot of code attached]