xinyu1205 / recognize-anything

Open-source and strong foundation image recognition models.
https://recognize-anything.github.io/
Apache License 2.0

Image-Text Retrieval with RAM++ #119

Open bkamz27 opened 7 months ago

bkamz27 commented 7 months ago

I've been working with your model for image text retrieval, and I'm encountering some challenges in replicating the results in Table 7 of your paper.

I've tried using image embeddings (from RAM++) and text embeddings (from CLIP ViT-B/16). When I run image-text retrieval on Flickr30k with them, I don't get good results. Essentially, I want to compute the cosine similarity between a caption's text embedding and an image embedding, and then calculate the recall numbers. I tried this approach with BLIP and was able to reproduce the numbers in the table. Although BLIP gets good results, it is slower and does not provide tags like RAM++ does, so I would really like to make RAM++ work.
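For context, here is a minimal sketch of the evaluation I have in mind, assuming precomputed, L2-normalized image and text embeddings (recall_at_k and txt2img are my own illustrative names, not from the repo):

    import torch

    def recall_at_k(image_embeds, text_embeds, txt2img, k=(1, 5, 10)):
        # image_embeds: (N_img, D), text_embeds: (N_txt, D), both L2-normalized
        # txt2img[i] = index of the ground-truth image for caption i
        sims = text_embeds @ image_embeds.t()         # cosine similarities (N_txt, N_img)
        ranks = sims.argsort(dim=1, descending=True)  # image indices, best match first
        gt = torch.as_tensor(txt2img).unsqueeze(1)    # (N_txt, 1)
        hits = ranks.eq(gt)                           # True where the ground-truth image sits
        return {f"R@{n}": hits[:, :n].any(dim=1).float().mean().item() for n in k}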

Could you provide some guidance on how you performed your image retrieval tasks?

I appreciate the work you've put into this project and any guidance you can provide. Thank you.

xinyu1205 commented 7 months ago

Hi, thanks for your attention. You need to let the image embeddings interact with the text embeddings in the tagging_head, just as in the tagging process. Please feel free to share any progress or errors.

bkamz27 commented 7 months ago

Thank you for your reply. I have tried to understand the tagging_head process in more detail. Here is what I have:

  1. Image embeddings using RAM++: image_embeds = model.image_proj(model.visual_encoder(image))

  2. Text embeddings using the build_text_embed() function, which creates CLIP embeddings: batch_text_embed = build_text_embed(model_clip, prompt)

  3. tagging_head takes the image and text embeddings and outputs an “alignment_embedding”. One question about the mode: should it stay as "tagging" for image-text alignment?

    # bs = batch size; image_atts = attention mask over image tokens (defined as in the repo)
    bs = image_embeds.shape[0]
    image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)
    # project the CLIP text embeddings into the tagging space and tile them per image
    batch_text_embed = torch.nn.functional.relu(
        model.wordvec_proj(batch_text_embed.to(model.label_embed.dtype)))
    batch_text_embed = batch_text_embed.unsqueeze(0).repeat(bs, 1, 1)
    # text queries cross-attend over the image embeddings, as in the tagging process
    alignment_embedding = model.tagging_head(
            encoder_embeds=batch_text_embed,
            encoder_hidden_states=image_embeds,
            encoder_attention_mask=image_atts,
            return_dict=False,
            mode='tagging',
        )
    # one logit per (image, text) pair
    alignment_logits = model.fc(alignment_embedding[0]).squeeze(-1)

    To do image retrieval, we need some kind of similarity score to rank the images by. I assume we should be able to obtain it from the “alignment_embedding” (or the alignment_logits above). Is that correct? Could you provide some guidance or an example of how to implement this for retrieval tasks? I have sketched my current plan below.
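For concreteness, here is a hypothetical sketch of what I am attempting: scoring a batch of candidate images against a single caption's CLIP embedding, reusing the names from the snippet above (score_images_for_caption is my own name, and treating the alignment logit as a ranking score is my assumption):

    import torch

    @torch.no_grad()
    def score_images_for_caption(model, image_embeds, image_atts, caption_embed):
        # image_embeds / image_atts: precomputed as in step 1; caption_embed: (1, clip_dim)
        text_embed = torch.nn.functional.relu(
            model.wordvec_proj(caption_embed.to(model.label_embed.dtype)))
        text_embed = text_embed.unsqueeze(0).repeat(image_embeds.size(0), 1, 1)
        alignment_embedding = model.tagging_head(
            encoder_embeds=text_embed,
            encoder_hidden_states=image_embeds,
            encoder_attention_mask=image_atts,
            return_dict=False,
            mode='tagging',
        )
        # (bs,) logits -- my assumption: higher logit = better image-caption match
        return model.fc(alignment_embedding[0]).squeeze(-1).squeeze(-1)

I would then sort the images by these scores for each caption and compute the recall numbers from that ranking.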

xinyu1205 commented 7 months ago

Hi, you can refer to the code below. Please feel free to share any progress or errors.

[screenshot of code attached]