Open bkamz27 opened 7 months ago
Hi, thanks for your attention. You need to have the image embeddings interact with the text embeddings in the tagging_head, just as in the tagging process. Please feel free to share if you make any progress or run into errors.
Thank you for your reply. I tried to understand the tagging_head process in more detail. What I have is:

Image embeddings using RAM++:

```python
image_embeds = model.image_proj(self.visual_encoder(image))
```

Text embeddings using the build_text_embed() function, which creates CLIP embeddings:

```python
batch_text_embed = build_text_embed(model_clip, prompt)
```

tagging_head takes the image and text embeddings and outputs an "alignment_embedding". One question about the mode: should it stay as "tagging" for image-text alignment?
```python
batch_text_embed = torch.nn.functional.relu(
    self.wordvec_proj(batch_text_embed.to(self.label_embed.dtype)))
batch_text_embed = batch_text_embed.unsqueeze(0).repeat(bs, 1, 1)

alignment_embedding = self.tagging_head(
    encoder_embeds=batch_text_embed,
    encoder_hidden_states=image_embeds,
    encoder_attention_mask=image_atts,
    return_dict=False,
    mode='tagging',
)
alignment_logits = self.fc(alignment_embedding[0]).squeeze(-1)
```
In order to do image retrieval, we need some kind of similarity score so we can rank the images by it. I assume it should be obtainable from the "alignment_embedding". Is that correct? Could you provide some guidance or an example of how to implement this for retrieval tasks?
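Concretely, what I imagine is something like this minimal sketch, assuming one scalar alignment logit per image-caption pair (the dummy values below stand in for real outputs of the fc head):

```python
import torch

# Dummy logits standing in for alignment_logits over a gallery of 4 images
# for a single caption; in practice each entry would come from
# fc(alignment_embedding) as in the snippet above (hypothetical usage).
alignment_logits = torch.tensor([0.3, 2.1, -0.5, 1.2])

# Sigmoid maps each logit to an alignment probability in [0, 1]; ranking by
# the raw logit gives the same order, since sigmoid is monotonic.
scores = torch.sigmoid(alignment_logits)
ranking = torch.argsort(scores, descending=True)
print(ranking.tolist())  # → [1, 3, 0, 2], best-matching image indices first
```

Is ranking by this alignment score the intended way to do retrieval, or is there a separate similarity head I'm missing?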
Hi, you can refer to this code. Please feel free to share if you make any progress or run into errors.
I've been working with your model for image-text retrieval, and I'm encountering some challenges replicating the results in Table 7 of your paper.
I've tried using image embeddings (from RAM++) and text embeddings (from CLIP ViT-B/16). When I run an image-text retrieval task on Flickr30k, I don't get good results. I basically want to compute a cosine similarity between a caption's text embedding and an image embedding, then calculate the recall numbers. I tried this approach with BLIP and was able to reproduce the results in the table. Even though BLIP gets good results, it is slower and doesn't provide tags the way RAM++ does. I would really like to make this work.
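For reference, this is the kind of computation I mean. A toy sketch with hand-made embeddings standing in for the real RAM++ image features and CLIP text features (caption i is the ground-truth match for image i):

```python
import torch

# Toy embeddings in place of real RAM++/CLIP features; 3 images, 3 captions.
img_embeds = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
txt_embeds = torch.tensor([[0.9, 0.1], [0.1, 0.9], [1.0, 0.8]])
gt = torch.arange(3)  # ground-truth image index for each caption

# L2-normalize so that a plain dot product equals cosine similarity.
img_n = torch.nn.functional.normalize(img_embeds, dim=-1)
txt_n = torch.nn.functional.normalize(txt_embeds, dim=-1)
sim = txt_n @ img_n.T  # (num_captions, num_images) similarity matrix

def recall_at_k(sim, gt, k):
    # Fraction of captions whose ground-truth image appears in the top-k.
    topk = sim.topk(k, dim=-1).indices
    return (topk == gt[:, None]).any(dim=-1).float().mean().item()

print(recall_at_k(sim, gt, 1))  # → 1.0 on this toy data
```

This works for BLIP embeddings but not for the RAM++/CLIP combination, which makes me think the two embedding spaces aren't directly comparable with cosine similarity.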
Could you provide some guidance on how you performed your image retrieval tasks?
I appreciate the work you've put into this project and any guidance you can provide. Thank you,