xinyu1205 / recognize-anything

Open-source and strong foundation image recognition models.
https://recognize-anything.github.io/
Apache License 2.0

Some questions about the Grad-CAM visualizations shown in Fig. 7 of the Tag2Text paper #70

Open SKBL5694 opened 1 year ago

SKBL5694 commented 1 year ago

I used Grad-CAM to visualize the same image as in the paper, but I get a strange result that differs from Fig. 7. When I use the word "cat" to compute the heatmap, the result looks like this: [image]. When I change the word to "siamese" (a kind of cat), the result looks fine: [image].

Sorry, while writing this issue I realized I was using the RAM weights rather than the Tag2Text (T2T) ones, even though the Grad-CAM figure is in the T2T paper. Still, the result also seems odd with RAM; can you tell me what causes this? I don't remember whether T2T also uses a Swin Transformer. If it does not, I suspect the Swin backbone is the cause; beyond that, all I can think of is the difference between the CLIP text encoder and the text encoder you trained yourself in T2T.

xinyu1205 commented 1 year ago

Did you obtain the Grad-CAM through the image-tag recognition decoder on the tagging task? Figure 7 of Tag2Text is obtained from the backward gradient of the image-tag interaction encoder on the generation task. I have also found that the image-tag recognition decoder's Grad-CAM is often a meaningless scatter plot, even when it predicts high logits. Normally, with good recognition performance, its Grad-CAM should be very accurate; I haven't found the reason yet.
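For reference, the weighting step both comments refer to is the standard Grad-CAM recipe. A minimal sketch follows, assuming `activations` and `gradients` have already been captured at the visual encoder and reshaped to a (B, C, H, W) grid; none of these names come from the Tag2Text code.

import torch
import torch.nn.functional as F

def grad_cam(activations: torch.Tensor, gradients: torch.Tensor) -> torch.Tensor:
    # Weight each channel by its spatially averaged gradient, sum over
    # channels, keep only positive evidence, and normalize to [0, 1].
    weights = gradients.mean(dim=(2, 3), keepdim=True)  # (B, C, 1, 1)
    cam = F.relu((weights * activations).sum(dim=1))    # (B, H, W)
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)
    return cam / cam.amax(dim=(1, 2), keepdim=True).clamp(min=1e-8)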

SKBL5694 commented 1 year ago

It seems I was indeed running Grad-CAM on the recognition task, because your released code does not enable the generation task for RAM; I have added the generation task to RAM in the same way as T2T and will test it, thank you. I also share your view that "with good recognition performance, its grad-cam should be very accurate". However, I have run into a similar situation in other discussions about Grad-CAM on Swin-Transformer-based models (pytorch-grad-cam/issues/84), and unfortunately those discussions were inconclusive. My guess is that the patch-merging operation in the Swin Transformer makes the features lose their traditional spatial structure, but that cannot explain why Grad-CAM is sometimes accurate, so I am still confused. Thanks for your reply; I'll try it again. Thank you again for your excellent work and kind reply.
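One workaround shown in the pytorch-grad-cam examples for Swin backbones is to pass a reshape_transform that restores the 2-D token grid before the CAM is computed. A minimal sketch, where the default 7x7 grid is an assumption that must match the hooked Swin stage and input size:

import torch

def swin_reshape_transform(tensor: torch.Tensor, height: int = 7, width: int = 7) -> torch.Tensor:
    # Swin blocks emit tokens as (B, H*W, C); Grad-CAM expects a spatial
    # (B, C, H, W) map. The grid shrinks at every patch-merging step, so
    # height/width must match the stage being hooked.
    result = tensor.reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2).contiguous()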

xinyu1205 commented 1 year ago

Thank you for your interest and your kind words. You are welcome to provide feedback if you run into more issues.

SKBL5694 commented 1 year ago

I think I am having trouble computing gradients via backward through the image-tag interaction encoder. My approach is to define a hook in advance and register it at the location I need; the general idea is as follows:

def backward_hook(module, grad_input, grad_output):
    # Capture the gradient flowing out of the visual encoder.
    global gradients
    print('Backward hook running...')
    gradients = grad_output
    print(f'Gradients size: {gradients[0].size()}')

# Keep the handle under a different name so it does not shadow the hook function.
hook_handle = model.visual_encoder.register_full_backward_hook(backward_hook)
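Grad-CAM also needs the forward activations of the same module. A minimal sketch of the matching forward hook, under the same assumption that model.visual_encoder returns a plain tensor:

activations = None

def forward_hook(module, inputs, output):
    # Capture the visual encoder's output features (assumed to be a tensor).
    global activations
    activations = output.detach()

fwd_handle = model.visual_encoder.register_forward_hook(forward_hook)

# After one forward pass plus a .backward() on a scalar score, `activations`
# and `gradients[0]` can be combined with the usual Grad-CAM weighting.
# Remove both hooks afterwards: fwd_handle.remove(); hook_handle.remove().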

Then I call backward on the category whose gradient I need. For example, with the recognition decoder I can do this for any class logit: logits[0, 252-1].backward() (where 252 is the line number of the word "cat" in ram_tag_list). But the interaction encoder's output is not a scalar; it has shape (#beam, max_length, #features), e.g. (3, 40, 768). You mentioned that the gradients for Fig. 7 come from the interaction encoder: does that mean I should call backward on this output to compute the gradient, or is some other operation needed? I also tried calling .backward() on the output of the text generation decoder, but since self.text_decoder is an instance of the official transformers library, its generate method does not track gradients, so I cannot call backward() on that output. I hope you can give me some ideas; I want to reproduce results similar to Fig. 7.
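Two ways this could be attempted (a sketch, not confirmed against the Tag2Text code; the tensor and keyword names below follow the BLIP-style decoder that Tag2Text builds on and are assumptions): reduce the interaction-encoder output to a scalar before backward, or bypass generate() with a teacher-forced forward pass, since a HuggingFace-style decoder returns a differentiable loss when labels is given.

# Option A: reduce the (3, 40, 768) interaction-encoder output to a scalar.
score = interaction_output[0].sum()  # e.g. sum the first beam's features
score.backward()

# Option B: teacher-forced forward pass instead of generate(), which runs
# without gradient tracking. A BLIP-style BertLMHeadModel returns a loss
# when `labels` is supplied; all names here are assumptions.
outputs = model.text_decoder(
    input_ids=caption_ids,               # tokenized target caption
    attention_mask=caption_mask,
    encoder_hidden_states=image_embeds,  # visual features to cross-attend
    encoder_attention_mask=image_atts,
    labels=caption_ids,
)
outputs.loss.backward()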

pribadihcr commented 8 months ago

Hi @SKBL5694, have you resolved it?