SKBL5694 opened this issue 1 year ago
Did you use the image-tag recognition decoder on the tagging task to obtain the Grad-CAM? Figure 7 of Tag2Text is obtained from the backward gradient of the image-tag interaction encoder on the generation task. I have also found that the image-tag recognition decoder's Grad-CAM is often a meaningless scatter plot, even when it predicts high logits. Normally, with good recognition performance, its Grad-CAM should be very accurate; I haven't found the reason yet.
It seems that I am indeed running Grad-CAM on the recognition task, because your code does not enable the generation task for RAM. I have added the generation task to RAM in the same way as Tag2Text and will test it, thank you. I also share your opinion that "with good recognition performance, its Grad-CAM should be very accurate". However, I encountered a similar situation in other discussions about Grad-CAM on Swin-Transformer-based models (e.g. pytorch-grad-cam/issues/84), but unfortunately those discussions were fruitless. I suspect the patch merging operation in Swin Transformer makes the features lose their traditional spatial structure, but that cannot explain why Grad-CAM is sometimes accurate, so I am also confused. Thanks for your reply; I'll try it again. Thank you again for your excellent work and kind reply.
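For reference, the pytorch-grad-cam Swin tutorial works around the lost spatial structure with a reshape_transform that rebuilds the 2D token grid before the gradients are pooled. A minimal sketch along those lines (the 7x7 grid size and the target layer path are assumptions that depend on input resolution and backbone stage, and the classifier-style target assumes the model's forward returns tagging logits):

from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

def reshape_transform(tensor, height=7, width=7):
    # (B, H*W, C) token sequence -> (B, C, H, W) spatial map Grad-CAM can pool.
    result = tensor.reshape(tensor.size(0), height, width, tensor.size(2))
    result = result.transpose(2, 3).transpose(1, 2)
    return result

# Assumed layer path: the last block of the last Swin stage in the backbone.
target_layers = [model.visual_encoder.layers[-1].blocks[-1].norm1]
cam = GradCAM(model=model, target_layers=target_layers,
              reshape_transform=reshape_transform)
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(251)])  # e.g. the index of "cat"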
Thank you for your interest and your kind words. Feel free to provide feedback if you run into more issues.
I think I have some problem computing gradients through the image-tag interaction encoder during the backward pass. My approach is to pre-define a hook and register it at the location I need. The general idea is as follows:
gradients = None

def backward_hook(module, grad_input, grad_output):
    # Cache the gradient flowing back through this module for Grad-CAM.
    global gradients
    print('Backward hook running...')
    gradients = grad_output
    print(f'Gradients size: {gradients[0].size()}')

# Keep the handle under a different name so the hook function is not shadowed.
hook_handle = model.visual_encoder.register_full_backward_hook(backward_hook)
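To turn the cached gradients into a heatmap, Grad-CAM also needs the forward activations of the same layer, which a matching forward hook can cache; the remaining steps are a gradient-weighted sum over channels. A minimal sketch (the 7x7 token grid at the end is an assumption about the backbone's output resolution):

import torch

activations = None

def forward_hook(module, input, output):
    # Cache the layer's forward output to weight it by the gradients later.
    global activations
    activations = output

model.visual_encoder.register_forward_hook(forward_hook)

# After running a forward and backward pass:
grads = gradients[0]                       # (B, N, C) token gradients
weights = grads.mean(dim=1, keepdim=True)  # channel weights = pooled gradients
cam = (weights * activations).sum(dim=-1)  # (B, N) per-token importance
cam = torch.relu(cam)                      # keep positive contributions only
cam = cam.reshape(1, 7, 7)                 # assumed 7x7 token grid for upsampling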
Then I call backward() on the logit of the category whose gradient I need. For example, with the earlier recognition decoder I can run logits[0, 252-1].backward() for any class (where 252 is the line number of the word "cat" in ram_tag_list).

Now, however, the interaction encoder's output is not a scalar but a tensor of shape (#beam, max_length, #features), e.g. (3, 40, 768). You mentioned that the gradients for Figure 7 are obtained from the interaction encoder; does that mean I should call backward() on this output to compute the gradient, or is some other operation needed?

In addition, I also tried calling .backward() on the output of the text generation decoder, but since self.text_decoder is an instance of the official transformers library, its generate() method does not track gradients, so I cannot call backward() on its output. I hope you can give me some ideas; I want to obtain results similar to Figure 7.
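For what it's worth, both obstacles have common workarounds: reduce the non-scalar encoder output to a scalar before calling backward(), and, since generate() in the transformers library runs without gradient tracking, replay the already-generated tokens through a plain decoder forward pass so autograd records a graph. A rough sketch with hypothetical names (interaction_output, generated_ids, image_embeds, and image_atts stand in for whatever the forward pass produces; the decoder call assumes a BLIP-style BertLMHeadModel signature):

import torch

# 1) Non-scalar output: pick a scalar first, e.g. beam 0 at decoding step t.
t = 5                                   # hypothetical step of interest
score = interaction_output[0, t].sum()  # (#beam, max_length, #features) -> scalar
model.zero_grad()
score.backward(retain_graph=True)       # backward hooks now receive real grads

# 2) generate() tracks no gradients, so replay the tokens with teacher forcing.
with torch.enable_grad():
    out = model.text_decoder(
        input_ids=generated_ids,             # tokens from an earlier generate() call
        encoder_hidden_states=image_embeds,  # visual features fed to the decoder
        encoder_attention_mask=image_atts,
        return_dict=True,
    )
    # logits at step t predict the token at step t + 1.
    out.logits[0, t, generated_ids[0, t + 1]].backward()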
Hi @SKBL5694, have you resolved it?
I used Grad-CAM to visualize the same image as in the paper, but I get a weird result that differs from Figure 7. When I use the word "cat" to compute the heatmap, the result is like this: [attached image] When I change the word to "siamese" (a kind of cat), the result looks OK: [attached image]

Sorry, after raising this issue I realized that I was using the RAM weights rather than Tag2Text's, while the Grad-CAM figure is from the Tag2Text paper. But the result still seems weird with RAM; can you tell me what caused this? I don't remember whether Swin Transformer is also used in Tag2Text. If it is not, I suspect that is the cause; other than that, all I can think of is the difference between the CLIP text encoder and the text encoder you trained yourselves in Tag2Text.