@scarleatt Hi! During inference, the context encoder computes a language-guided self-attention map for the input image with shape HWxHW. You can cache this entire map once, then index it by a point's spatial location to get that point's attention over all other positions. Reshape the result to HxW and you can visualize it.
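A minimal sketch of what this could look like, assuming `attn` is the cached HWxHW attention tensor and `H`, `W` are the feature-map dimensions (the variable names and sizes below are hypothetical, not the repo's actual API):

```python
import torch
import matplotlib.pyplot as plt

# Assumed setup: `attn` is the cached language-guided self-attention map
# from the context encoder, shape (H*W, H*W); H and W are the spatial
# dimensions of the feature map it was computed on.
H, W = 32, 32                      # example feature-map size
attn = torch.rand(H * W, H * W)    # placeholder for the cached attention map
attn = attn.softmax(dim=-1)        # row-normalize if not already normalized

# Pick a query point (y, x) on the feature grid and take its row,
# i.e. the attention from that point to every other position.
y, x = 10, 20
point_attn = attn[y * W + x]       # shape (H*W,)

# Reshape to H x W and visualize as a heatmap.
point_attn = point_attn.reshape(H, W).cpu().numpy()
plt.imshow(point_attn, cmap='jet')
plt.colorbar()
plt.title(f'Attention map for point ({y}, {x})')
plt.savefig('point_attention.png')
```

If the point is given in image coordinates, you would first map it down to the feature grid (and, if you want, upsample the resulting HxW map back to the image resolution for overlaying).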
Thank you very much for the quick reply, I'll give it a try.
Thank you for your great work.
Could you tell me how to visualize the attention map for a point in Fig. 4, or share the code for it?