Open Bugs-Bunny01 opened 1 year ago
What do the four dimensions represent?
Hi,
Thanks for your interest.
The attn tensor has 4 dimensions: [batch size, number of heads, number of tokens, number of tokens]. The last two dimensions are the attention map for each image: entry [i, j] is the attention of token i to token j.
The line cls_attn = attn[:, :, 0, 1:] extracts the class attention (the attention values of the class token to the image tokens). Because the class token is the first token, the slice into the attention map is [0, 1:].
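To make the shapes concrete, here is a minimal sketch of that slice. It uses NumPy rather than the repository's own tensors, and the sizes (2 images, 3 heads, 4 image tokens) and the random attention values are assumptions chosen only for illustration. The indexing syntax is the same.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
batch_size, num_heads, num_patches = 2, 3, 4
num_tokens = 1 + num_patches  # class token at index 0, then the image tokens

# Random scores passed through a softmax over the last axis,
# so every row of each attention map sums to 1.
rng = np.random.default_rng(0)
scores = rng.normal(size=(batch_size, num_heads, num_tokens, num_tokens))
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Third dimension, index 0: the row belonging to the class token.
# Fourth dimension, 1: : every column except the class token itself.
cls_attn = attn[:, :, 0, 1:]

print(attn.shape)      # (2, 3, 5, 5)
print(cls_attn.shape)  # (2, 3, 4)
```

Indexing with the integer 0 drops the third dimension, so cls_attn ends up with one value per image token for every head of every image.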
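Thank you for your reply. So, in other words, the 0 in the third dimension of attn stands for the cls token, and the 1: in the fourth dimension stands for all the tokens other than cls?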
Yes!
Hi, thank you for your contributions to open source. I have a question: what does the line cls_attn = attn[:, :, 0, 1:] mean?