youweiliang / evit

Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations
Apache License 2.0

cls_attn = attn[:, :, 0, 1:] #18

Bugs-Bunny01 opened this issue 1 year ago

Bugs-Bunny01 commented 1 year ago

Hi, thank you for your contributions to open source, but I have a question. What does the code cls_attn = attn[:, :, 0, 1:] mean?

Bugs-Bunny01 commented 1 year ago

What do the four dimensions represent?

youweiliang commented 1 year ago

Hi, thanks for your interest. attn has 4 dimensions, representing [batch size, number of heads, number of tokens, number of tokens]. The last two dimensions form the attention map over the tokens. The line cls_attn = attn[:, :, 0, 1:] extracts the class attention, i.e., the attention values of the class token over the image tokens. Since the class token is the first token, its attention row is at index 0 and the image tokens occupy columns 1 onward, which gives the slice [0, 1:] into the attention maps.
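A minimal sketch of what this slice does, using NumPy and made-up toy sizes (the repo itself uses PyTorch tensors, but the indexing is identical):

```python
import numpy as np

# Hypothetical sizes: batch size, number of heads, number of tokens
# (N includes the class token as token 0).
B, H, N = 2, 3, 5

# Toy attention tensor of shape [B, H, N, N], normalized along the last
# axis so each row sums to 1, like a softmax output.
attn = np.random.rand(B, H, N, N)
attn = attn / attn.sum(axis=-1, keepdims=True)

# Row 0 is the class token's attention; columns 1: are the image tokens.
cls_attn = attn[:, :, 0, 1:]

print(cls_attn.shape)  # (B, H, N - 1): one score per head per image token
```

The result has shape [batch size, number of heads, number of image tokens], so averaging it over the head dimension gives a single importance score per image token, which is what EViT uses to decide which tokens to keep.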

Bugs-Bunny01 commented 1 year ago

Thanks for your reply. So in attn, index 0 in the third dimension corresponds to the cls token, and the slice 1: in the fourth dimension corresponds to all the other tokens except cls?

youweiliang commented 1 year ago

Yes!