youweiliang / evit

Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations
Apache License 2.0

cls_attn = attn[:, :, 0, 1:] #18

Bugs-Bunny01 opened this issue 1 year ago

Bugs-Bunny01 commented 1 year ago

Hi, thank you for your contributions to open source, but I have a question. What does the code cls_attn = attn[:, :, 0, 1:] mean?

Bugs-Bunny01 commented 1 year ago

What do the four dimensions represent?

youweiliang commented 1 year ago

Hi, thanks for your interest. attn has 4 dimensions, representing [batch size, number of heads, number of tokens, number of tokens]. The last two dimensions form the attention map over the tokens. The line cls_attn = attn[:, :, 0, 1:] extracts the class attention, i.e., the attention values of the class token over the image tokens. Since the class token is the first token, its attention row is at index 0 and the image tokens occupy columns 1 onward, which gives the slice [0, 1:] into the attention maps.
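A minimal sketch of what this slice does, using NumPy and made-up toy sizes (the repo itself uses PyTorch tensors, but the indexing is identical):

```python
import numpy as np

# Hypothetical sizes: batch size, number of heads, number of tokens
# (N includes the class token as token 0).
B, H, N = 2, 3, 5

# Toy attention tensor of shape [B, H, N, N], normalized along the last
# axis so each row sums to 1, like a softmax output.
attn = np.random.rand(B, H, N, N)
attn = attn / attn.sum(axis=-1, keepdims=True)

# Row 0 is the class token's attention; columns 1: are the image tokens.
cls_attn = attn[:, :, 0, 1:]

print(cls_attn.shape)  # (B, H, N - 1): one score per head per image token
```

The result has shape [batch size, number of heads, number of image tokens], so averaging it over the head dimension gives a single importance score per image token, which is what EViT uses to decide which tokens to keep.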

Bugs-Bunny01 commented 1 year ago

Thanks for your reply. So in attn, index 0 in the third dimension corresponds to the cls token, and the slice 1: in the fourth dimension corresponds to all the other tokens except cls?

youweiliang commented 1 year ago

Yes!