xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

Lesion study #41

Closed liguopeng0923 closed 6 months ago

liguopeng0923 commented 6 months ago

Hi @xxxnell ,

I find it hard to understand the conclusions about the lesion study. For example, ViT does not satisfy your conclusion (i.e., that the later MSAs are more important).

[screenshot: lesion study results for ViT]
xxxnell commented 6 months ago

Hi @liguopeng0923 ,

Thank you for the great question. For shallow Vision Transformers, and even for Swin Transformers (which are serial connections of shallow Transformers), the statement that the later self-attention layers are more important holds precisely. For deep Vision Transformers, however, the middle self-attention layers play the most prominent role in determining accuracy, outweighing the last few self-attention layers.
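
If it helps, here is a minimal sketch (not code from this repository) of how such a lesion study can be run on a timm-style ViT: each MSA sub-layer is zeroed out in turn, so its residual branch contributes nothing, and the accuracy of the lesioned model is measured. The model name, the data loader, and the `blocks`/`attn` attribute layout are assumptions about the timm ViT implementation.

```python
import copy
import torch
import torch.nn as nn
import timm


class ZeroAttn(nn.Module):
    """Stand-in for an MSA sub-layer: returns zeros, so the attention
    residual branch is effectively removed (lesioned)."""
    def forward(self, x, *args, **kwargs):
        return torch.zeros_like(x)


@torch.no_grad()
def lesion_msa(model_name, loader, device="cuda"):
    """Ablate one MSA layer at a time and report top-1 accuracy of each lesioned model."""
    base = timm.create_model(model_name, pretrained=True).to(device).eval()
    results = {}
    for i in range(len(base.blocks)):                 # assumes a `blocks` list of Transformer blocks
        model = copy.deepcopy(base)
        model.blocks[i].attn = ZeroAttn()             # lesion the i-th self-attention layer
        correct = total = 0
        for imgs, labels in loader:
            preds = model(imgs.to(device)).argmax(dim=-1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
        results[i] = correct / total
    return results
```

Plotting `results` per layer index is what produces the importance curves discussed above: the layers whose removal hurts accuracy the most are the "important" ones.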

This behavior can be linked to a phenomenon known as attention collapse: the later self-attention maps collapse into fixed patterns that depend only weakly on the query token. As a result, the last several self-attention layers tend to lose their importance, particularly in classification tasks. This observation suggests that Vision Transformers can run into a scaling problem with respect to depth; DeepViT also tried to address this issue.
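
As a rough diagnostic (a sketch, not code from the paper), one can quantify how strongly a layer's attention map depends on the query token: if the attention rows of different query tokens are nearly identical, the layer has effectively collapsed. The function below assumes you have extracted attention maps of shape `[batch, heads, queries, keys]` for a given layer.

```python
import torch


def attention_collapse_score(attn):
    """
    attn: attention maps of shape [batch, heads, queries, keys] from one layer.
    Returns the mean pairwise cosine similarity between the attention rows of
    different query tokens; values near 1 mean the map barely depends on the query.
    """
    a = torch.nn.functional.normalize(attn, dim=-1)     # unit-norm each attention row
    sim = a @ a.transpose(-1, -2)                       # [B, H, Q, Q] row-vs-row cosine similarities
    q = sim.shape[-1]
    off_diag = sim.sum(dim=(-1, -2)) - sim.diagonal(dim1=-2, dim2=-1).sum(-1)
    return (off_diag / (q * (q - 1))).mean().item()
```

Computing this score per layer typically shows it rising toward the later layers of a deep ViT, which is consistent with the lesion-study result that those layers contribute less.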

Happy holidays!