xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

Lesion study #41

Closed liguopeng0923 closed 11 months ago

liguopeng0923 commented 11 months ago

Hi @xxxnell ,

I find it hard to understand the conclusions of the lesion study. For example, ViT does not seem to satisfy your conclusion (i.e., that the latter MSAs are more important):

[attached figure: lesion study results]
xxxnell commented 11 months ago

Hi @liguopeng0923 ,

Thank you for the great question. For shallow Vision Transformers, and for Swin Transformers (which are serial connections of shallow Transformers), the statement that the later self-attention layers are more important holds. For deep Vision Transformers, however, the middle self-attention layers play the most prominent role in determining accuracy, so they matter more than the last few self-attention layers.
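For reference, here is a minimal sketch (not the exact code from this repo) of how one could run such a per-layer lesion: ablate one MSA block at a time by zeroing its residual branch and measure the accuracy drop. The model name and `val_loader` (your ImageNet validation DataLoader) are placeholders.

```python
import copy
import torch
import torch.nn as nn
import timm

class ZeroAttn(nn.Module):
    """Stand-in for an MSA module: its residual branch contributes nothing."""
    def forward(self, x, *args, **kwargs):
        return torch.zeros_like(x)

@torch.no_grad()
def top1_accuracy(model, loader, device=None):
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Example backbone; val_loader is assumed to be defined by you.
base = timm.create_model("vit_small_patch16_224", pretrained=True)
baseline = top1_accuracy(base, val_loader)

# Lesion one MSA layer at a time and record the accuracy drop.
for i in range(len(base.blocks)):
    lesioned = copy.deepcopy(base)
    lesioned.blocks[i].attn = ZeroAttn()
    acc = top1_accuracy(lesioned, val_loader)
    print(f"block {i:2d}: top-1 {acc:.4f} (drop {baseline - acc:.4f})")
```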

This behavior can be attributed to a phenomenon known as attention collapse: in the later layers, the self-attention maps collapse into a few fixed patterns that barely depend on the query token. As a result, the last several self-attention layers tend to lose their importance, particularly for classification tasks. In other words, Vision Transformers can run into a scaling problem with respect to depth; DeepViT also tried to address this issue.
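To see attention collapse quantitatively, one rough diagnostic (again, just a sketch, not code from this repo) is the average cosine similarity between the attention distributions of different query tokens in a layer; values near 1.0 mean the maps barely depend on the query. It assumes you have already captured the per-layer attention maps, e.g. with forward hooks on each block's attention module.

```python
import torch
import torch.nn.functional as F

def attention_collapse_score(attn: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between the attention distributions
    of different query tokens. `attn` has shape (batch, heads, queries, keys).
    A score close to 1.0 indicates the maps barely depend on the query token,
    i.e., the layer's attention has collapsed."""
    a = F.normalize(attn, p=2, dim=-1)          # unit-normalize each query's distribution
    sim = a @ a.transpose(-2, -1)               # (batch, heads, queries, queries)
    q = sim.shape[-1]
    off_diag = sim.sum(dim=(-2, -1)) - sim.diagonal(dim1=-2, dim2=-1).sum(-1)
    return (off_diag / (q * (q - 1))).mean()    # average over batch and heads

# Usage (per_layer_attn_maps is a hypothetical list of captured attention tensors):
# scores = [attention_collapse_score(a) for a in per_layer_attn_maps]
```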

Happy holidays!