xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

Lesion study #41

Closed liguopeng0923 closed 6 months ago

liguopeng0923 commented 6 months ago

Hi @xxxnell ,

I find it hard to understand the conclusions about the lesion study. For example, ViT does not satisfy your conclusion (i.e., that the later MSAs are more important).

[screenshot: lesion study results for ViT]
xxxnell commented 6 months ago

Hi @liguopeng0923 ,

Thank you for the great question. For shallow Vision Transformers, and even for Swin Transformers (which are serial connections of shallow Transformers), the statement that the later self-attention layers are more important holds precisely. For deep Vision Transformers, however, the middle self-attention layers play the most prominent role in determining accuracy, outweighing the last few self-attention layers.
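
If it helps, here is a minimal sketch (not code from this repository) of how such a lesion study can be run on a timm-style ViT: each MSA sub-layer is zeroed out in turn, so its residual branch contributes nothing, and the accuracy of the lesioned model is measured. The model name, the data loader, and the `blocks`/`attn` attribute layout are assumptions about the timm ViT implementation.

```python
import copy
import torch
import torch.nn as nn
import timm


class ZeroAttn(nn.Module):
    """Stand-in for an MSA sub-layer: returns zeros, so the attention
    residual branch is effectively removed (lesioned)."""
    def forward(self, x, *args, **kwargs):
        return torch.zeros_like(x)


@torch.no_grad()
def lesion_msa(model_name, loader, device="cuda"):
    """Ablate one MSA layer at a time and report top-1 accuracy of each lesioned model."""
    base = timm.create_model(model_name, pretrained=True).to(device).eval()
    results = {}
    for i in range(len(base.blocks)):                 # assumes a `blocks` list of Transformer blocks
        model = copy.deepcopy(base)
        model.blocks[i].attn = ZeroAttn()             # lesion the i-th self-attention layer
        correct = total = 0
        for imgs, labels in loader:
            preds = model(imgs.to(device)).argmax(dim=-1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
        results[i] = correct / total
    return results
```

Plotting `results` per layer index is what produces the importance curves discussed above: the layers whose removal hurts accuracy the most are the "important" ones.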

This behavior can be linked to a phenomenon known as attention collapse: the later self-attention maps collapse into fixed patterns that depend only weakly on the query token. As a result, the last several self-attention layers tend to lose their importance, particularly in classification tasks. This observation suggests that Vision Transformers can run into a scaling problem with respect to depth; DeepViT also tried to address this issue.
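
As a rough diagnostic (a sketch, not code from the paper), one can quantify how strongly a layer's attention map depends on the query token: if the attention rows of different query tokens are nearly identical, the layer has effectively collapsed. The function below assumes you have extracted attention maps of shape `[batch, heads, queries, keys]` for a given layer.

```python
import torch


def attention_collapse_score(attn):
    """
    attn: attention maps of shape [batch, heads, queries, keys] from one layer.
    Returns the mean pairwise cosine similarity between the attention rows of
    different query tokens; values near 1 mean the map barely depends on the query.
    """
    a = torch.nn.functional.normalize(attn, dim=-1)     # unit-norm each attention row
    sim = a @ a.transpose(-1, -2)                       # [B, H, Q, Q] row-vs-row cosine similarities
    q = sim.shape[-1]
    off_diag = sim.sum(dim=(-1, -2)) - sim.diagonal(dim1=-2, dim2=-1).sum(-1)
    return (off_diag / (q * (q - 1))).mean().item()
```

Computing this score per layer typically shows it rising toward the later layers of a deep ViT, which is consistent with the lesion-study result that those layers contribute less.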

Happy holidays!