xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

Why is feed-forward not present in the paper and the code? #17

Closed · waitingcheung closed 2 years ago

waitingcheung commented 2 years ago

The original ViT and many ViT variants have feed-forward networks (FFNs) in their architectures. I noticed that the feed-forward network is neither mentioned in the paper nor implemented in the code of AlterNet. It would be interesting to learn the intuition behind this design choice.
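
For context, by feed-forward I mean the standard per-token MLP block used in ViT, roughly like the following (a minimal sketch for illustration; the dimension names and dropout placement are my own, not taken from any particular implementation):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Standard ViT feed-forward (MLP) block: applied to each token
    independently, so it only mixes channel information."""
    def __init__(self, dim, hidden_dim, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # x: (batch, num_tokens, dim); LayerNorm and the residual connection
        # are usually applied by the surrounding Transformer block.
        return self.net(x)
```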

xxxnell commented 2 years ago

Hi @waitingcheung, thanks for the question. Both the feed-forward networks (FFNs, or MLPs) in ViTs and Conv blocks mix channel information in a data-agnostic way (their weights do not depend on the input, unlike self-attention), so they behave similarly to each other: for example, they pass high-frequency information, increase the feature map variance, and sharpen the loss landscape (see, e.g., Fig. 8 and Fig. 9). Therefore, in AlterNet, ResNet blocks play the role of the FFN blocks.
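
As a small sanity check of the "channel mixing" point: the per-token Linear layers inside a ViT FFN compute exactly the same operation as 1×1 convolutions over the token grid, which is one way to see why FFN blocks and Conv blocks can stand in for each other. (This toy comparison is my own illustration, not code from the paper or this repository.)

```python
import torch
import torch.nn as nn

# A per-token Linear layer and a 1x1 convolution are the same channel-mixing operation.
dim, hidden = 64, 128
x = torch.randn(2, 196, dim)                        # (batch, tokens, channels), 14x14 token grid

linear = nn.Linear(dim, hidden)
conv1x1 = nn.Conv2d(dim, hidden, kernel_size=1)

# Copy the Linear weights into the 1x1 convolution so the outputs are comparable.
with torch.no_grad():
    conv1x1.weight.copy_(linear.weight.view(hidden, dim, 1, 1))
    conv1x1.bias.copy_(linear.bias)

y_linear = linear(x)                                 # (2, 196, hidden)
x_grid = x.transpose(1, 2).reshape(2, dim, 14, 14)   # tokens arranged on the 14x14 grid
y_conv = conv1x1(x_grid).reshape(2, hidden, 196).transpose(1, 2)

print(torch.allclose(y_linear, y_conv, atol=1e-5))   # True: both only mix channels
```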