Hi, @waitingcheung. Thanks for the question. Both the feed-forward networks (FFNs, or MLPs) in ViTs and Conv blocks mix channel information without exploiting data information, so they behave similarly to each other: for example, they pass high-frequency information, increase the feature map variance, and sharpen loss landscapes (see, e.g., Fig. 8 and Fig. 9). Therefore, in AlterNet, ResNet blocks play the role of FFN blocks.
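To make "mixing channel information" concrete, here is a minimal pure-Python sketch of a position-wise FFN. It is not the AlterNet code: the helper names (`linear`, `ffn`, `rand_matrix`), the sizes, and the use of ReLU instead of the GELU commonly used in ViTs are all assumptions for illustration. It only shows that the FFN applies the same MLP to each token independently, so perturbing one token leaves the outputs of the other tokens unchanged.

```python
import random

def linear(x, w, b):
    """Apply y = W x + b to a single channel vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(x):
    # ReLU is a simplification; ViT FFNs typically use GELU.
    return [max(0.0, v) for v in x]

def ffn(tokens, w1, b1, w2, b2):
    """Position-wise FFN: the same two-layer MLP is applied to every token."""
    return [linear(relu(linear(t, w1, b1)), w2, b2) for t in tokens]

def rand_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of uniform random weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
channels, hidden, num_tokens = 4, 8, 3  # illustrative sizes only
w1, b1 = rand_matrix(hidden, channels, rng), [0.0] * hidden
w2, b2 = rand_matrix(channels, hidden, rng), [0.0] * channels

tokens = rand_matrix(num_tokens, channels, rng)
out = ffn(tokens, w1, b1, w2, b2)

# Perturb only token 0 and recompute.
perturbed = [[v + 1.0 for v in tokens[0]]] + tokens[1:]
out_perturbed = ffn(perturbed, w1, b1, w2, b2)

# Tokens 1 and 2 are untouched: the FFN mixes channels, not positions.
unchanged = out[1:] == out_perturbed[1:]
```

Here `unchanged` evaluates to `True`, since each output token depends only on the channels of the corresponding input token.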
The original ViT and many ViT variants include feed-forward layers in their architectures. I noticed that feed-forward layers are neither mentioned in the paper nor implemented in the code of AlterNet. It would be interesting to learn about the intuition behind this design choice.