Closed iumyx2612 closed 1 year ago
Hi @iumyx2612,
The difference between the observations in “When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations” and our results comes down to the following aspects. Most importantly, our experimental settings are quite different from theirs. They used training configurations that are significantly different from standard practice (e.g., they do not use strong data augmentations), while we use a DeiT-style configuration. Since the DeiT-style configuration is the de facto standard in ViT training, we believe our insights apply to a larger number of studies. In addition, they compare ViT-B (#Param: 87M) and ResNet-152 (#Param: 60M). Since ViT is parameter-efficient but computation-inefficient, we believe it is better to compare ViT with a ResNet of the same or smaller parameter size. We also use NLL + L2 as the objective instead of NLL alone.
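For reference, the NLL + L2 objective mentioned above can be sketched as follows. This is a minimal NumPy illustration of the general idea (softmax cross-entropy plus a weight-decay penalty); the function name and the `weight_decay` value are our own assumptions, not from the paper:

```python
import numpy as np

def nll_plus_l2(logits, label, params, weight_decay=1e-4):
    """NLL (softmax cross-entropy) plus an L2 penalty on the parameters."""
    # Numerically stable log-softmax
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    nll = -log_probs[label]
    # L2 regularization summed over all parameter arrays
    l2 = 0.5 * weight_decay * sum(float((p ** 2).sum()) for p in params)
    return float(nll + l2)
```

In practice the L2 term is usually folded into the optimizer as weight decay rather than added to the loss explicitly.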
Other observations are consistent with our claims. For example: a box blur (the simplest low-pass filter) also flattens the loss landscape; hybrid models have flat losses; ViT is robust against data/adversarial perturbations; and so on.
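As a side note on the box-blur point: a box blur is just a moving average, so applying it attenuates high-frequency components of a signal. A minimal 1-D sketch (our own illustration, not code from the paper):

```python
import numpy as np

def box_blur_1d(signal, k=3):
    # Moving average with window size k: the simplest low-pass filter.
    kernel = np.ones(k) / k
    return np.convolve(signal, kernel, mode="same")
```

Applied to a noisy curve, the output has visibly lower variance, i.e., the curve is flattened.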
So MSA alone cannot flatten the loss landscape, and should be considered together with other factors (such as the objective function and data augmentation), right?
As you correctly pointed out, experimental settings, such as data augmentations, losses, datasets, domains, and the early/later phases of training, can certainly affect the behaviour of self-attention. On the other hand, in my humble opinion, the role of data augmentation is underexplored. I would like to leave the detailed investigation for future work.
Thank you for your sharing, much appreciated!
In Figure 1 of the paper, the authors state that MSA flattens the loss landscape. However, in “When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations,” the authors state that ViT converges at sharp local minima, which contradicts your findings?
Furthermore, the authors claim that "the magnitude of the Hessian eigenvalues of ViT is smaller than that of ResNet during the training phase" (Fig. 1 again). However, in the above paper, the Hessian dominant eigenvalue of ViT is "orders of magnitude larger than that of ResNet" (Table 1).
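For context, the dominant Hessian eigenvalue these papers report is typically estimated with power iteration. A minimal sketch on an explicit matrix (our own illustration; in practice the Hessian is never materialized, and the `hessian @ v` products are computed via autodiff as Hessian-vector products):

```python
import numpy as np

def dominant_eigenvalue(hessian, n_iters=100, seed=0):
    # Power iteration: repeatedly apply the matrix and renormalize;
    # converges to the eigenvalue of largest magnitude.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(hessian.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        hv = hessian @ v
        v = hv / np.linalg.norm(hv)
    return float(v @ hessian @ v)  # Rayleigh quotient at convergence
```

A large value of this estimate indicates a sharp minimum; a small value indicates a flat one, which is what the two papers disagree about for ViT.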
Loss landscape and Hessian max eigenvalue in your work: ![image](https://user-images.githubusercontent.com/69593462/204477928-4ccdb707-369e-495f-a1a9-5720d7064cc6.png)
Loss landscape and Hessian max eigenvalue in the other work: ![image](https://user-images.githubusercontent.com/69593462/204478030-88120f95-85d7-4f4b-953f-5be5074875aa.png)