xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

Findings not compatible with other work? #27

Closed iumyx2612 closed 1 year ago

iumyx2612 commented 1 year ago

In Figure 1 of the paper, the authors state that MSA flattens the loss landscape. However, in "When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations", the authors state that ViT converges at sharp local minima, which contradicts your findings.

Furthermore, the authors claim that "the magnitude of the Hessian eigenvalues of ViT is smaller than that of ResNet during the training phase" (Fig. 1 again). However, in the paper above, the Hessian dominant eigenvalue of ViT is "orders of magnitude larger than that of ResNet" (Table 1).

Loss landscape and Hessian max eigenvalue from your work: [image]

Loss landscape and Hessian max eigenvalue from the other work: [image]
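For reference, the "Hessian max eigenvalue" in both figures can be estimated without forming the full Hessian. Below is a minimal PyTorch sketch (not code from this repo) using power iteration on Hessian-vector products; the function name, loss function, batch, and iteration count are illustrative assumptions.

```python
import torch

def max_hessian_eigenvalue(model, loss_fn, inputs, targets, iters=20):
    """Estimate the dominant Hessian eigenvalue of the loss w.r.t. the parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Start power iteration from a random unit-norm direction.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / norm for x in v]

    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product via a second backward pass.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        # Rayleigh quotient v^T H v (v has unit norm).
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    return eig
```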

xxxnell commented 1 year ago

Hi @iumyx2612,

The difference between the observations in “When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations” and our results comes down to the following points. Most importantly, our experimental settings are quite different from theirs. They use a training configuration that is significantly different from standard practice (e.g., they do not use strong data augmentations), while we use a DeiT-style configuration. Since the DeiT-style configuration is the de facto standard for ViT training, we believe our insights apply to a larger number of studies. In addition, they compare ViT-B (#Param: 87M) with ResNet-152 (#Param: 60M). Since ViT is parameter-efficient but computation-inefficient, we believe it is better to compare ViT with a ResNet of the same or smaller parameter size. We also use NLL + L2 as the objective instead of NLL alone.
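For concreteness, here is a minimal sketch of what "NLL + L2" means in this context, assuming a standard PyTorch classifier; the function name and `l2_coeff` value are illustrative assumptions (in practice the L2 term is usually applied as weight decay in the optimizer).

```python
import torch.nn.functional as F

def nll_plus_l2(model, inputs, targets, l2_coeff=1e-4):
    logits = model(inputs)
    nll = F.cross_entropy(logits, targets)  # NLL of a softmax classifier
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return nll + 0.5 * l2_coeff * l2
```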

Other observations are consistent with our claims. For example: a box blur (the simplest low-pass filter) also flattens the loss landscape; hybrid models have flat loss landscapes; ViT is robust against data/adversarial perturbations; and so on.
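As an illustration of the box-blur point, a box blur can be written as a depthwise convolution with a uniform kernel. The sketch below is not from this repo; the kernel size is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def box_blur(x, kernel_size=3):
    # x: (N, C, H, W). A uniform k x k kernel averages each spatial
    # neighbourhood, i.e. the simplest low-pass filter.
    c = x.shape[1]
    weight = torch.full((c, 1, kernel_size, kernel_size),
                        1.0 / kernel_size ** 2,
                        device=x.device, dtype=x.dtype)
    return F.conv2d(x, weight, padding=kernel_size // 2, groups=c)
```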

iumyx2612 commented 1 year ago

So MSA alone cannot flatten the loss landscape, and it should be considered together with other factors (like the objective function and data augmentation), right?

xxxnell commented 1 year ago

As you correctly pointed out, experimental settings, such as data augmentations, losses, datasets, domains, and early/later phase of training, can certainly affect the behaviours of self-attention. But on the other hand, in my humble opinion, the role of data augmentation is underexplored. I would like to leave the detailed investigation for future work.

iumyx2612 commented 1 year ago

Thank you for sharing, much appreciated!