xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

model size #15

Closed forever10086 closed 2 years ago

forever10086 commented 2 years ago

Hello, I have a question about why you use ViT-S and ViT-Ti while the counterpart is ResNet-50; these model sizes are not equal. I know you have explained this on OpenReview. I want to know whether ViT-B's Hessian eigenvalue spectrum looks like ViT-Ti's in your paper, just stretched to the right.

xxxnell commented 2 years ago

Hi, @forever10086. Please refer to Fig C.7.c. The figure shows that the magnitude of the Hessian eigenvalues of ViT-S is even smaller than that of ResNet-50. I do not have results for ViT-B, but I would guess that the magnitude of the Hessian eigenvalues of ViT-B will be smaller still, because ViT-B has more heads and a higher embedding dimension. Please also refer to Fig C.5 and Fig C.6.
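To make the measurement concrete: the Hessian eigenvalues discussed here can be estimated without ever materializing the full Hessian, e.g. by power iteration on Hessian-vector products. Below is a minimal PyTorch sketch of that idea (not the exact code used for the paper); `model`, `criterion`, and the `(inputs, targets)` batch are placeholders you would supply.

```python
import torch

def top_hessian_eigenvalue(model, criterion, inputs, targets, iters=20):
    """Estimate the largest-magnitude Hessian eigenvalue of the loss
    w.r.t. the model parameters via power iteration on HVPs."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = criterion(model(inputs), targets)
    # Keep the graph so the gradients can be differentiated a second time.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit-norm start vector, stored as one tensor per parameter.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]

    eigenvalue = 0.0
    for _ in range(iters):
        # Hessian-vector product: the grad of (g . v) w.r.t. params is H v.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        # Rayleigh quotient v^T H v (v is unit-norm) approximates the eigenvalue.
        eigenvalue = sum((h * u).sum() for h, u in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    return eigenvalue
```

Repeating this with deflation, or using stochastic Lanczos quadrature as Hessian-spectrum studies commonly do, recovers the spectrum rather than just the top eigenvalue.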

forever10086 commented 2 years ago

OK, I got it. But are the numbers of heads and the per-head dimensions in Fig C.5 and Fig C.6 from the same model, like ViT-S? I'm not sure the effect of more heads and a larger dimension can offset the effect of a bigger model like ViT-B. Besides, I know from the spectrum that the ViT Hessian eigenvalues are more concentrated than ResNet's, so I guess a bigger model would follow this principle.

xxxnell commented 2 years ago

You're right. As you might expect, Fig C.5 and C.6 report results for models with various head counts and embedding dimensions; all other hyperparameters are the same. I would like to leave a detailed investigation of large models for future work.
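For anyone who wants to reproduce that kind of sweep, here is a minimal sketch of how such variants could be built. It uses timm's `VisionTransformer` rather than this repository's model definitions, and the image/patch sizes are illustrative assumptions, not the paper's exact configuration:

```python
from timm.models.vision_transformer import VisionTransformer

def make_vit(embed_dim, num_heads, num_classes=100):
    # Everything except embed_dim / num_heads stays fixed across variants.
    return VisionTransformer(
        img_size=32,            # assumption: a CIFAR-style input
        patch_size=4,
        embed_dim=embed_dim,    # one of the two hyperparameters being varied
        depth=12,
        num_heads=num_heads,    # the other varied hyperparameter
        num_classes=num_classes,
    )

# Sweep head count at a fixed embedding dim, and embedding dim at a fixed head count.
head_sweep = [make_vit(embed_dim=384, num_heads=h) for h in (3, 6, 12)]
dim_sweep = [make_vit(embed_dim=d, num_heads=6) for d in (192, 384, 768)]
```

Note that `embed_dim` must be divisible by `num_heads`, which constrains which combinations are valid in a sweep like this.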

forever10086 commented 2 years ago

OK, it was just a small thing. I think this paper is very good; I like it.

xxxnell commented 2 years ago

Thank you for the kind words :)