xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

Question about Figure 2(a) #36

Closed · iumyx2612 closed this issue 1 year ago

iumyx2612 commented 1 year ago

[Figure 2(a) from the paper] Looking at this figure, I see that the early layers of ResNet have many low-frequency components, and the deeper ResNet goes, the more high-frequency components it contains. Am I interpreting this figure correctly?

If I'm right, doesn't this contradict the popular belief and common visualizations that early layers in a ConvNet tend to learn high-frequency components?

xxxnell commented 1 year ago

Hi @iumyx2612,

I believe a more appropriate interpretation of this figure is that the convolutional layers consistently amplify high-frequency components. Consequently, a significant amount of high-frequency information remains in the representations of deeper layers. My emphasis was on this trend of changes.
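
If it helps, here is a minimal sketch of how the high-frequency content of a feature map could be measured with a 2D FFT. The function name, the shapes, and the 0.5 (×Nyquist) cutoff are assumptions made for illustration; this is not the exact code behind Figure 2(a):

```python
import torch


def high_freq_log_amplitude(feature_map: torch.Tensor, cutoff: float = 0.5) -> torch.Tensor:
    """Mean log amplitude of the high-frequency band of a (B, C, H, W) feature map.

    `cutoff` is a normalized frequency radius (1.0 = Nyquist) above which a
    component counts as "high frequency". Illustrative sketch only.
    """
    # 2D FFT over the spatial dims, shifted so the zero frequency sits at the center.
    fft = torch.fft.fftshift(torch.fft.fft2(feature_map), dim=(-2, -1))
    amplitude = fft.abs()

    # Radial frequency grid normalized to [0, 1], where 1 is the Nyquist frequency.
    h, w = feature_map.shape[-2:]
    fy = torch.fft.fftshift(torch.fft.fftfreq(h)) * 2
    fx = torch.fft.fftshift(torch.fft.fftfreq(w)) * 2
    radius = torch.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)

    # Average the log amplitude over the high-frequency band, channels, and batch.
    high = radius > cutoff
    return torch.log(amplitude[..., high] + 1e-8).mean()


# Placeholder tensors standing in for feature maps captured with forward hooks
# from an early and a deep layer of a ResNet.
early = torch.randn(1, 64, 56, 56)
deep = torch.randn(1, 512, 7, 7)
print(high_freq_log_amplitude(early), high_freq_log_amplitude(deep))
```

Hooking a measure like this onto the outputs of successive blocks is one way to see whether the high-frequency content keeps being amplified with depth, which is the trend I was referring to.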

iumyx2612 commented 1 year ago

> Hi @iumyx2612,
>
> I believe a more appropriate interpretation of this figure is that the convolutional layers consistently amplify high-frequency components. Consequently, a significant amount of high-frequency information remains in the representations of deeper layers. My emphasis was on this trend of changes.

So, looking at the figure, we can't say that the early layers in a ConvNet contain many low-frequency components, right?

xxxnell commented 1 year ago

It is not easy to directly compare the amount of low-frequency components across layers at different depths. Instead, I'd say that having many low-frequency components in a representation does not necessarily mean the layers are learning low-frequency information in this case.
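
To make this concrete: as far as I remember, the quantity in the figure is relative within each feature map (the log-amplitude difference from the zero-frequency component of that same map), so it describes how a layer reshapes its own spectrum rather than the absolute amount of low-frequency content, and absolute amounts are therefore hard to compare across depths. A rough sketch, with assumed details rather than the actual plotting code:

```python
import torch


def delta_log_amplitude(feature_map: torch.Tensor) -> torch.Tensor:
    """Log amplitude near the Nyquist frequency minus log amplitude at f = 0,
    averaged over the batch and channels of a (B, C, H, W) feature map.

    Because the value is taken relative to the f = 0 amplitude of the same
    feature map, it is a per-layer relative measure. Illustrative sketch only.
    """
    fft = torch.fft.fftshift(torch.fft.fft2(feature_map), dim=(-2, -1))
    log_amp = torch.log(fft.abs() + 1e-8)

    h, w = feature_map.shape[-2:]
    dc = log_amp[..., h // 2, w // 2]   # zero-frequency component after fftshift
    hf = log_amp[..., 0, 0]             # component near the Nyquist frequency
    return (hf - dc).mean()
```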

iumyx2612 commented 1 year ago

> It is not easy to directly compare the amount of low-frequency components across layers at different depths. Instead, I'd say that having many low-frequency components in a representation does not necessarily mean the layers are learning low-frequency information in this case.

Oooh, thank you, I got it!

iumyx2612 commented 1 year ago

I was reading the MogaNet paper and am having difficulty understanding some parts of it. Some of the explanations contradict my understanding, so I would like to discuss them with you (through email or another platform, since it's not relevant to this GitHub repo), as some parts of that paper take inspiration from your work. Is that okay? Sorry if this bothers you.

xxxnell commented 1 year ago

I am happy to do it. Please feel free to send me an email at namuk.park@gmail.com or park.namuk@gene.com.