xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

Question about harmonizing Convs with MSAs #35

Closed iumyx2612 closed 1 year ago

iumyx2612 commented 1 year ago

In the paper, the authors state that "MSAs are low-pass filters, but Convs are high-pass filters". They propose harmonizing Convs with MSAs by replacing Conv blocks with MSA blocks, preferably at the end of a stage. They also describe a design that "uses Convs in early stages and MSAs in late stages".
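
To make the low-pass / high-pass distinction concrete, here is a minimal, stdlib-only sketch on a 1D toy signal. It is not the paper's Fourier analysis of feature maps: the uniform averaging kernel is only a stand-in for attention-like spatial averaging, the Laplacian-like kernel is a hand-picked high-pass filter (a learned Conv kernel need not look like it), and the helper names `high_freq_ratio` and `circular_filter` are hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return |X[k]| for k = 0..N-1 using a naive O(N^2) DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def high_freq_ratio(signal):
    """Fraction of non-DC spectral energy that lies in the upper half of the band."""
    n = len(signal)
    mags = dft_magnitudes(signal)
    # k = 1..n//2 are the unique positive frequencies of a real-valued signal.
    energy = [m * m for m in mags[1 : n // 2 + 1]]
    total = sum(energy)
    half = len(energy) // 2
    return sum(energy[half:]) / total if total > 0 else 0.0


def circular_filter(signal, kernel):
    """Apply a 3-tap kernel with circular padding."""
    n = len(signal)
    return [
        kernel[0] * signal[(i - 1) % n]
        + kernel[1] * signal[i]
        + kernel[2] * signal[(i + 1) % n]
        for i in range(n)
    ]


# Toy signal: one low-frequency and one high-frequency sinusoid, equal amplitude.
n = 64
signal = [
    math.sin(2 * math.pi * 2 * t / n) + math.sin(2 * math.pi * 28 * t / n)
    for t in range(n)
]

# Assumption: uniform averaging as a stand-in for a low-pass operation.
averaged = circular_filter(signal, [1 / 3, 1 / 3, 1 / 3])
# Assumption: a hand-picked Laplacian-like kernel as a high-pass operation.
differenced = circular_filter(signal, [-1.0, 2.0, -1.0])

for name, s in [("input", signal), ("averaged", averaged), ("differenced", differenced)]:
    print(f"{name:12s} high-frequency energy ratio: {high_freq_ratio(s):.3f}")
```

In this toy setup, averaging lowers the share of high-frequency energy and the difference kernel raises it, which is the sense of "low-pass" and "high-pass" used in the quote above.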

Sorry in advance if the following questions are dumb.

In the late stages, adding Convs after MSAs should decrease the model's performance, right? The late stages produce low-frequency features, and adding Convs there would suppress those features. I ran an experiment: I trained a hierarchical ViT, Segformer, then replaced the last-stage 1x1 Conv in the decoder with a 3x3 Conv (pic below).

[image]

I trained the model on a polyp segmentation dataset. The results are reported below:

| Model | Dice Score |
| --- | --- |
| Segformer | 84.95 |
| Modified Segformer | 84.61 |
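
For readers unfamiliar with the metric, the Dice score for binary masks is 2|P ∩ T| / (|P| + |T|). The sketch below is a generic, stdlib-only illustration; the thread does not say how the scores above were computed (thresholding, per-image versus dataset-level averaging), so the function name `dice_score`, the `eps` smoothing term, and the flat 0/1 mask format are all assumptions.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient for two flat binary masks given as sequences of 0/1.

    eps is a small smoothing term (an assumption here) that avoids division
    by zero when both masks are empty.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (denominator + eps)


# Toy example: one overlapping pixel, two predicted pixels, one target pixel.
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(f"Dice: {dice_score(pred, target):.4f}")
```

Multiplying the result by 100 gives a percentage on the same scale as the table above.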

I haven't tested whether replacing the 1x1 Conv in stages 1-2 with a 3x3 Conv increases performance, but is the conclusion I drew above correct?

xxxnell commented 1 year ago

Hi @iumyx2612,

Thank you for reaching out again, and sorry for the late reply. I've been extremely busy since I recently relocated to another country.

Regarding the question you raised, I would say you are probably correct, even though it is not that straightforward.

iumyx2612 commented 1 year ago

Thank you!!