xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

What factors determine if a model or a layer behaves like a low- or high-pass filter? #31

Closed: waitingcheung closed this issue 1 year ago

waitingcheung commented 1 year ago

Your paper reports that, in general, MSAs behave like low-pass filters (shape-biased) and Convs behave like high-pass filters (texture-biased). Recently I came across several papers that report shape bias in their findings, and I wonder about your thoughts on them.

Low-pass filters (shape-biased)

High-pass filters (texture-biased)

These findings suggest that the factors affecting this behavior can be spatial aggregation, kernel size, training data, or training procedures. It seems that only 3x3 Convs behave like high-pass filters, or I may be missing something. In another thread you mentioned that group size also makes a difference. I wonder how ResNet and ResNeXt differ, and I suppose ResNeXt is also texture-biased.

I would appreciate your insights on what factors determine whether a model or a layer behaves like a low- or high-pass filter.
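In case it helps to pin down what I mean by low-/high-pass behavior, below is a rough, unofficial sketch of how one could probe a single layer: compare the fraction of high-frequency energy in its output versus its input. The helper names (`high_freq_energy`, `filter_ratio`), the cutoff, and the toy layers are my own illustrative choices, not the repository's Fourier-analysis code.

```python
# Rough sketch (not the paper's analysis code): probe whether a layer
# attenuates or amplifies the high-frequency content of its input.
import torch
import torch.nn as nn


def high_freq_energy(x: torch.Tensor, cutoff: float = 0.5) -> torch.Tensor:
    """Fraction of spectral energy above cutoff * Nyquist for feature maps x of shape (B, C, H, W)."""
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    power = spec.abs() ** 2
    H, W = x.shape[-2:]
    fy = torch.fft.fftshift(torch.fft.fftfreq(H)).abs().view(H, 1)  # cycles/pixel, in [0, 0.5]
    fx = torch.fft.fftshift(torch.fft.fftfreq(W)).abs().view(1, W)
    high = torch.sqrt(fy ** 2 + fx ** 2) > cutoff * 0.5             # mask of high-frequency bins
    return power[..., high].sum() / power.sum()


def filter_ratio(layer: nn.Module, x: torch.Tensor) -> float:
    """>1: the layer boosts high frequencies (high-pass-like); <1: it suppresses them (low-pass-like)."""
    with torch.no_grad():
        return (high_freq_energy(layer(x)) / high_freq_energy(x)).item()


# Sanity check on layers with known behavior.
blur = nn.Conv2d(1, 1, 3, padding=1, bias=False)
blur.weight.data.fill_(1.0 / 9.0)                                   # box blur: a low-pass filter
lap = nn.Conv2d(1, 1, 3, padding=1, bias=False)
lap.weight.data = torch.tensor([[[[0., -1., 0.],
                                  [-1., 4., -1.],
                                  [0., -1., 0.]]]])                  # Laplacian: a high-pass filter

x = torch.randn(8, 1, 32, 32)
print("box blur :", filter_ratio(blur, x))   # well below 1
print("laplacian:", filter_ratio(lap, x))    # well above 1
```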

xxxnell commented 1 year ago

Hi @waitingcheung,

The question you raise is certainly interesting and important. Unfortunately, however, it is difficult to answer briefly. I believe spatial aggregation, depthwise separable operations, large kernel sizes, and shape-biased training datasets all increase the shape bias of neural nets. I would also expect vanilla CNNs and vanilla ViTs to be among the most texture-biased and shape-biased models, respectively. Thus, we can place vanilla CNNs, hybrid models, and vanilla ViTs on a spectrum of shape bias (instead of dichotomizing them).
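To make the group-size point a bit more concrete, one rough (unofficial) way to compare the dense 3x3 Conv in ResNet with the grouped 3x3 Conv in ResNeXt is to apply the `filter_ratio` probe sketched earlier in this thread to pretrained torchvision layers. The choice of `layer2[1].conv2` and the white-noise input are illustrative assumptions; the paper's analysis instead looks at the Fourier spectra of real feature maps.

```python
# Unofficial probe of the 3x3 convs discussed above, reusing filter_ratio
# from the earlier sketch in this thread.
import torch
from torchvision.models import resnet50, resnext50_32x4d

models = {
    "ResNet-50 (dense 3x3)": resnet50(weights="IMAGENET1K_V1"),
    "ResNeXt-50 32x4d (grouped 3x3)": resnext50_32x4d(weights="IMAGENET1K_V1"),
}
for name, model in models.items():
    conv = model.eval().layer2[1].conv2               # a stride-1 3x3 conv inside stage 2
    x = torch.randn(8, conv.in_channels, 28, 28)      # white-noise stand-in, not real activations
    print(f"{name}: groups={conv.groups}, ratio={filter_ratio(conv, x):.3f}")
```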