What factors determine if a model or a layer behaves like a low- or high-pass filter?

xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"

Apache License 2.0

798 stars 77 forks source link

Your paper reports that generally MSAs behave like low-pass filters (shape-biased) and Convs behave like high-pass filters (texture-biased). Recently I came across papers that report shape bias in their findings and I wonder about your thoughts on them.

Low-pass filters (shape-biased)

MSAs (global or local)
Large kernel Convs (ConvNext or RepLKNet)
ResNet or DeiT trained on stylized ImageNet
Masked Image Modeling

High-pass filters (texture-biased)

3x3 Convs in ResNet

These findings suggest that factors affecting the behavior can be spatial aggregation, kernel size, training data, or training procedures. It seems that only 3x3 Convs behave like high-pass filters or I may be missing something. In your another thread you mentioned that group size also makes a difference. I wonder how ResNet and ResNeXt differ and I suppose ResNeXt is also texture-biased.

I will appreciate your insights on what factors determine if a model or a layer behaves like a low- or high-pass filter.

xxxnell / how-do-vits-work

What factors determine if a model or a layer behaves like a low- or high-pass filter? #31