Closed: waitingcheung closed this issue 1 year ago
Hi @waitingcheung,
The question you pointed out is certainly interesting and important. Unfortunately, however, it is difficult to answer in a word. I believe spatial aggregation, depthwise separable operations, large kernel sizes, and shape-biased training datasets all increase the shape bias of neural nets. I would also expect vanilla CNNs and vanilla ViTs to sit near the two extremes, among the most texture-biased and shape-biased models, respectively. Thus, we can place the shape biases of vanilla CNNs, hybrid models, and vanilla ViTs on a spectrum (instead of dichotomizing them).
Your paper reports that, in general, MSAs behave like low-pass filters (shape-biased) while Convs behave like high-pass filters (texture-biased). Recently I came across papers that report shape bias in their findings, and I wonder about your thoughts on them.
- Low-pass filters (shape-biased)
- High-pass filters (texture-biased)
These findings suggest that the factors affecting this behavior can be spatial aggregation, kernel size, training data, or the training procedure. It seems that only 3x3 Convs behave like high-pass filters, or I may be missing something. In another thread you mentioned that group size also makes a difference. I wonder how ResNet and ResNeXt differ in this respect; I suppose ResNeXt is also texture-biased.
I would appreciate your insights on what factors determine whether a model or a layer behaves like a low- or high-pass filter.
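For what it's worth, the low-/high-pass distinction above can be probed with a toy Fourier measurement. The sketch below (numpy only, not the analysis pipeline from the paper) uses a 3x3 box filter as a crude stand-in for the spatial-averaging effect of MSAs and a Laplacian kernel as a stand-in for a texture-sensitive Conv, then compares how much spectral energy each leaves in the high-frequency band of a white-noise input. The kernels and the cutoff radius are my own illustrative choices, not from the paper:

```python
import numpy as np

def high_freq_energy_ratio(img, cutoff_frac=0.25):
    """Fraction of spectral energy above a radial frequency cutoff."""
    F = np.fft.fftshift(np.fft.fft2(img))
    energy = np.abs(F) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    cutoff = cutoff_frac * min(h, w)  # illustrative cutoff radius
    return energy[r > cutoff].sum() / energy.sum()

def conv2d(img, kernel):
    """Naive 'same' 2-D convolution with zero padding."""
    kh, kw = kernel.shape
    padded = np.pad(img, kh // 2)
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))  # white noise: roughly flat spectrum

# 3x3 box filter: spatial averaging, a crude analogue of MSA-style
# spatial aggregation (expected low-pass behavior).
box = np.ones((3, 3)) / 9.0
# Laplacian kernel: a crude analogue of a texture-sensitive Conv
# (expected high-pass behavior).
lap = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)

base = high_freq_energy_ratio(x)
low = high_freq_energy_ratio(conv2d(x, box))
high = high_freq_energy_ratio(conv2d(x, lap))
print(low < base < high)  # averaging suppresses, Laplacian amplifies high freqs
```

Running this prints `True`: the box filter pushes energy toward low frequencies while the Laplacian concentrates it at high frequencies, which is the sense in which a layer can be called low- or high-pass here.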