yuhuan-wu / P2T

[TPAMI22] Pyramid Pooling Transformer for Scene Understanding

Pooling layers vs. Conv layers #16

Closed clelouch closed 1 year ago

clelouch commented 1 year ago

Thanks for your great work.

I found in Table 7 that the average pooling layer performs much better than both max pooling and depthwise convolution. However, a depthwise conv layer can also act as an average pooling layer when all of its weights are equal, so the learnable conv version should in principle perform at least as well, which contradicts the experimental results. A minimal PyTorch sketch of the equivalence (channel count and shapes are just for illustration):
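
```python
import torch
import torch.nn as nn

C, k = 64, 7                        # illustrative channel count and kernel size
x = torch.randn(1, C, 28, 28)

# Depthwise conv: groups == channels, one k x k filter per channel.
dw = nn.Conv2d(C, C, kernel_size=k, stride=k, groups=C, bias=False)
with torch.no_grad():
    dw.weight.fill_(1.0 / (k * k))  # uniform weights -> per-channel averaging

pool = nn.AvgPool2d(kernel_size=k, stride=k)

print(torch.allclose(dw(x), pool(x), atol=1e-5))  # True: identical outputs
```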

Can you explain the result?

Thank you.

yuhuan-wu commented 1 year ago

Yes. Intuitively, in most cases a learnable convolution should work better than direct pooling, which has no learnable parameters.

However, in our case, we found that depthwise convolution works even worse than (adaptive) average pooling with the same kernel size. When I ran these experiments in 2021, my hypothesis was that the parameters of a large-kernel depthwise convolution are harder to optimize, precisely because of the very large kernel size involved.

Recently (2022-2023), some works on large-kernel convolution have also shown that depthwise convolution performs worse as the kernel size grows very large, which supports this hypothesis.

Moreover, pooling is simpler and more flexible than convolution. A convolution's settings (kernel size, stride, padding) are fixed, so it cannot adapt to different input sizes, whereas (adaptive) average pooling supports diverse input sizes. A quick illustration in PyTorch (the channel count and resolutions are arbitrary): adaptive average pooling always returns the requested output size, while a fixed-kernel convolution's output size varies with the input.
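
```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(7)     # always emits a 7x7 map
conv = nn.Conv2d(64, 64, kernel_size=7, stride=7, groups=64, bias=False)

for hw in (28, 56, 112):           # three different input resolutions
    x = torch.randn(1, 64, hw, hw)
    print(hw, tuple(pool(x).shape[-2:]), tuple(conv(x).shape[-2:]))
# 28 (7, 7) (4, 4)
# 56 (7, 7) (8, 8)
# 112 (7, 7) (16, 16)
```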

Does this explanation answer your question?

clelouch commented 1 year ago

Thanks so much for your kind help.