yuhuan-wu / P2T

[TPAMI22] Pyramid Pooling Transformer for Scene Understanding

Questions about your ablation studies #1

Closed pp00704831 closed 3 years ago

pp00704831 commented 3 years ago

Hello,

I have some questions about your ablation studies of pyramid pooling. Could you give more details about the baseline version in Table 9? First, you say that you replace P-MHSA with an MHSA that uses a single pooling operation; what are the details of this single pooling operation (e.g., the pooling ratio)? Second, did you compare your method with the original MHSA?

yuhuan-wu commented 3 years ago

We cannot directly train a model with the original MHSA. As described in the introduction of our paper, the computational cost and memory usage of the original MHSA are quadratic in the image size. The memory usage can even exceed the limit of an NVIDIA A100. The original MHSA might achieve better results, but it is impractical to implement under current hardware constraints. BTW, for comparison, we replace P-MHSA with a modified MHSA that uses a single average pooling operation to generate k and v, where the pooling ratio is the smallest one used in P-MHSA.
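For illustration, here is a minimal PyTorch sketch of such a pooled-MHSA baseline. The class name, layer layout, and the `pool_ratio` value are assumptions for this sketch, not the exact code of this repo:

```python
import torch
import torch.nn as nn

class PooledMHSA(nn.Module):
    """Sketch: MHSA where k/v are downsampled by a single average pooling.

    `pool_ratio` is illustrative; per the reply above, it should be the
    smallest pooling ratio used in P-MHSA.
    """
    def __init__(self, dim, num_heads=8, pool_ratio=12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Fixed kernel = stride = pool_ratio; without padding, border
        # pixels are dropped when H or W is not divisible by the ratio
        # (the very issue discussed later in this thread).
        self.pool = nn.AvgPool2d(kernel_size=pool_ratio, stride=pool_ratio)

    def forward(self, x, H, W):
        # x: (B, N, C) with N == H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Pool the spatial map once before computing k and v.
        x2d = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = self.pool(x2d).reshape(B, C, -1).transpose(1, 2)  # (B, N', C)

        kv = self.kv(pooled).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4).unbind(0)  # each: (B, heads, N', head_dim)

        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

With a pooling ratio r, the k/v sequence shrinks from N to roughly N/r², so the attention matrix costs O(N·N/r²) instead of the O(N²) that makes the original MHSA infeasible at these resolutions.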

pp00704831 commented 3 years ago

Thank you for your reply. What about the strides for pooling? For example, when the pooling ratio equals 12, does that mean you use kernel size = 12 and stride = 12?

yuhuan-wu commented 3 years ago

> Thank you for your reply. What about the strides for pooling? For example, when the pooling ratio equals 12, does that mean you use kernel size = 12 and stride = 12?

Almost right. However, without padding, this may leave some pixels out of the computation (e.g., when the feature map size is not divisible by the pooling ratio). To address this, we use an adaptive kernel size, stride, and padding to ensure that all pixels are included in the computation.
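In PyTorch, one way to get this "all pixels covered" behavior is adaptive average pooling, which chooses the window per output position automatically so the whole input contributes. A small sketch, assuming adaptive pooling stands in for the adaptive kernel/stride/padding described above (the function name and ratio are illustrative):

```python
import math
import torch
import torch.nn as nn

def pool_all_pixels(x, pool_ratio=12):
    """Downsample by roughly `pool_ratio` while covering every input pixel.

    `pool_ratio` is illustrative; adaptive pooling is one stand-in for the
    adaptive kernel/stride/padding mentioned above, not the repo's exact code.
    """
    B, C, H, W = x.shape
    # Round the target size up so leftover border pixels still fall inside
    # some pooling window instead of being cropped away.
    out_h = math.ceil(H / pool_ratio)
    out_w = math.ceil(W / pool_ratio)
    return nn.AdaptiveAvgPool2d((out_h, out_w))(x)

# A 50x50 map with ratio 12: a fixed kernel/stride of 12 without padding
# would only cover 48x48 pixels; adaptive pooling uses all 50x50.
x = torch.randn(1, 64, 50, 50)
print(pool_all_pixels(x).shape)  # torch.Size([1, 64, 5, 5])
```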