zhanghang1989 / ResNeSt

ResNeSt: Split-Attention Networks
https://arxiv.org/abs/2004.08955
Apache License 2.0

Why use ReLU before split-attention? #53


Andy1621 commented 4 years ago

I have read your code for split-attention and noticed that you apply ReLU before the split-attention module:
https://github.com/zhanghang1989/ResNeSt/blob/76debaa9b9444742599d104609b8ee984b207332/resnest/torch/splat.py#L48-L76
However, in MobileNetV3, CBAM-ResNet, and other models that use an attention mechanism, the activation is usually applied after the attention. Have you ever tried that and found a decrease in performance?
https://github.com/d-li14/mobilenetv3.pytorch/blob/08bcb5294a4d63a23d64b0e0371bffaa4abeff36/mobilenetv3.py#L107-L121
https://github.com/luuuyi/CBAM.PyTorch/blob/72bd6930d9c060236235849879f6ddb938c7533c/model/resnet_cbam.py#L78-L94

Hmm, I also find that in GhostNet the author does not use ReLU before the attention mechanism, and does not even use it after the attention mechanism.
https://github.com/iamhankai/ghostnet.pytorch/blob/85929d038295ea3090d5915b948f8a525e0c38f8/ghost_net.py#L86-L95
What's your opinion on using ReLU before split-attention? From my perspective, since you use a 1x1 convolution, BN, and ReLU to replace the fully-connected layer, it may be better to keep the same architecture as those models. Looking forward to your reply.
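For reference, the placement Andy1621 is asking about looks roughly like the following. This is a minimal PyTorch sketch of an SE-style channel-attention block in the MobileNetV3/GhostNet spirit; the class name `SEBlock` and the reduction ratio of 4 are illustrative, not taken from the linked repositories:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """SE-style channel attention (sketch of the pattern referenced above).

    The only ReLU sits inside the attention branch, between the two 1x1
    convs; the gated output is returned without a trailing activation.
    """
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),  # channel-wise gate in [0, 1]
        )

    def forward(self, x):
        gate = self.fc(self.pool(x))  # (N, C, 1, 1)
        return x * gate               # gated features, no ReLU afterwards
```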

zhanghang1989 commented 4 years ago

ReLU is used as the activation function between the two fc/1x1 conv layers.
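In other words, the ReLU in splat.py plays the same role as the ReLU inside an SE block's two-layer bottleneck, rather than being an activation applied to the attended features. A minimal, simplified sketch of that gating branch is below; the class name, the reduction heuristic, and the radix-softmax reshaping are illustrative and omit the cardinality grouping of the actual `SplAtConv2d`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttentionGate(nn.Module):
    """Sketch of a split-attention gating branch over `radix` splits.

    fc1 -> BN -> ReLU -> fc2 forms a small two-layer MLP (as 1x1 convs);
    the ReLU sits between fc1 and fc2, not after the attention output.
    """
    def __init__(self, channels, radix=2, reduction=4):
        super().__init__()
        inter = max(channels * radix // reduction, 32)
        self.radix = radix
        self.fc1 = nn.Conv2d(channels, inter, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(inter)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Conv2d(inter, channels * radix, kernel_size=1)

    def forward(self, splits):
        # splits: list of `radix` tensors, each of shape (N, C, H, W)
        gap = sum(splits)                         # fuse the splits
        gap = F.adaptive_avg_pool2d(gap, 1)       # global context, (N, C, 1, 1)
        gap = self.relu(self.bn1(self.fc1(gap)))  # fc1 -> BN -> ReLU
        atten = self.fc2(gap)                     # logits for radix * C channels
        n = atten.shape[0]
        atten = F.softmax(atten.view(n, self.radix, -1), dim=1)  # r-softmax
        atten = atten.view(n, self.radix, -1, 1, 1)
        # weighted sum of the splits with per-channel attention
        return sum(atten[:, i] * splits[i] for i in range(self.radix))
```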