I found that this model's parameter count and FLOP count are much smaller than HRNet's, yet the memory it occupies during training is particularly large. Why is this? Is it a characteristic of ViT? Thank you for your answer.

Hi, @FlyuZ. The number of parameters of this model is indeed smaller than HRNet's, but its computation and memory footprint are usually larger. You are right that this can be attributed to the characteristics of the Transformer: self-attention computes pairwise inner products between all input tokens, which requires only a few weight parameters but produces large intermediate activations, whereas a CNN mainly computes matrix multiplications between the input and the convolution kernel weights.
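For intuition, here is a minimal PyTorch sketch (the layer sizes below are illustrative, not the model's actual configuration) comparing the parameter counts of a self-attention layer and a 3x3 convolution with the size of the intermediate activations each one produces:

```python
# Minimal sketch: parameters vs. intermediate activations for
# self-attention and a 3x3 convolution (illustrative sizes only).
import torch
import torch.nn as nn

C = 256            # channels / embedding dim (assumed for illustration)
N = 64 * 48        # number of tokens, e.g. a 64x48 feature map flattened
heads = 8

attn = nn.MultiheadAttention(embed_dim=C, num_heads=heads)
conv = nn.Conv2d(C, C, kernel_size=3, padding=1)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

# Self-attention needs only the Q/K/V/output projections (~4*C*C weights),
# while the 3x3 convolution stores 9*C*C kernel weights.
print(f"attention params: {num_params(attn):,}")
print(f"conv 3x3 params:  {num_params(conv):,}")

# But self-attention materializes an N x N score matrix per head
# (pairwise inner products between all tokens), whereas the convolution
# only produces a C x N output feature map.
attn_score_activations = heads * N * N
conv_output_activations = C * N
print(f"attention score activations: {attn_score_activations:,} floats")
print(f"conv output activations:     {conv_output_activations:,} floats")
```

Running this shows the attention layer has fewer than half the weights of the 3x3 convolution, but its N x N attention matrices are orders of magnitude larger than the convolution's output feature map. During training these activations must be kept for the backward pass, which is why GPU memory is dominated by the Transformer blocks even though the parameter count is small.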