I found that this model's parameter count and FLOP count are much smaller than HRNet's, yet the memory it occupies during training is particularly large. Why is this? Is it a characteristic of ViT? Thank you for your answer.

Hi, @FlyuZ. The number of parameters of this model is indeed smaller than HRNet's, but its computation and memory footprint are usually larger. You are right that this can be attributed to the characteristics of the Transformer: self-attention computes pairwise inner products between all input tokens, which requires only a few weight parameters but produces large intermediate activations, whereas a CNN mainly computes matrix multiplications between the input and the convolution kernel weights.
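For intuition, here is a minimal PyTorch sketch (the layer sizes below are illustrative, not the model's actual configuration) comparing the parameter counts of a self-attention layer and a 3x3 convolution with the size of the intermediate activations each one produces:

```python
# Minimal sketch: parameters vs. intermediate activations for
# self-attention and a 3x3 convolution (illustrative sizes only).
import torch
import torch.nn as nn

C = 256            # channels / embedding dim (assumed for illustration)
N = 64 * 48        # number of tokens, e.g. a 64x48 feature map flattened
heads = 8

attn = nn.MultiheadAttention(embed_dim=C, num_heads=heads)
conv = nn.Conv2d(C, C, kernel_size=3, padding=1)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

# Self-attention needs only the Q/K/V/output projections (~4*C*C weights),
# while the 3x3 convolution stores 9*C*C kernel weights.
print(f"attention params: {num_params(attn):,}")
print(f"conv 3x3 params:  {num_params(conv):,}")

# But self-attention materializes an N x N score matrix per head
# (pairwise inner products between all tokens), whereas the convolution
# only produces a C x N output feature map.
attn_score_activations = heads * N * N
conv_output_activations = C * N
print(f"attention score activations: {attn_score_activations:,} floats")
print(f"conv output activations:     {conv_output_activations:,} floats")
```

Running this shows the attention layer has fewer than half the weights of the 3x3 convolution, but its N x N attention matrices are orders of magnitude larger than the convolution's output feature map. During training these activations must be kept for the backward pass, which is why GPU memory is dominated by the Transformer blocks even though the parameter count is small.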