yangsenius / TransPose

PyTorch Implementation for "TransPose: Keypoint localization via Transformer", ICCV 2021.
https://github.com/yangsenius/TransPose/releases/download/paper/transpose.pdf
MIT License

Implementation of MPII set #18

wangdong0556 closed this issue 3 years ago

wangdong0556 commented 3 years ago

Hello, I am very happy to see such excellent work. How can I use this project to train and test on MPII? I adjusted some parameters, but the results are not very good. Have you done any work on this dataset?

In addition, what is the basis for determining the number of heads for different models?

Thank you!

yangsenius commented 3 years ago

Thanks for your interest!

This repo is based on DarkPose. You only need to change some settings in the experiment YAML files for MPII; see DarkPose for reference.
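For concreteness, here is a rough sketch (my assumption, not this repo's exact code) of the kinds of fields that typically change when an experiment is switched from COCO to MPII in an HRNet/DarkPose-style, yacs-based config system. The key names and paths are illustrative and should be checked against the YAML files used by this repo.

```python
# Hypothetical sketch of typical COCO -> MPII config overrides; the key names
# and paths are illustrative, not taken from this repo's YAML files.
from yacs.config import CfgNode as CN

def mpii_overrides():
    cfg = CN(new_allowed=True)
    cfg.DATASET = CN(new_allowed=True)
    cfg.MODEL = CN(new_allowed=True)

    cfg.DATASET.DATASET = 'mpii'        # dataset class instead of 'coco'
    cfg.DATASET.ROOT = 'data/mpii/'     # illustrative path to MPII images/annotations
    cfg.DATASET.TRAIN_SET = 'train'
    cfg.DATASET.TEST_SET = 'valid'
    cfg.MODEL.NUM_JOINTS = 16           # MPII annotates 16 keypoints (COCO has 17)
    cfg.MODEL.IMAGE_SIZE = [256, 256]   # MPII models are commonly trained at 256x256
    cfg.MODEL.HEATMAP_SIZE = [64, 64]   # 1/4 of the input resolution
    return cfg

print(mpii_overrides())
```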

I adjust the number of heads according to the dimension of the query/key vectors, so that the dimension of each head does not become too large.

For the ResNet-S based models, d = 256, so n_heads = 256 // 32 = 8. For the HRNet-S based models, d = 96, so n_heads = 96 // 96 = 1.

Also, TransPose-H uses fewer heads in order to consume less GPU memory, because we perform self-attention on the 1/4 input resolution feature map.
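To make the arithmetic concrete, here is a tiny standalone example (not code from this repo) showing the per-head dimension d_model // n_heads for the two settings above, using PyTorch's nn.MultiheadAttention, which requires embed_dim to be divisible by num_heads:

```python
# Standalone illustration of the head-count rule above: choose n_heads so that
# the per-head query/key dimension d_model // n_heads stays moderate.
import torch
import torch.nn as nn

settings = [
    ("TransPose-R (ResNet-S, d=256)", 256, 8),   # per-head dim = 256 // 8 = 32
    ("TransPose-H (HRNet-S-W48, d=96)", 96, 1),  # per-head dim = 96 // 1 = 96
]

for name, d_model, n_heads in settings:
    attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
    tokens = torch.randn(256, 1, d_model)  # (seq_len, batch, d_model); seq_len is illustrative
    out, _ = attn(tokens, tokens, tokens)
    print(f"{name}: heads={n_heads}, per-head dim={d_model // n_heads}, output shape={tuple(out.shape)}")
```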

wangdong0556 commented 3 years ago

"For the ResNet-S based models, d = 256, so n_heads = 256 // 32 = 8. For the HRNet-S based models, d = 96, so n_heads = 96 // 96 = 1." The divisor values for ResNet and HRNet are 32 and 96, respectively. What do these values (32 and 96) mean? Is it 96 for both HRNet-S-W32 and W48?

yangsenius commented 3 years ago

Actually, they have no special meaning: 64 for HRNet-W32 and 96 for HRNet-W48.

yangsenius commented 3 years ago

The output feature map of ResNet has 512 channels, so we set d_model to 256; the output feature map of HRNet has 32 or 48 channels, so we set d_model to 64 or 96.
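As a minimal sketch of what this channel mapping can look like in practice (my own example, not the repo's exact module), a 1x1 convolution projects the backbone feature map to d_model before it is flattened into the token sequence for the Transformer:

```python
# Minimal sketch (not this repo's exact code): project the backbone output to
# d_model with a 1x1 conv, then flatten the feature map into a token sequence.
import torch
import torch.nn as nn

class BackboneToTokens(nn.Module):
    def __init__(self, in_channels: int, d_model: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, d_model, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) -> (H*W, B, d_model)
        x = self.reduce(feat)
        return x.flatten(2).permute(2, 0, 1)

# Spatial sizes below are illustrative only.
resnet_feat = torch.randn(1, 512, 32, 24)   # ResNet output: 512 channels -> d_model 256
hrnet_feat = torch.randn(1, 48, 64, 48)     # HRNet-W48 output: 48 channels -> d_model 96

print(BackboneToTokens(512, 256)(resnet_feat).shape)  # torch.Size([768, 1, 256])
print(BackboneToTokens(48, 96)(hrnet_feat).shape)     # torch.Size([3072, 1, 96])
```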

wangdong0556 commented 3 years ago

Thanks! Why is d_model set to one-half of the output feature map channels? Is it a fixed setting, or is there some other reason?

yangsenius commented 3 years ago

There is no special reason for halving or doubling the channels. We just want the channel transformation to stay within the same order of magnitude.