yohanshin / WHAM


Can camera real-time inference be performed on features and 3D joint positions? #49

Open psiydown opened 8 months ago

psiydown commented 8 months ago

1. Can a faster method be used for feature extraction? Currently, feature extraction on a single batch only runs at 15 fps.

2. Can we obtain 3D joint positions at the feature-integration stage, without using SMPL, and extract them from the context? What is their format?

3. Also, can the smoothing filter parameters be adjusted? The default is too smooth, so the details of small movements cannot be seen.

4. Can it support both palm and toe inference simultaneously?

yohanshin commented 7 months ago

Hi @psiydown

  1. If you train with a lighter-weight image encoder, you can make the feature extractor faster. The default encoder, ViT, runs faster than 15 fps (excluding image loading and preprocessing), though this depends on your computing resources.

  2. The motion encoder estimates 3D joint positions without using SMPL, in the COCO-17 joint format (see the sketch after this list).

  3. Currently, the smoothing filter is only applied to the bounding box, and I believe this does not hinder the network from capturing detailed dynamics.

  4. The current version does not support hand/foot joints, but once we release the training code, you can train your own model with more expressiveness (i.e., SMPL-X with hand, foot, and face joints).
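For reference, the COCO-17 convention mentioned in point 2 is the standard COCO keypoint ordering; a short sketch (the variable name is illustrative, not from the WHAM codebase):

```python
# Standard COCO-17 keypoint ordering. Index -> joint name.
COCO_17_JOINTS = [
    "nose",                             # 0
    "left_eye", "right_eye",            # 1, 2
    "left_ear", "right_ear",            # 3, 4
    "left_shoulder", "right_shoulder",  # 5, 6
    "left_elbow", "right_elbow",        # 7, 8
    "left_wrist", "right_wrist",        # 9, 10
    "left_hip", "right_hip",            # 11, 12
    "left_knee", "right_knee",          # 13, 14
    "left_ankle", "right_ankle",        # 15, 16
]

# A per-frame prediction in this format would be an array of shape (17, 3):
# joints3d[COCO_17_JOINTS.index("left_wrist")] -> (x, y, z) of the left wrist.
```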

psiydown commented 7 months ago

Hi @yohanshin

  1. I found that using the default hmr2a model for feature extraction yields good results, while the hmr2b model yields poor results. Was WHAM optimized and trained specifically for the hmr2a model, and is there a readily available, effective alternative feature-extraction model?

  2. What I meant is whether 3D joint positions can be obtained after feature integration but before decoding. I have tested obtaining 3D joint positions directly at the encoding stage, without feature integration and decoding, and they are not accurate (for example, a hand facing backward becomes a hand facing forward). What is the format of the context, and can 3D joint positions be obtained from it?

  3. Right, I can't find any code for joint smoothing, yet the inference output is already a smooth pose. Have you implemented a smoothing algorithm inside the model, such as SmoothNet?

  4. Approximately when will the training code be released? Do you plan to include training for palm and toe joints?

Thank you for your detailed answer.

yohanshin commented 7 months ago

Hi @psiydown

  1. The provided model is trained with HMR2.0a, so simply switching the image encoder to HMR2.0b is not appropriate. I have also trained a WHAM variant that uses a ResNet-50 image encoder (trained following SPIN); it is faster but less accurate.

  2. Thanks for pointing this out. The intuition behind using 3D joint positions at the encoding stage is to explicitly add 3D information to the motion feature extracted from the sequence of 2D keypoints. The motion context is therefore the concatenation of the feature from the 2D keypoint sequence and the 3D joint positions (see the sketch after this list). I haven't actively analyzed the quality of the intermediate 3D joints, but it would be interesting to see how performance improves when the motion context is constructed from better 3D joint positions.

  3. I did not use any pose-smoothing algorithm in WHAM. We observe smooth predictions because WHAM is trained on large-scale motion data (AMASS), so the motion encoder and decoder learn a human motion prior.

  4. I just started cleaning up my codebase to release the training framework. It won't include the expressive model, but that can easily be implemented by using SMPLify-X as post-processing.
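A minimal sketch of the motion-context concatenation described in point 2. The sequence length, feature dimension, and flattening step are illustrative assumptions, not WHAM's actual code; only the concatenation itself is stated above:

```python
import torch

batch, frames = 1, 81      # assumed sequence length
motion_feat_dim = 512      # assumed dimensionality of the 2D-keypoint feature
n_joints = 17              # intermediate 3D joints in COCO-17 format

# Feature extracted by the motion encoder from the 2D keypoint sequence.
motion_feature = torch.randn(batch, frames, motion_feat_dim)

# Intermediate 3D joint positions estimated at the encoding stage.
joints3d = torch.randn(batch, frames, n_joints, 3)

# The motion context concatenates the two along the channel dimension.
motion_context = torch.cat(
    [motion_feature, joints3d.flatten(start_dim=2)],  # (B, T, 512 + 51)
    dim=-1,
)
print(motion_context.shape)  # torch.Size([1, 81, 563])
```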

Tiandishihua commented 7 months ago

> 1. Can a faster method be used for feature extraction? Currently, feature extraction on a single batch only runs at 15 fps.
>
> 2. Can we obtain 3D joint positions at the feature-integration stage, without using SMPL, and extract them from the context? What is their format?
>
> 3. Also, can the smoothing filter parameters be adjusted? The default is too smooth, so the details of small movements cannot be seen.
>
> 4. Can it support both palm and toe inference simultaneously?

Have you achieved real-time mocap?

psiydown commented 7 months ago

No, the feature-extraction part only runs at 15 FPS on an RTX 3060. I tried to accelerate the model with half-precision inference, but the author trained WHAM only on features from the full-precision feature model, so the final WHAM motion estimates become inaccurate after half-precision acceleration. In 4D Humans, by contrast, using a half-precision model has almost no impact on motion accuracy. @yohanshin Could you retrain the WHAM model on features from a half-precision feature model (`model.half()`) to achieve real-time inference? Thanks.
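A minimal sketch of the half-precision experiment described above, using torchvision's ViT-B/16 as a stand-in for the actual image encoder (the model, batch size, and resolution are assumptions). The autocast variant keeps FP32 weights and often degrades accuracy less than a blanket `model.half()`:

```python
import torch
import torchvision

device = "cuda"  # assumes a CUDA GPU, as in the RTX 3060 setup above
extractor = torchvision.models.vit_b_16(weights=None).to(device).eval()
images = torch.randn(8, 3, 224, 224, device=device)

# Option A: blanket FP16 via model.half(), as tried above. Fast, but a
# downstream model trained on FP32 features may degrade on these outputs.
with torch.no_grad():
    feats_fp16 = extractor.half()(images.half())

# Option B: mixed precision via autocast, keeping FP32 master weights.
# This often stays closer to the FP32 outputs than a blanket .half().
extractor = extractor.float()
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    feats_amp = extractor(images)
```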

Tiandishihua commented 7 months ago

> No, the feature-extraction part only runs at 15 FPS on an RTX 3060. I tried to accelerate the model with half-precision inference, but the author trained WHAM only on features from the full-precision feature model, so the final WHAM motion estimates become inaccurate after half-precision acceleration. In 4D Humans, by contrast, using a half-precision model has almost no impact on motion accuracy. @yohanshin Could you retrain the WHAM model on features from a half-precision feature model (`model.half()`) to achieve real-time inference? Thanks.

Thank you very much for your help.

gpastal24 commented 1 week ago

Hi @yohanshin

Could you provide the R50 and HRNet variants? I expect R50 to be much faster than ViT-H. Also, if I train HMR2 with a smaller ViT backbone, will I have to retrain your model as well?