Hi @sarsigmadelta, thank you for your interest in our work! We will be pushing the Hugging Face and Google Colab demos in January, which will include one-shot installation and inference. Please stay tuned and we will post an update when they're ready!
Hi @yohanshin, very impressive results on 3D human pose prediction. Looking forward to the demos to run inference in the wild.
Thanks for the reply, very much looking forward to the demos!
Looking forward to it too
@yohanshin thanks for your work. I wonder what specific image encoder you used to extract the visual features? In your paper, you mention 4 reference works, but they all use different image encoders. It would be great if you could share the GitHub link of the image encoder your method uses. Thanks
Hi @khanhha, thank you for your question!
As mentioned in the paper, we tested WHAM with 3 different image encoder architectures. The first is ResNet-50, for which we borrowed the pretrained weights from SPIN. The second is HRNet-W48, for which we used CLIFF; however, instead of the pretrained weights provided by CLIFF's official repo, we used the ones provided by the BEDLAM project, as those are not trained on 3DPW (to compare fairly across encoder architectures). The last is ViT, for which we borrowed the weights from HMR2.0. Hope this answers your question, and let me know if you need any further clarification.
Hi @yohanshin, thanks for pointing that out. I missed that part in your paper. Appreciate your explanation.
@yohanshin just another question to clarify. The HMR2.0 repo is pretty old and doesn't use a ViT backbone. Did I miss something there? Can you confirm again which GitHub repo you used for the WHAM-B (ViT)* experiment?
@yohanshin lovely, thanks very much. I wonder if you can help me with another question regarding inference on video. Your code requires the initial SMPL 6D pose, initial 6D root, and initial 3D keypoints of the first frame to start the inference. Which reference/method do you plan to use to obtain this information?
@khanhha The use of the initial pose (SMPL pose parameters and 3D keypoints) is to build the neural initialization for the RNNs. Please refer to our supplementary material (A.3). This idea was inspired by PIP (Section 3.1.2).
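Roughly, the idea looks like the sketch below (module and dimension names here are illustrative placeholders, not WHAM's actual code): an MLP maps the first-frame pose and 3D keypoints to the RNN's initial hidden state instead of starting from zeros.

```python
import torch
import torch.nn as nn

class NeuralInitRNN(nn.Module):
    """Illustrative PIP-style neural initialization: the first-frame pose and
    3D keypoints are mapped to the RNN's initial hidden state instead of
    starting from zeros. Dimensions are placeholders, not WHAM's actual ones."""

    def __init__(self, pose_dim=24 * 6, kp3d_dim=17 * 3, feat_dim=1024,
                 hidden_dim=512, num_layers=2):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        # Small MLP that turns the frame-0 state into the GRU's initial hidden state
        self.init_net = nn.Sequential(
            nn.Linear(pose_dim + kp3d_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_layers * hidden_dim),
        )
        self.rnn = nn.GRU(feat_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, feats, init_pose, init_kp3d):
        # feats:     (B, T, feat_dim)  per-frame image features
        # init_pose: (B, pose_dim)     6D SMPL pose of frame 0
        # init_kp3d: (B, kp3d_dim)     pelvis-centered 3D keypoints of frame 0
        b = feats.shape[0]
        h0 = self.init_net(torch.cat([init_pose, init_kp3d], dim=-1))
        h0 = h0.view(b, self.num_layers, self.hidden_dim).transpose(0, 1).contiguous()
        out, _ = self.rnn(feats, h0)
        return out
```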
@yohanshin thanks, I see what you mean. But what I meant is: what method/approach do you plan to use to extract the initial SMPL pose from the first frame of a video? In your evaluation code, the initial SMPL pose is taken from the dataset ground truth, but this information is not available for in-the-wild videos.
@khanhha I see what you mean. In the evaluation code, we DO NOT use the GT, but rather the estimation result from an image-based model. To be specific, WHAM (ViT) uses HMR2.0 to get the 0th-frame pose, and CLIFF and SPIN for the other image encoder architectures, respectively. Does this clarify your question?
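For a custom video, that step could look roughly like the sketch below; the `image_model` wrapper and its output keys are hypothetical, and the actual call depends on which image-based model and checkpoint you use.

```python
import cv2
import torch

def get_init_pose(video_path, image_model, device="cuda"):
    """Run an image-based model (e.g. HMR2.0 / CLIFF / SPIN) on the first frame
    to obtain the initial SMPL pose and 3D keypoints. `image_model` is assumed
    to be a wrapper that handles detection/cropping and returns SMPL outputs;
    the output keys below are hypothetical."""
    cap = cv2.VideoCapture(video_path)
    ok, frame0 = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read the first frame of {video_path}")

    with torch.no_grad():
        pred = image_model(frame0)

    init_pose_6d = pred["pose_6d"]   # e.g. (num_people, 24, 6)
    init_kp3d = pred["joints3d"]     # e.g. (num_people, J, 3)
    return init_pose_6d.to(device), init_kp3d.to(device)
```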
@yohanshin everything is clear now. Thanks a lot for your time. Best wishes!
Hi @yohanshin, I would like to ask another question. I'm still struggling to get a good result from your method. So far the output still doesn't look as good as the result from Metrabs; it seems the result is too smooth. I suspect there might be something wrong in my code.
Could you help me clarify the format of the following input arguments?
inits: a tuple of the 3D/2D keypoints and SMPL 6D joints. What is the format of the 2D keypoint array and the 3D keypoint array? So far I assume they are in COCO-17 format.
frms_feats: n_frms x 1024 feature vectors extracted from 4D-Humans. I use the variable token_out from their code, as it's the only output that has 1024 dimensions.
init_root: 6D initial root orientation.
cam_angvel: in my case the camera is static, so I set all values to zero.
boxes: n_frms x 3: center_x, center_y, and scale.
Could you help me confirm the correctness of these parameters? Roughly, I construct them as in the sketch below. Looking forward to hearing from you soon. Thanks
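For context, this is roughly how I construct the camera and box inputs for my static camera (simplified; the array shapes are my own assumptions, not necessarily the official format):

```python
import numpy as np

n_frms = 300  # number of frames in my clip

# cam_angvel: for a static camera I set everything to zero; the last
# dimension (6, a 6D rotation representation) is my assumption about the
# expected shape.
cam_angvel = np.zeros((n_frms, 6), dtype=np.float32)

# init_root: initial 6D root orientation. [1, 0, 0, 0, 1, 0] is the 6D
# encoding of the identity rotation (first two columns of a 3x3 identity),
# used here only as a placeholder.
init_root = np.array([[1.0, 0.0, 0.0, 0.0, 1.0, 0.0]], dtype=np.float32)

# boxes: per-frame bounding box as (center_x, center_y, scale); a fixed box
# tiled over all frames as a placeholder.
boxes = np.tile(np.array([640.0, 360.0, 1.2], dtype=np.float32), (n_frms, 1))
```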
Hi @khanhha, we will release the demo code for custom videos very soon (this week), and you will be able to debug each component from that.
inits: Yes, both are in COCO format. The 2D keypoints need to be in normalized form (37-dimensional after normalization), and the 3D keypoints need to be centered on the pelvis (i.e., the midpoint of the two hip joints).
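As a rough illustration, the pelvis-centering could look like the sketch below; the bbox-based 2D normalization shown there is only a plausible assumption, since the exact 37-dimensional scheme is defined in the repo.

```python
import numpy as np

# COCO-17 joint order: left hip = 11, right hip = 12
LEFT_HIP, RIGHT_HIP = 11, 12

def center_kp3d_on_pelvis(kp3d):
    """Center 3D keypoints (T, 17, 3) on the pelvis, i.e. the midpoint of the
    two hip joints."""
    pelvis = 0.5 * (kp3d[:, LEFT_HIP] + kp3d[:, RIGHT_HIP])
    return kp3d - pelvis[:, None, :]

def normalize_kp2d(kp2d, boxes):
    """One plausible bbox normalization for 2D keypoints (T, 17, 2), given
    boxes (T, 3) = (center_x, center_y, scale). This is only an assumption;
    check the repo for the exact 37-dimensional format."""
    center = boxes[:, None, :2]
    scale = boxes[:, 2][:, None, None]
    return (kp2d - center) / scale
```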
frms_feats: Yes. One thing I might do differently from HMR2.0 is that I didn't use mean_params when obtaining the token.
Yes, I think you are doing it correctly. Can you share your video with me (via email)? Or, if you find a public video that you can try WHAM on, share it with me and we can debug simultaneously and get some ideas about the overly-smoothed results. You can shoot me an email if that works for you.
@yohanshin thanks for your answer. So excited to hear that. I am curious to see how the results from your approach compare to the ones from Metrabs. That's such an exciting research problem. I will wait until your demo is released, run the test again, and let you know about the result. Best
I am closing this issue as the custom video demo is now implemented. Please reopen if you have any other concerns.
Thanks for sharing such great work. When will custom video inference be supported?