yohanshin / WHAM


custom video inference. #7

Closed · sarsigmadelta closed this issue 10 months ago

sarsigmadelta commented 11 months ago

Thanks for sharing such great work. When will custom video inference be supported?

yohanshin commented 10 months ago

Hi @sarsigmadelta , thank you for your interest in our work! We will be pushing the Huggingface and Google Colab demos in January, which include one-shot installation and inference. Please stay tuned and we will post an update when it's ready!

vuxminhan commented 10 months ago

Hi @yohanshin, very impressive results on 3D human pose prediction. Looking forward to the demos to run inference on in-the-wild videos.

sarsigmadelta commented 10 months ago

Thanks for the reply, very much looking forward to the demos.

linjiangya commented 10 months ago

Looking forward to it too

khanhha commented 10 months ago

@yohanshin thanks for your work. I wonder which specific image encoder you used to extract the visual features? In your paper, you mention 4 reference works, but they all use different image encoders. It would be great if you could share the GitHub link of the image encoder your method uses. Thanks

yohanshin commented 10 months ago

Hi @khanhha , thank you for your question!

As mentioned in the paper, we tested WHAM with 3 different image encoder architectures. The first is ResNet-50, for which we borrowed the pretrained weights from SPIN. The second is HRNet-48, for which we used CLIFF; however, instead of using the pretrained weights provided by CLIFF's official repo, we used the ones provided in the BEDLAM project, as those are not trained on 3DPW (to fairly compare across encoder architectures). The last one is ViT, for which we borrowed the weights from HMR2.0. Hope this answers your question, and let me know if you need any further clarification.
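
For readers of this thread, a minimal sketch of what per-frame feature extraction with a ResNet-50 backbone could look like (torchvision's ImageNet weights are used purely for illustration; WHAM loads the pretrained weights from SPIN, BEDLAM-trained CLIFF, or HMR2.0 as described above):

```python
import torch
import torchvision.models as models

# Illustration only: ResNet-50 with the classification head removed, producing
# a 2048-d feature per cropped frame. WHAM swaps in the SPIN / BEDLAM-CLIFF /
# HMR2.0 weights instead of the ImageNet weights used here.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()  # keep the globally pooled features
backbone.eval()

@torch.no_grad()
def extract_frame_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 224, 224) person crops, normalized with ImageNet mean/std.
    Returns a (T, 2048) tensor of per-frame image features."""
    return backbone(frames)
```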

khanhha commented 10 months ago

hi @yohanshin, thanks for pointing that out. I missed that part in your paper. Appreciate your explanation.

khanhha commented 10 months ago

@yohanshin just another question to clarify. The HMR2.0 repo is pretty old and doesn't use a ViT backbone. Did I miss something there? Can you confirm which GitHub repo you used for the WHAM-B (ViT)* experiment?

yohanshin commented 10 months ago

@khanhha Sure! Please refer to this 4D-Humans repo.

khanhha commented 10 months ago

@yohanshin lovely, thanks very much. I wonder if you could help me with another question regarding inference on video. Your code requires the initial SMPL 6D pose, the initial 6D root, and the initial 3D keypoints of the first frame to start the inference. I wonder which reference/method you plan to use to obtain this information?

yohanshin commented 10 months ago

@khanhha The use of the initial pose (SMPL pose parameters and 3D keypoints) is to build the neural initialization for the RNNs. Please refer to our supplementary materials (A.3). This idea was inspired by PIP (Section 3.1.2).
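
For anyone else following the thread, a rough sketch of the neural-initialization idea (module and variable names here are hypothetical, not WHAM's actual code): a small MLP maps the first-frame state to the initial hidden and cell states of the recurrent network, instead of starting the recurrence from zeros.

```python
import torch
import torch.nn as nn

class NeuralInitRNN(nn.Module):
    """Hypothetical sketch: an MLP maps the first-frame state (e.g. SMPL pose,
    root orientation, 3D keypoints) to the initial hidden/cell states of an
    LSTM, rather than starting the recurrence from zeros."""
    def __init__(self, init_dim, input_dim, hidden_dim, num_layers=2):
        super().__init__()
        self.init_net = nn.Sequential(
            nn.Linear(init_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2 * num_layers * hidden_dim),  # h0 and c0
        )
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.num_layers, self.hidden_dim = num_layers, hidden_dim

    def forward(self, x, init_state):
        # x: (B, T, input_dim) per-frame inputs; init_state: (B, init_dim) frame-0 state
        B = x.size(0)
        h0c0 = self.init_net(init_state).view(B, 2, self.num_layers, self.hidden_dim)
        h0 = h0c0[:, 0].transpose(0, 1).contiguous()  # (num_layers, B, hidden_dim)
        c0 = h0c0[:, 1].transpose(0, 1).contiguous()
        out, _ = self.rnn(x, (h0, c0))
        return out
```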

khanhha commented 10 months ago

@yohanshin thanks, I see what you mean. But what I meant is: what method/approach do you plan to use to extract the initial SMPL pose from the first frame of a video? In your evaluation code, the initial SMPL pose is taken from the dataset ground truth, but this information is not available for in-the-wild videos.

yohanshin commented 10 months ago

@khanhha I see what you mean. In the evaluation code, we DO NOT use GT; we use the estimation result from an image-based model. To be specific, WHAM (ViT) uses HMR2.0 to get the 0th-frame pose, and CLIFF and SPIN for the other image encoder architectures, respectively. Does this clarify your question?

khanhha commented 10 months ago

@yohanshin everything is clear now. Thanks a lot for your time. Best wishes!

khanhha commented 10 months ago

hi @yohanshin, I would like to ask another question. I'm still struggling to achieve a good result with your method. So far the output still doesn't look as good as the result from Metrabs; it seems too smooth, so I hope there is simply something wrong in my code.

Could you help me clarify the format of the following input arguments?

[screenshot of the input arguments, including inits and frms_feats]

Could you confirm the correctness of these parameters? Looking forward to hearing from you soon. Thanks

yohanshin commented 10 months ago

Hi @khanhha , we will release the demo code with a custom video very soon (this week) and you will be able to debug each component from that.

inits: Yes, both are in COCO format. The 2D keypoints need to be in normalized form (37 dimensions after normalization), and the 3D keypoints need to be centered on the pelvis, i.e., the midpoint of the two hip joints (see the sketch below).

frms_feats: Yes, one thing I may have done differently from HMR2.0 is that I didn't use mean_params when obtaining the token.

Yes, I think you are doing it correctly. Can you share your video with me (via email)? Or if you find some public video on which you can try WHAM and share it with me, we can debug simultaneously and get some idea of the overly-smoothed results. You can shoot me an email if that works for you.
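
To illustrate the pelvis-centering and 2D normalization mentioned above, a minimal sketch (the helper names and the COCO-17 hip indices 11/12 are assumptions; the exact 37-dimensional 2D encoding follows WHAM's own demo code rather than this sketch):

```python
import numpy as np

# Assumed COCO-17 hip indices for this sketch.
LEFT_HIP, RIGHT_HIP = 11, 12

def center_keypoints3d(kp3d: np.ndarray) -> np.ndarray:
    """kp3d: (J, 3) 3D keypoints in COCO order. Translates the keypoints so
    the pelvis (mid-point of the two hip joints) sits at the origin."""
    pelvis = 0.5 * (kp3d[LEFT_HIP] + kp3d[RIGHT_HIP])
    return kp3d - pelvis

def normalize_keypoints2d(kp2d: np.ndarray, bbox_center: np.ndarray,
                          bbox_scale: float) -> np.ndarray:
    """kp2d: (J, 2) 2D keypoints in pixels. A generic bounding-box
    normalization for illustration; the exact 37-d layout (including any
    appended bbox terms) should follow WHAM's normalization code."""
    return (kp2d - bbox_center) / bbox_scale
```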

khanhha commented 10 months ago

@yohanshin thanks for your answer. So excited to hear that. I am curious to see how the results from your approach compare to the ones from Metrabs; that's such an exciting research problem. I will wait until your demo is released, run the test again, and let you know about the result. Best

yohanshin commented 10 months ago

I am closing this issue as the custom video demo is now implemented. Please reopen it if you have any other concerns.