yohanshin / WHAM

MIT License
698 stars 75 forks

Some Questions Regarding WHAM #39

Open RohaanA opened 9 months ago

RohaanA commented 9 months ago

Hi,

The Colab notebook has finally been released, and I got to try the model on tennis footage. I was really impressed by its performance; this is the most accurate model I have seen (I have been looking for 3D models that perform well in tennis scenarios for the past few months)!

Kudos to the team for making such an excellent model! :) Here's a video of the model's performance, although it took me about 2 hours on Colab to produce it!

https://github.com/yohanshin/WHAM/assets/75722072/bcdcd8f0-7d74-4e4c-b1ef-67d918dd6de7

I have a few questions for the team regarding this output, numbered below.

  1. As you can see, the right side of the video was blank for most of the clip. How exactly does the model decide which person to track and show on the right side? Can I choose which player to display, or turn off the right-side view entirely?
  2. The model is highly accurate, but as mentioned above this output took more than 2 hours on an Nvidia T4 GPU on Google Colab. I understand that 3D models are heavy, but if I were to arrange an instance with multiple GPUs (for instance on Azure), would the model automatically use the extra GPUs to speed up the render time? Also, which GPU would generally be preferred for this kind of processing?
  3. I want to move this 3D animation from the video into Blender. VIBE originally had a plugin that could do this, and the same script also worked for TCMR. I haven't tried it yet, but do you think it will also work for converting this model's output to Blender? (I am talking about this script.)
  4. I only want the model to detect the two tennis players, not the people outside the court. I have a YOLOv8 model trained to detect only the two tennis players. Is it possible to plug that model into the detection phase? If not, does the team suggest any alternative? (One other option could be cropping/zooming the video.)

Once again, I was really impressed by the model's performance. I am sure it will set a new benchmark for future HMR models :)

yohanshin commented 9 months ago

Hi @RohaanA ,

Thank you for your continued interest in WHAM; I am glad it finally worked out :).

  1. The assignment of the subject whose global motion is rendered happens in this line. The current method simply selects the subject who appears in the video for the longest time. You could write a bit of code to assign the subject you want to visualize instead (see the sketch at the end of this list). But since your video consists of multiple shots rather than a single continuous camera, I would recommend splitting it into individual shots (i.e., continuous segments taken by one main camera).

  2. Yes, it takes quite a long time to run on Google Colab (with a free GPU). I tested the model with various GPUs (A100, RTX 3090, ...) and those allow reasonably fast inference; it can be even faster when run in batch mode. I currently don't have an implementation that supports a multi-GPU environment. You could try installing it on your local machine if you have an Nvidia GPU with more than 12 GB of memory.

  3. I will modify the demo code to save the output in the same format as VIBE, which should allow you to run the same script; I think it will work. Give it a shot later this week!

  4. If you have a YOLO model that detects only the players, that should work. You can try modifying this line to point to your fine-tuned YOLO weights. Please let me know how it works!
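
For reference, a minimal hypothetical sketch of assigning an explicit subject might look like this; the variable names (e.g. results, frame_ids) are assumptions about the demo's per-subject output rather than the exact code, so adapt them accordingly:

```python
# Hypothetical sketch only: pick an explicit subject to render instead of
# the longest-visible one. "results" and "frame_ids" are assumed names for
# the demo's per-subject output, not WHAM's exact API.

def pick_subject(results, subject_id=None):
    """Return the track ID to visualize."""
    if subject_id is not None and subject_id in results:
        return subject_id
    # Default behaviour: the subject that appears in the most frames.
    return max(results, key=lambda sid: len(results[sid]["frame_ids"]))

# e.g. render_id = pick_subject(results, subject_id=3)
```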

RohaanA commented 9 months ago

Thanks :D! I'm currently on vacation and will be back next week to try this out. If you like, you can keep this issue open until then, or I can open a new one when I report my findings.

RohaanA commented 9 months ago

Hey @yohanshin, I finally got back home and tried out the --save_pkl flag. Firstly, I noticed that with this flag the Colab environment is not able to fully complete the demo script.

image

I am not exactly sure what causes this, but after the 2D detection and feature extraction complete, the rendering part of the demo pipeline doesn't run.

However, a PKL file was still created by WHAM. I tried using it with the FBX script, but I got this error.

image

I haven't gone through the VIBE error, but I think it might be due to the abrupt ending of the demo script.

yohanshin commented 9 months ago

Hi @RohaanA

I just updated the visualization code; there was a variable name mismatch introduced when I fixed the output file format. Please try the rendering again.

I am not sure about Blender. I will check when I have a chance to go over that part.

RohaanA commented 9 months ago

> Hi @RohaanA
>
> I just updated the visualization code; there was a variable name mismatch introduced when I fixed the output file format. Please try the rendering again.
>
> I am not sure about Blender. I will check when I have a chance to go over that part.

Thank you for the quick response! I think some problem occurred in saving/loading of the pickle file.

RohaanA commented 9 months ago

Hey @yohanshin, I am terribly sorry; the pickle load failure was an error on my end. I had a much older version of joblib (0.14.0), while WHAM uses 1.3.2. Updating joblib fixed the issue!

I was able to get the poses into Blender.

image
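
In case anyone runs into the same thing, here is the quick sanity check I could have done (the pickle path is just an example from my setup; point it at your own output):

```python
# Quick sanity check for the joblib mismatch above; the pickle path below
# is only an example -- use the path of your own WHAM output.
import joblib

print("joblib version:", joblib.__version__)   # should be 1.3.x, not 0.14.x
results = joblib.load("output/demo/my_video/wham_output.pkl")
print(type(results))
```
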
lalasray commented 9 months ago

> Hey @yohanshin, I am terribly sorry; the pickle load failure was an error on my end. I had a much older version of joblib (0.14.0), while WHAM uses 1.3.2. Updating joblib fixed the issue!
>
> I was able to get the poses into Blender.
>
> image

Hi @RohaanA, can you please share your code for importing the output into Blender? Thanks in advance.

RohaanA commented 9 months ago

> Hey @yohanshin, I am terribly sorry; the pickle load failure was an error on my end. I had a much older version of joblib (0.14.0), while WHAM uses 1.3.2. Updating joblib fixed the issue! I was able to get the poses into Blender.
>
> image
>
> Hi @RohaanA, can you please share your code for importing the output into Blender? Thanks in advance.

Sure, I am using VIBE's script to convert the output. My environment is on Windows with a Miniconda setup; here's the script: https://github.com/mkocabas/VIBE?tab=readme-ov-file#fbx-and-gltf-output-new-feature

To set up the script, you can follow the installation tutorials under that link.

image
MehranRastegarSani commented 9 months ago

@yohanshin First, I would like to express my appreciation for your fantastic work. Like @RohaanA, I used the VIBE plugin and turned the PKL output of the WHAM algorithm into FBX. I rendered the FBX file in Blender and Unity. The problem is that the character is always centred in every frame: all of the character's poses and movements happen at the same point. I have the same problem with the VIBE model. Do you have any suggestions to fix this?

RohaanA commented 9 months ago

> @yohanshin First, I would like to express my appreciation for your fantastic work. Like @RohaanA, I used the VIBE plugin and turned the PKL output of the WHAM algorithm into FBX. I rendered the FBX file in Blender and Unity. The problem is that the character is always centred in every frame: all of the character's poses and movements happen at the same point. I have the same problem with the VIBE model. Do you have any suggestions to fix this?

This isn't actually a problem but intended behaviour, since VIBE's script does not carry over positional information, unlike WHAM. I believe you would need to add the 3D world translation in a later step, since the script is not designed for it; a rough sketch of that idea is below.
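
This isn't tested, but in Blender it could look something like the following, assuming the pickle holds a per-frame world translation under something like a trans_world key (check your own file for the exact name) and that the FBX rig is already imported:

```python
# Rough Blender (bpy) sketch: keyframe the armature root location from the
# per-frame world translation in WHAM's pickle. The key name "trans_world"
# and the object name "Armature" are assumptions -- inspect your own file.
import bpy
import joblib

data = joblib.load("/path/to/wham_output.pkl")
subject = data[next(iter(data))]            # first tracked subject
trans = subject["trans_world"]              # assumed (num_frames, 3) array

rig = bpy.data.objects["Armature"]          # name of the imported FBX rig
for frame_idx, t in enumerate(trans):
    # SMPL/WHAM is Y-up while Blender is Z-up, so an axis swap is likely
    # needed; adjust the mapping if the motion looks rotated.
    rig.location = (float(t[0]), float(t[2]), float(t[1]))
    rig.keyframe_insert(data_path="location", frame=frame_idx + 1)
```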

RohaanA commented 9 months ago

> [...]
>
>   4. If you have a YOLO model that detects only the players, that should work. You can try modifying this line to point to your fine-tuned YOLO weights. Please let me know how it works!

Hey @yohanshin,

I finally got around to testing that last part. It does indeed work! As you can see in the video below, it now detects only the two players and not the other people in the video. It missed the far-side player at the end of the video, which means our custom model still needs some improvements, haha. I'd also like to share that using our own model (trained on YOLOv8s) instead of the default YOLOv8x brought the end-to-end rendering time down from 24 minutes to 12 minutes!

https://github.com/yohanshin/WHAM/assets/75722072/8dcd04fb-20d9-490c-b618-1c9437838a8d

I think this resolves all the queries I had for this issue thread. As a bonus for anyone with their own YOLOv8 models: you can enable tracking by replacing model.predict with model.track in detector.py (lib/models/preproc)!
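
For anyone curious what that change boils down to, here is a rough standalone sketch using the Ultralytics API (this is not WHAM's actual detector.py; the weight and video filenames are just placeholders):

```python
# Illustrative sketch (Ultralytics API), separate from WHAM's detector.py:
# swapping predict() for track() with a custom-trained YOLOv8 checkpoint.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8s_players.pt")        # your fine-tuned weights (example name)

cap = cv2.VideoCapture("match.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # persist=True keeps track IDs consistent across consecutive frames.
    results = model.track(frame, persist=True, verbose=False)
    boxes = results[0].boxes              # detections; .id holds track IDs
cap.release()
```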

carlosedubarreto commented 9 months ago

Hello @RohaanA, I think this might be interesting to you. I was able to make a script based on the VIBE one that can extract the motion (including the translation) from the PKL file generated on Google Colab.

You can find instructions at: https://github.com/yohanshin/WHAM/issues/52

I recorded a video showing the process on Google Colab (which you already know) and how to use the script in Blender: https://www.youtube.com/watch?v=7heJSFGzxAI

And if you have Windows and want to install locally, I put the notes I wrote for myself here: https://github.com/yohanshin/WHAM/issues/53

Hope it helps.

RohaanA commented 9 months ago

> Hello @RohaanA, I think this might be interesting to you. I was able to make a script based on the VIBE one that can extract the motion (including the translation) from the PKL file generated on Google Colab.
>
> You can find instructions at: https://github.com/yohanshin/WHAM/issues/52
>
> I recorded a video showing the process on Google Colab (which you already know) and how to use the script in Blender: https://www.youtube.com/watch?v=7heJSFGzxAI
>
> And if you have Windows and want to install locally, I put the notes I wrote for myself here: https://github.com/yohanshin/WHAM/issues/53
>
> Hope it helps.

Hey, that is really interesting! I'll surely try this out and post my results here! :)

jlnk03 commented 9 months ago

Hey everyone, I was experimenting with WHAM and tried converting the output to FBX. However, something seems wrong with the conversion: the hand position does not match the output rendered by WHAM. Do you have any idea how I could fix this? I am also using the VIBE script.

https://github.com/yohanshin/WHAM/assets/95719434/cfda6dc1-b6c4-47d6-be24-d6f7f5060c37

https://github.com/yohanshin/WHAM/assets/95719434/c53b2622-995b-4a69-a39c-384403590b78

@yohanshin, thank you for this great repo. Would you happen to have any suggestions for improving the accuracy on this specific task of predicting the 3D pose of a golf swing? (Fine-tuning on domain-specific data?) As you can see in the videos, the model struggles a little with the feet when there is a large twist in the body. For reference, here is the input video:

https://github.com/yohanshin/WHAM/assets/95719434/06f7c227-27b3-443d-9983-9dab296d3e4e

yohanshin commented 9 months ago

I sincerely thank @carlosedubarreto and @RohaanA for sharing various WHAM demo outputs and guidelines for Blender users.

Hi @jlnk03, one way to improve the output quality is to run SMPLify as post-processing. I have made a few changes in this new branch; if you run demo.py with the --run_smplify flag, it will run temporal SMPLify, similar to the VIBE script.

Here is an example result from running the new demo on your video:

https://github.com/yohanshin/WHAM/assets/46889727/90516e5f-f33a-4626-9fe0-0296624d83e8

This new demo script improves the pixel alignment (and gave better 3D accuracy when I evaluated it on benchmarks), but the feet are still not perfect. What I can suggest is to obtain 2D keypoints for the foot joints (such as toes and heels) and add an additional reprojection loss term over those joints; that should resolve the issue. Let me know how this goes.
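
To illustrate the idea, here is a minimal sketch of such a loss term (this is not the actual SMPLify code in the branch; the joint indices and tensor shapes are assumptions you would adapt to your keypoint convention):

```python
# Sketch of a confidence-weighted 2D reprojection loss that up-weights
# foot joints. Joint indices and shapes are assumptions; adapt them to
# whatever keypoint convention your detector uses.
import torch

def foot_weighted_reproj_loss(pred_2d, gt_2d, conf,
                              foot_idx=(15, 16, 17, 18), foot_weight=5.0):
    """
    pred_2d: (F, J, 2) projected model joints
    gt_2d:   (F, J, 2) detected 2D keypoints (with toes/heels included)
    conf:    (F, J)    detection confidences in [0, 1]
    """
    weights = torch.ones_like(conf)
    weights[:, list(foot_idx)] *= foot_weight
    residual = ((pred_2d - gt_2d) ** 2).sum(dim=-1)   # (F, J) squared error
    return (weights * conf * residual).mean()
```

Higher foot_weight values trade overall alignment for foot accuracy, so it is worth sweeping a few values.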

jlnk03 commented 8 months ago

Hi @yohanshin, running SMPLify definitely helps to improve the output, thank you!

I added additional loss terms for feet and hands to the forward pass of the SMPLifyLoss class (see picture), which basically just give more weight to these specific joints. Is this what you meant? Unfortunately, it does not improve the output for me. As a reference, I plotted the 2D keypoints on the video.

Regarding the reprojection loss for the feet, how would you implement this for toes and heels, since the 2D detections follow the COCO convention and only contain the ankles?

reprojection_feet_hands

https://github.com/yohanshin/WHAM/assets/95719434/cd20e3d4-cab5-427c-b231-3aa39810d1cd

I am also still struggling to get the Blender output to match the prediction. Below is the Blender output for the first frame of the first demo video on Colab (IMG_9732.mov). As you can see, both the hand position and the knee angle are incorrect. I am not quite sure whether this is how WHAM outputs the joints or a mistake on my side. Maybe @RohaanA has an idea?

blender_demo_front blender_demo_side
MehranRastegarSani commented 8 months ago

@yohanshin @carlosedubarreto @RohaanA Hi, thanks again for your fantastic work and contributions. I tried some videos captured with 360 cameras (Insta360 X2) and the results were better than I expected. However, the issue is that with equirectangular video, WHAM fails to estimate the correct position of the person in the real world. In other words, the 'trans' parameter calculated by WHAM is wrong for 360 videos. Do you have any idea how I can compute this parameter correctly for 360 cameras?

carlosedubarreto commented 8 months ago

Hello @MehranRastegarSani, I don't have experience with that part, but I remember there is a way to change the camera config, I think (it's my guess).

Maybe that is something that could help get a better result. The part about calibration is on the main page; here is a screenshot:

image