y-zheng18 / GIMO

Official repo of our ECCV 2022 paper "GIMO: Gaze-Informed Human Motion Prediction in Context"

Obtaining the smplx parameters in scene coordinates: confusing points about align_smpl.py and eval_dataset.py #2

Open · EricGuo5513 opened this issue 1 year ago

EricGuo5513 commented 1 year ago

Hi, thanks for releasing these great assets. I am trying to align the smplx parameters with the scene world, and I have a few questions after looking into eval_dataset.py and align_smpl.py. I am not an expert in 3D transformations, so I hope you don't mind my "silly" questions:

I feel some parts in eval_dataset.py and align_smpl.py are not consistent:

  1. How can we get the smpl.obj from the smplx.pkl? Should we pass the smplx.pkl to VPoser to obtain the pose parameters, and then use smplx to obtain the shape vertices? I found some related code in utils/vis_utils.py, but I am not sure whether further transformations are needed.
  2. In align_smpl.py, the smpl.obj is loaded, rescaled (made larger), and then transformed into scene coordinates using the pose2scene RT matrix. In eval_dataset.py, on the other hand, we do nothing to smpl.obj, but instead rescale and transform the smpl parameters (specifically the global orientation and translation). Also, in eval_dataset.py we load scene_downsampled.ply instead of the textured mesh. There are some other confusing parts as well.

I know these are A LOT of questions. I would greatly appreciate it if you could help clarify them. I believe the answers can also be helpful for other beginners using this dataset.

y-zheng18 commented 1 year ago

Hi, thanks for asking! For the 1st question, yes, the obj files are obtained from the pkl files. No further transformations are needed.
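Roughly, the pipeline looks like this (a minimal sketch, not the repo's code: the pkl keys, file paths and the VPoser V02 loading API below are assumptions on my side, so please check utils/vis_utils.py for the exact names):

```python
# Sketch: decode a VPoser latent, run the SMPL-X layer, export the mesh as .obj.
# The pkl keys ('latent', 'beta', 'orient', 'trans') and file names are assumptions.
import pickle
import torch
import smplx
import trimesh
from human_body_prior.tools.model_loader import load_model
from human_body_prior.models.vposer_model import VPoser

with open('frame_0000.pkl', 'rb') as f:          # hypothetical file name
    params = pickle.load(f)

# Decode the VPoser latent into a 21-joint axis-angle body pose.
vposer, _ = load_model('vposer_v02_05', model_code=VPoser,
                       remove_words_in_model_weights='vp_model.', disable_grad=True)
latent = torch.tensor(params['latent'], dtype=torch.float32).reshape(1, -1)
body_pose = vposer.decode(latent)['pose_body'].reshape(1, -1)   # (1, 63)

# Run the SMPL-X body model to get the mesh vertices (still in smplx coordinates).
model = smplx.create('smplx_models', model_type='smplx', gender='neutral', use_pca=False)
out = model(betas=torch.tensor(params['beta'], dtype=torch.float32).reshape(1, -1),
            global_orient=torch.tensor(params['orient'], dtype=torch.float32).reshape(1, 3),
            transl=torch.tensor(params['trans'], dtype=torch.float32).reshape(1, 3),
            body_pose=body_pose)

trimesh.Trimesh(out.vertices.detach().numpy()[0], model.faces).export('smpl.obj')
```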

For question 2, scene_downsampled.ply is downsampled from textured_output.obj; you can simply open them together in MeshLab to verify that. For the transformations, there are essentially 3 coordinate systems: the smplx coordinate system (from the mocap device), the gaze coordinate system (from the HoloLens 2), and the scene coordinate system (from the 3D scanner). In align_smpl.py, what we are doing is transforming smplx into the scene coordinate space:

$$X_{scene} = W_{p2s}\, s\, X_p$$

where $s$ is the scale factor and $W_{p2s}=[R|t]$ (the pose2scene matrix) transforms the scaled smplx vertices $X_p$ into the 3D scene space. In eval_dataset.py we are basically doing the same thing, but there are several points to pay attention to:

(1) The 3D scene points are transformed into the canonical space using transform_norm.txt so that the PointNet++ backbone can extract more informative features. Aligning smplx to the transformed scene is therefore:

$$X_{scene}' = W_n X_{scene} = W_n W_{p2s}\, s\, X_p = [R_n|t_n][R|t]\, s\, X_p$$

Note that this is equivalent to:

$$\frac{1}{s} X_{scene}' = [R_n|t_n/s][R|t/s]\, X_p$$

(2) In eval_dataset.py we don't use the smplx vertices $X_p$; instead we use the global translation, global orientation and the latent vector to represent smplx poses. So we transform the global orientation $R_g$ and translation $t_g$ using the equation above, since $X_p=[R_g|t_g]\, X_{local}$.
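In other words, the per-frame global orientation and translation can be remapped by composing the homogeneous matrices. A rough numpy sketch (placeholder values, not the repo's code; eval_dataset.py may additionally handle the smplx pelvis offset, which is ignored here):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def to_4x4(R, t):
    """Build a homogeneous [R|t] matrix."""
    M = np.eye(4)
    M[:3, :3], M[:3, 3] = R, t
    return M

# Placeholder inputs: pose2scene = [R|t], transform_norm = [R_n|t_n], scale s,
# plus the per-frame smplx global orientation (axis-angle) and translation.
R, t = np.eye(3), np.array([1.0, 0.0, 0.5])
R_n, t_n = np.eye(3), np.array([0.0, -0.2, 0.0])
s = 1.1
global_orient = np.array([0.1, 0.2, 0.3])
transl = np.array([0.3, 0.9, -0.4])

R_g = Rotation.from_rotvec(global_orient).as_matrix()

# (1/s) X_scene' = [R_n | t_n/s] [R | t/s] [R_g | t_g] X_local
W = to_4x4(R_n, t_n / s) @ to_4x4(R, t / s) @ to_4x4(R_g, transl)

new_global_orient = Rotation.from_matrix(W[:3, :3]).as_rotvec()  # = R_n R R_g
new_transl = W[:3, 3]                                            # = R_n (R t_g + t/s) + t_n/s
print(new_global_orient, new_transl)
```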

So you can see that in eval_dataset.py we don't scale the smplx body; instead we rescale the 3D scene (the smplx global translation and orientation are left unscaled). That's where the 1/scale factor comes from. In align_smpl.py, on the other hand, we simply rescale the smplx vertices, since that is easier and we only want to visualize. Both transformations aim to align smplx and the scene, and they are essentially doing the same thing.
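As a quick sanity check that the two routes really are equivalent, here is a small numerical experiment (random assumed values, not taken from the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_rt():
    """Random rotation (via QR, det forced to +1) and a random translation."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return Q * np.sign(np.linalg.det(Q)), rng.normal(size=3)

(R, t), (R_n, t_n) = rand_rt(), rand_rt()
s = 1.7
X_p = rng.normal(size=(10, 3))                 # smplx vertices in mocap coordinates

# Route A (align_smpl.py style): scale the body, then pose2scene, then normalization.
X_scene = (R @ (s * X_p).T).T + t
X_scene_norm = (R_n @ X_scene.T).T + t_n

# Route B (eval_dataset.py style): keep the body unscaled, divide translations by s.
X_b = (R @ X_p.T).T + t / s
X_b = (R_n @ X_b.T).T + t_n / s

# Route B equals Route A divided by the scene scale s.
assert np.allclose(X_b, X_scene_norm / s)
print("both routes give the same alignment (up to the global scale s)")
```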

EricGuo5513 commented 1 year ago

Hi, I really appreciate your detailed reply. It helps me a lot.