yhw-yhw / TalkSHOW

This is the official repository for TalkSHOW: Generating Holistic 3D Human Motion from Speech [CVPR2023].
Possible Miscalculation of MSELoss in Face Generator #20

Open cchadj opened 1 year ago

cchadj commented 1 year ago

I would like to report a possible miscalculation of the loss in the face generator.

Issue description

Please have a look at the following code snippet: https://github.com/yhw-yhw/TalkSHOW/blob/38aab300b0aba6fc631ad139f62a6cea87261a0c/nets/smplx_face.py#L155-L159

I believe the loss calculation at line 155 is wrong: the slice should stop at index 3, not 6, because jaw_pose has only 3 dimensions.

As a reminder, pred_poses has shape (N, seq_length, 103): the first 3 dimensions are the jaw_pose and the remaining 100 are the expression.

gt_poses has shape (N, seq_length, 265): the first 3 dimensions are the jaw pose, the next 3 are the left eye, and the last 100 are the expression.

When we compute MSELoss = torch.mean(torch.abs(pred_poses[:, :, :6] - gt_poses[:, :, :6])), the first 3 jaw_pose features are compared correctly, but indices 3 to 5 compare the 3 left eye features from gt_poses against the first 3 expression features from pred_poses.
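To make the misalignment concrete, here is a minimal sketch in plain Python that labels each feature column and checks which columns a slice of the last axis would pair up. The layouts follow the shapes described in this issue; the names `pred_layout`, `gt_layout`, and `mismatched`, and the `"unspecified"` label for the middle columns of gt_poses, are my own assumptions for illustration and are not taken from the repository.

```python
# Column labels for the last axis, following the shapes described in this issue.
# The "unspecified" block is an assumption: the issue only describes the first 6
# and the last 100 columns of gt_poses.
pred_layout = ["jaw"] * 3 + ["expression"] * 100                      # 103 columns
gt_layout = (["jaw"] * 3 + ["left_eye"] * 3
             + ["unspecified"] * 159 + ["expression"] * 100)          # 265 columns


def mismatched(stop):
    """Return (index, pred_label, gt_label) for every column in [:stop]
    where the prediction and ground truth hold different kinds of features."""
    pairs = zip(pred_layout[:stop], gt_layout[:stop])
    return [(i, p, g) for i, (p, g) in enumerate(pairs) if p != g]


print(mismatched(6))  # columns 3, 4 and 5 pair expression with left_eye
print(mismatched(3))  # empty: only jaw columns are compared
```

With a stop index of 6, the three extra columns pair expression features with left eye features; with a stop index of 3, every compared column is a jaw feature on both sides.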

Proposed Fix:

I believe the correct way to calculate the loss is by changing 6 to 3, as follows: MSELoss = torch.mean(torch.abs(pred_poses[:, :, :3] - gt_poses[:, :, :3])).
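As a rough numeric check of the proposed change, the sketch below evaluates both slices on a single hypothetical frame in which the jaw is predicted perfectly. It uses plain Python lists rather than tensors so it runs without PyTorch; `mean_abs_error` is a hypothetical helper that mirrors the `torch.mean(torch.abs(...))` expression for one frame, and every value is made up.

```python
def mean_abs_error(pred, gt, stop):
    """Mean absolute difference over the first `stop` features of one frame."""
    return sum(abs(p - g) for p, g in zip(pred[:stop], gt[:stop])) / stop


# One hypothetical frame; all numbers are invented for illustration.
jaw = [0.10, -0.20, 0.05]
left_eye = [0.30, 0.00, -0.10]
expression = [0.5] * 100

# Ground truth: jaw, left eye, 159 columns not described in the issue, expression.
gt_frame = jaw + left_eye + [0.0] * 159 + expression      # 265 features
# Prediction: jaw followed directly by expression, identical to the ground truth values.
pred_frame = jaw + expression                             # 103 features

loss_first_6 = mean_abs_error(pred_frame, gt_frame, 6)    # current slice
loss_first_3 = mean_abs_error(pred_frame, gt_frame, 3)    # proposed slice

print(loss_first_6, loss_first_3)
```

Under these assumptions the `:3` slice gives a loss of zero for a perfect jaw prediction, while the `:6` slice stays positive because it also penalizes the difference between expression and left eye values.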

Please let me know if my assertion is correct or whether I misunderstood something.