yerfor / GeneFace

GeneFace: Generalized and High-Fidelity 3D Talking Face Synthesis; ICLR 2023; Official code
MIT License

about 3d landmark representation #69

Closed flyingshan closed 1 year ago

flyingshan commented 1 year ago

Hi, I have a question about the choice of the face representation. In the paper, I saw that the 68 3D landmarks are chosen to represent the face motion. But 3D landmarks are person-specific, since the shapes of human faces differ from each other. Since the 3DMM expression coefs are independent of the face shape, I wonder why they were not chosen as the face motion representation. Is this because the 3DMM expression coefs are harder to adapt with the domain-adaptive post-net? Have you conducted experiments on this? Looking forward to your reply!

yerfor commented 1 year ago

Hi, thanks for pointing this out!

Actually, in an early version of GeneFace, we used the expression code as the intermediate representation between audio2motion and the NeRF. However, we found that predicting the expression code (which does not lie in a Euclidean space) in audio2motion was unstable, and the domain adaptation failed frequently. So in the final version we use lm3d instead.

yerfor commented 1 year ago

Besides, lm3d is much more interpretable than the expression code, and you can edit it semantically (e.g., you could manually control the eye blink by setting the values of the eye-region landmarks).
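
For illustration, here is a minimal sketch of that kind of semantic edit, assuming the 68 landmarks follow the standard iBUG-68 ordering (indices 36-41 for the left eye, 42-47 for the right eye) and that `lm3d` is a `(T, 68, 3)` array of predicted landmark sequences; the eyelid index pairs below are my own assumption, not taken from the repo:

```python
import numpy as np

# Upper/lower eyelid landmark pairs in the iBUG-68 convention (assumed here).
LEFT_EYE  = [(37, 41), (38, 40)]
RIGHT_EYE = [(43, 47), (44, 46)]

def close_eyes(lm3d: np.ndarray) -> np.ndarray:
    """Force an eye blink by collapsing each upper eyelid landmark
    and its matching lower eyelid landmark onto their midpoint."""
    out = lm3d.copy()
    for upper, lower in LEFT_EYE + RIGHT_EYE:
        mid = 0.5 * (out[:, upper] + out[:, lower])
        out[:, upper] = mid
        out[:, lower] = mid
    return out
```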

yerfor commented 1 year ago

Btw, we also tried using `exp_lm3d = expression * exp_basis` as the motion representation. Interestingly, we found that the 3DMM expression code alone cannot fully represent the facial motion: the identity code is also necessary to represent the eye-blink motion. Based on this observation, we use `idexp_lm3d = identity * id_basis + exp * exp_basis` as the motion representation, which we found achieves the best performance.
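
As a rough sketch of what that representation looks like in code (the shapes below are assumptions for a typical BFM-style 3DMM, not the repo's exact dimensions):

```python
import numpy as np

def compute_idexp_lm3d(identity, exp, id_basis, exp_basis, keypoint_idx):
    """idexp_lm3d = identity * id_basis + exp * exp_basis, sampled at 68 keypoints.

    Assumed shapes (BFM-style, not necessarily the repo's):
      identity:     (80,)              per-video identity code
      exp:          (T, 64)            per-frame expression codes
      id_basis:     (3 * N_verts, 80)  identity PCA basis
      exp_basis:    (3 * N_verts, 64)  expression PCA basis
      keypoint_idx: indices of the 68 key vertices in the full mesh
    """
    id_offset  = id_basis @ identity   # (3 * N_verts,)
    exp_offset = exp @ exp_basis.T     # (T, 3 * N_verts)
    verts = (id_offset[None] + exp_offset).reshape(exp.shape[0], -1, 3)
    return verts[:, keypoint_idx]      # (T, 68, 3)
```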

flyingshan commented 1 year ago

Thank you for your explanation!

To provide more information: for the task of audio-to-expression, I found that SadTalker successfully trained a model to predict the expression code from audio. But as you mentioned, the domain adaptation might be difficult since the expression coef is not in a Euclidean space. (P.S. Does that mean the adaptation will fail for some videos? And since the 3DMM expression coefs are not person-specific, do we still need to adapt them?)

flyingshan commented 1 year ago

And what do you mean by using `idexp_lm3d = identity * id_basis + exp * exp_basis` as the motion representation? Do you mean the face motion is represented by a linear combination of the vertex coordinates of the mesh from the face bases, as in the original 3DMM algorithm? And what does the id_basis represent?

Looking forward to your reply!

yerfor commented 1 year ago

Hi,

Q: Does that mean the adaptation will fail for some videos? And since the 3DMM expression coefs are not person-specific, do we still need to adapt them?

A: Yes, it fails in some frames. You still need to adapt the 3DMM expression coefs, since the input space of the NeRF only contains a few thousand expression data points, which is a small subset of the full expression-coef space.

Q: What do you mean by using `idexp_lm3d = identity * id_basis + exp * exp_basis` as the motion representation? Do you mean the face motion is represented by a linear combination of the vertex coordinates of the mesh from the face bases, as in the original 3DMM algorithm? And what does the id_basis represent?

A: Yes, we reconstruct the mesh from the id and exp codes, then select 68 key landmarks from it. You can refer to data_util/face3d_helper.py for more details.
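
In pseudocode, that reconstruct-then-select step looks roughly like the following (a sketch of the idea behind data_util/face3d_helper.py; `mean_shape` and `lm68_idx` are my placeholder names, not the file's actual identifiers):

```python
import numpy as np

def reconstruct_lm3d(mean_shape, id_basis, exp_basis, identity, exp, lm68_idx):
    """Rebuild the full 3DMM mesh from the id/exp codes of one frame,
    then pick out the 68 key landmarks."""
    verts = mean_shape + id_basis @ identity + exp_basis @ exp  # (3 * N_verts,)
    verts = verts.reshape(-1, 3)                                # (N_verts, 3)
    return verts[lm68_idx]                                      # (68, 3)
```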

flyingshan commented 1 year ago

Got it, thank u!