yoyo-nb / Thin-Plate-Spline-Motion-Model

[CVPR 2022] Thin-Plate Spline Motion Model for Image Animation.
MIT License

Possibility of reusing some of the facial motion and amplifying movement of selective areas. #14

Open ovshake opened 2 years ago

ovshake commented 2 years ago

Hi, congratulations on such a wonderful research project, and thank you for making it open source. I have a few questions and would appreciate your feedback on possible approaches.

  1. Is it possible to reuse some of the computations? My use case involves running the model on a few videos that differ only in their lip movements; all other parts of the face move identically. Is there a way to pre-compute these shared facial movements and only compute the mouth movements on any given run?
  2. Is it possible to amplify the lip movement? Currently the movement scale is applied to the entire image; is there a way to apply it only to the mouth area? I tried isolating the keypoints that track the mouth, but they sometimes switch to tracking something other than the mouth, which distorts it. Looking forward to hearing your thoughts on this.

Thanks!

yoyo-nb commented 2 years ago

Hello.

Both questions can be addressed with the same approach.

This work is an unsupervised method, so no prior knowledge is introduced. If you want better performance on human faces, you can replace the Keypoint Detector in this work with a pre-trained face landmark detector, choose K*5 points from the detected landmarks, and manually decide which points form each group used to compute a TPS transformation. After that, retrain the remaining network modules (the Dense Motion Network and the Inpainting Network, and also the BG Motion Predictor if there is background motion in your training data).

For example, you can use five points selected in the upper-lip region as one group to compute a TPS transformation, and similarly select 5 points from each of the lower lip, left eyebrow, left eye, right eyebrow, right eye and nose regions as further groups.
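A minimal sketch of this grouping idea, assuming a standard 68-point facial landmark array (e.g. from dlib or face_alignment) is already available; the specific index choices per region are illustrative assumptions, not prescribed by the paper:

```python
import numpy as np

# Hypothetical 5-point selections per facial region from the common 68-point layout
# (0-based indices, chosen only to approximate each region).
REGION_GROUPS = {
    "upper_lip":     [48, 50, 51, 52, 54],
    "lower_lip":     [55, 56, 57, 58, 59],
    "left_eyebrow":  [17, 18, 19, 20, 21],
    "right_eyebrow": [22, 23, 24, 25, 26],
    "left_eye":      [36, 37, 38, 40, 41],
    "right_eye":     [42, 43, 44, 46, 47],
    "nose":          [27, 30, 31, 33, 35],
}

def landmarks_to_tps_groups(landmarks: np.ndarray) -> np.ndarray:
    """Reshape 68 detected landmarks (68, 2) into K groups of 5 control points,
    i.e. a (K, 5, 2) array, one TPS transformation per group (here K = 7)."""
    groups = [landmarks[idx] for idx in REGION_GROUPS.values()]
    return np.stack(groups, axis=0)
```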

At inference time, if you only want to transfer the motion of the lip region, you can compute only the TPSs corresponding to the upper and lower lips, while the TPSs of the other regions can be computed in advance and applied to each frame of the generated video.
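A rough sketch of that inference-time reuse, matching the original use case (a shared driving video for everything except the lips). It assumes each region's TPS can be fitted independently from its 5 control points by some routine `fit_tps` (e.g. the TPS module in this repo or your own solver); the function name and the data layout are illustrative, not the repo's actual API:

```python
def combine_region_tps(src_groups, base_drv_frames, lip_drv_frames, fit_tps,
                       lip_ids=(0, 1)):
    """For every frame, reuse the TPSs of the non-lip regions fitted from the shared
    driving video, and fit only the lip-region TPSs from the run-specific lip video.

    src_groups:      (K, 5, 2) control points of the source image
    base_drv_frames: list of (K, 5, 2) control points from the shared driving video
    lip_drv_frames:  list of (K, 5, 2) control points from the lip driving video
    fit_tps:         callable (src_pts, drv_pts) -> TPS parameters
    """
    # The non-lip TPSs depend only on the shared driving video, so they can be
    # computed once and cached across runs with different lip movements.
    cached = [{k: fit_tps(src_groups[k], drv[k])
               for k in range(len(src_groups)) if k not in lip_ids}
              for drv in base_drv_frames]

    per_frame = []
    for static_tps, lip_drv in zip(cached, lip_drv_frames):
        frame_tps = dict(static_tps)
        for k in lip_ids:  # only the lip regions follow the new driving video
            frame_tps[k] = fit_tps(src_groups[k], lip_drv[k])
        per_frame.append(frame_tps)
    return per_frame
```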

ovshake commented 2 years ago

Thanks for the reply! I was wondering if there is a way to do this without retraining the model. Why is retraining the Dense Motion Network needed? Since it is trained on the task of predicting motion from the TPS transformations, does it matter whether the points used for the TPS transformations come from the self-supervised keypoint detector or are selected manually?

yoyo-nb commented 2 years ago

I don't think this can be done without retraining the model.

Because the current model is trained in an unsupervised way, the contribution of each TPS to the optical flow is not concentrated in one region as in FOMM, but scattered across various locations on the face, and good motion transfer is only obtained when multiple TPSs cooperate. If the keypoints are designed manually without retraining the Dense Motion Network, my understanding is that the network will still predict contribution maps scattered across various locations as before, resulting in a potentially poorer optical flow.

If these keypoints are designed manually and the Dense Motion Network is retrained, the contribution map of each TPS becomes concentrated in one region and is meaningful (I have experimented with this). Transferring the motion of only selected regions can then be done on that basis.
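One way to set up such a retraining run is to swap the learned Keypoint Detector for a fixed, landmark-driven module. The sketch below is a hedged illustration only: it assumes (a) a hypothetical `detect_landmarks` callable returning 68 landmarks in pixel coordinates per image, and (b) that downstream modules expect K*5 keypoints normalized to [-1, 1] under a key such as 'fg_kp'; check the repo's KPDetector for the exact output format before using it.

```python
import torch
import torch.nn as nn

class LandmarkKeypointDetector(nn.Module):
    """Drop-in stand-in for the learned Keypoint Detector that emits manually
    designed landmark groups instead of unsupervised keypoints."""

    def __init__(self, region_indices, detect_landmarks):
        super().__init__()
        self.region_indices = region_indices      # list of K lists of 5 landmark indices
        self.detect_landmarks = detect_landmarks  # callable: (B, 3, H, W) -> (B, 68, 2) pixels

    def forward(self, image: torch.Tensor) -> dict:
        b, _, h, w = image.shape
        lm = self.detect_landmarks(image)                      # (B, 68, 2) in pixels
        idx = torch.tensor(sum(self.region_indices, []), device=lm.device)
        kp = lm[:, idx]                                        # (B, K*5, 2)
        # Normalize to [-1, 1], the range the rest of the pipeline usually expects.
        scale = torch.tensor([w, h], device=lm.device, dtype=kp.dtype)
        kp = 2.0 * kp / scale - 1.0
        return {'fg_kp': kp}
```

The Dense Motion Network, Inpainting Network (and BG Motion Predictor, if used) would then be retrained against these fixed keypoint groups, as described above.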

simasima121 commented 1 year ago

> Hi, congratulations on such a wonderful research project, and thank you for making it open source. I have a few questions and would appreciate your feedback on possible approaches.
>
> 1. Is it possible to reuse some of the computations? My use case involves running the model on a few videos that differ only in their lip movements; all other parts of the face move identically. Is there a way to pre-compute these shared facial movements and only compute the mouth movements on any given run?
>
> 2. Is it possible to amplify the lip movement? Currently the movement scale is applied to the entire image; is there a way to apply it only to the mouth area? I tried isolating the keypoints that track the mouth, but they sometimes switch to tracking something other than the mouth, which distorts it. Looking forward to hearing your thoughts on this.
>
> Thanks!

Awesome idea - did you figure this out?

Chromer163 commented 1 year ago

mark

ak01user commented 1 year ago

mark