zhangchenxu528 / FACIAL

FACIAL: Synthesizing Dynamic Talking Face With Implicit Attribute Learning. ICCV, 2021.
GNU Affero General Public License v3.0

Talking Head Avatar: Optimum Script for Training #66

Open geek0075 opened 2 years ago

geek0075 commented 2 years ago

Hi All, Hi Professor @zhangchenxu528,

I have been working with this repository for over a month now. I raised an issue (https://github.com/zhangchenxu528/FACIAL/issues/61), which was successfully resolved, so I am now able to create talking heads using this repository.

However, even after successfully creating a talking head, there are several other issues to consider. First, some context on how the pipeline works:

  1. The FACIAL network is fine-tuned on a video of a person (whom I refer to here as an Avatar) reading some script; their lip movements, head poses, and eye blinks are what the network learns from.
  2. After the FACIAL network has been fine-tuned on the Avatar's video, we use it at inference time to make the talking-head avatar read scripts other than the one he/she read during training.

In accordance with the above, there exist the following scripts (text content to be read):

  1. Training script. This is the script read by the avatar to create the training video.
  2. Test script. This is the script that we make the avatar 'speak' after training and during inference.

My first question related to deploying a useful FACIAL network is this:

What is an optimal Training script to have the training avatar read, such that the trained avatar can then successfully read any possible Test script? I have seen that several shops have their avatar actors read pangrams (https://en.wikipedia.org/wiki/Pangram). Is this the way to go?

I am hoping to get some direction on this issue from Professor @zhangchenxu528...

Kind Regards & Thanks.

Kay.

flipkast commented 2 years ago

Hi @geek0075, we have a lot of actors who are willing to be used as avatars. If you have managed to make FACIAL work successfully, let me know. We can work together and test multiple scripts until we get optimal results with FACIAL.

geek0075 commented 2 years ago

Hi @flipkast. Yes, I made FACIAL work. What do you have in mind? Thanks...

flipkast commented 1 year ago

Hi @geek0075 you can pm me at niklentis@gmail.com to discuss.

geek0075 commented 1 year ago

Hi @flipkast. Done....

NikitaKononov commented 1 year ago

> Hi @flipkast. Done....

Hello, can you please share your results (generated video samples)? My results are kind of bad; the eye behaviour is very strange.

geek0075 commented 1 year ago

Carlson doing Obama Speech https://drive.google.com/file/d/1wh785MZ0W_3Ny5nOgTeVOf0uGH2gjQNV/view?usp=sharing

Carlson doing some speech https://drive.google.com/file/d/1Q0jf1a2MsWfInMHfL0luAVaRuAakHm4m/view?usp=sharing

geek0075 commented 1 year ago

I trained a Carlson model from a YouTube video of Mr Carlson. I made him deliver the Obama test speech available in the original FACIAL repository, and also some other speech. Both results are given above... Let me hear your views. Cheers.

NikitaKononov commented 1 year ago

> I trained a Carlson model from a YouTube video of Mr Carlson. I made him deliver the Obama test speech available in the original FACIAL repository, and also some other speech. Both results are given above... Let me hear your views. Cheers.

Thanks for sharing your results! The video is pixelated; I think you should train the face2video model for more iterations. Also, he doesn't blink at all. Why? Did you remove the OpenFace features?

geek0075 commented 1 year ago

Yes, the video is pixelated because I increased the size during preprocessing; otherwise it is good. The output talking head's quality depends a lot on the quality and size of the input training video. I preprocessed my input video by cropping out a section and resizing it to 512x512. That's why it's pixelated...

geek0075 commented 1 year ago

Ideally you would capture the input video as a square at high pixel resolution, focused on the face only. The input video for that Carlson example was no such thing ;-). I had to crop a square section out of it and then blow it up to a 512x512 pixel resolution...
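For reference, here is a minimal sketch of that crop-and-upscale preprocessing with OpenCV. The input path and crop box are placeholders you would adjust per video; this is just one way to do it, not the repo's own preprocessing script:

```python
import cv2

CROP = (600, 80, 320, 320)   # hypothetical (x, y, w, h) box around the head
TARGET = 512                 # the training resolution I used

cap = cv2.VideoCapture("carlson_raw.mp4")            # placeholder input
fps = cap.get(cv2.CAP_PROP_FPS)
out = cv2.VideoWriter("carlson_512.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"),
                      fps, (TARGET, TARGET))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    x, y, w, h = CROP
    head = frame[y:y + h, x:x + w]                   # square crop, head only
    head = cv2.resize(head, (TARGET, TARGET),
                      interpolation=cv2.INTER_CUBIC) # upscaling cannot add detail
    out.write(head)

cap.release()
out.release()
```

The resize step is exactly where the pixelation comes from: a small crop has no more detail to give, no matter the interpolation.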

geek0075 commented 1 year ago

No, I did not remove the OpenFace features. The talking head is like the real Carlson training video: there are no blinks there either. I will search my hard disk for the original training video and share it...

geek0075 commented 1 year ago

Here's a Biden talking head:

https://drive.google.com/file/d/1fg90JjsEyJM5yhr4zIKVEw_ShCehea1R/view?usp=sharing

You can see this is even lower quality than the Carlson one. This is because the input is also low quality...

geek0075 commented 1 year ago

Here's the original Carlson training video from the YouTube extract:

https://drive.google.com/file/d/1a5wvr9ThOG_SGPRrs8yjoSKTJU9RfSFj/view?usp=sharing

I had to crop it to just the Carlson head, at a resolution of roughly 80x80 (capturing only the head, from the neck up), then blow that up to 512x512, more than a 6x upscale:

https://drive.google.com/file/d/1FnvnvPk7mFC0jSGwPd1etyoIkr8DQaCJ/view?usp=sharing

That is the first frame of the training video. You can see that it is pixelated, just like the output is ;-).

geek0075 commented 1 year ago

@NikitaKononov It's about the quality of the input training video! I used videos extracted from YouTube, which I then preprocessed to bring them to the size and format that FACIAL trains on. YouTube videos are not professionally shot for creating talking head videos ;-).

To build a commercial, or even non-commercial, product out of this, one should shoot the input videos specifically for making talking head videos, like the original training video that comes with the FACIAL repo (you know, the one of the lady talking)...

Good luck in your efforts.

geek0075 commented 1 year ago

I am not really worried about the pixelation; I understand how and why it came about. All I have to do is fine-tune with a non-pixelated training video and I will get a non-pixelated talking head. I am 100% sure of this...

NikitaKononov commented 1 year ago

> I am not really worried about the pixelation; I understand how and why it came about. All I have to do is fine-tune with a non-pixelated training video and I will get a non-pixelated talking head. I am 100% sure of this...

Oh yeah, I see. Thanks for your advice, really helpful and interesting! Yes, the videos need to be fairly specific to this task.

The idea of FACIAL is great, but I think some things could be improved to boost quality a lot.

I am thinking about replacing the pix2pix model (face2video) with some modern generative model, or scaling up the input/output resolution of the current pix2pix model.

I also think that DeepSpeech isn't so good at extracting features; maybe the audio feature mechanism from wav2lip would be better. We would then be able to train a universal audio2face model on a large dataset, and afterwards train only face2video for any given person.

I have big plans for research and development on this repo.
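For context, wav2lip conditions on mel spectrograms rather than DeepSpeech logits. Here is a minimal sketch of that kind of feature extraction with librosa; the parameter values are typical ones I am assuming, not taken from either repo:

```python
import librosa
import numpy as np

def mel_features(wav_path, sr=16000, n_fft=800, hop=200, n_mels=80):
    """Return an (n_mels, T) log-mel spectrogram, wav2lip-style."""
    wav, _ = librosa.load(wav_path, sr=sr)           # load and resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    return np.log(np.clip(mel, 1e-5, None))          # log-compress; floor avoids -inf

# e.g. mel_features("speech.wav").shape -> (80, num_audio_frames)
```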

NikitaKononov commented 1 year ago

> I am not really worried about the pixelation; I understand how and why it came about. All I have to do is fine-tune with a non-pixelated training video and I will get a non-pixelated talking head. I am 100% sure of this...

Have you made any improvements to this repo, or do you use the code as-is? Maybe you have some advice on choosing hyperparameters for the models in this work, or something else. It would be great if you would be so kind as to share some tips. Thanks!

geek0075 commented 1 year ago

Hey @NikitaKononov, sorry for the late response. I was looking at pix2pix and researching whether a better image-to-image model exists. I have not found anything yet. I notice that FACIAL uses Nvidia's pix2pixHD:

https://catalog.ngc.nvidia.com/orgs/nvidia/models/pix2pixhd https://github.com/NVIDIA/pix2pixHD

StyleGAN is state of the art (also created by Nvidia), and I found an image-to-image model built on StyleGAN, pixel2style2pixel (pSp):

https://arxiv.org/abs/2008.00951

My research into FACIAL is just 2 months old ;-). Modifying and improving the model should take much, much longer than 2 months. In 2 months I was able to learn to use the code as-is properly. Improvements require experimentation, which takes time...

The version of DeepSpeech used in FACIAL is old. You might simply start by swapping in a retrained, more recent version of DeepSpeech and using it with the current FACIAL pipeline. I had considered doing that but stopped due to a lack of infrastructure...

I like your enthusiasm on the FACIAL project. Please feel free to keep me abreast of your research...Cheers.

Mxwgreat commented 1 year ago

Hi @geek0075! I saw the results of your reproduction of FACIAL, and you did a very good job. I am also very interested in FACIAL, but I have encountered some problems while running it. For example, there are no test_1.avi, test_1_audio.avi, or test_1_audio.mp4 files under the path ../examples/test_image/obama2/. The corresponding code is:

video_new = '../examples/test_image/obama2/test_1.avi'
output = '../examples/test_image/obama2/test_1_audio.avi'
output_mp4 = '../examples/test_image/obama2/test_1_audio.mp4'

So I would like to ask whether you have encountered similar problems. I am looking forward to your reply. Thanks!
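My guess is that those are output paths the inference script writes to rather than inputs it needs, so perhaps the obama2 folder just has to exist before running. Something like the following, though I am not sure this is the actual cause:

```python
import os

# Create the output folder the paths above point into, if it is missing.
os.makedirs('../examples/test_image/obama2', exist_ok=True)
```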

mendynew commented 1 year ago

@geek0075 Good job! How did you make the training code work? Which version of TensorFlow did you use? I have encountered some problems trying to train this model; would you mind helping me with issue #81? Thank you very much!

geek0075 commented 1 year ago

Hi. I have been busy. I will review and respond to your comments. Thanks.

Mxwgreat commented 1 year ago

@geek0075 Hi, I am looking forward to your reply. Thanks

geek0075 commented 1 year ago

README.md

@Mxwgreat Again, accept my apologies for my late response. The attached is a sort of guide I wrote a while ago when I first started with FACIAL ;-). I have outgrown it, but I am sure it will still be useful to you. Please have a read through every section and let me know if it addresses your issues...

Cheers.

Mxwgreat commented 1 year ago

@geek0075 Thanks for your sharing! I will read it carefully.

geek0075 commented 1 year ago

@Mxwgreat You're welcome. Like I said, I created it at the start of my journey with FACIAL. If you read the issues I raised, you can see the problems I encountered and how they were solved. I also gained many great insights into creating talking heads with FACIAL, but then I dropped it and am now focused on other development efforts...

https://github.com/zhangchenxu528/FACIAL/issues/61 https://github.com/zhangchenxu528/FACIAL/issues/66 https://github.com/NVlabs/nvdiffrast/issues/77

Many small issues will also crop up along the way. You can check with me and I will let you know how I eventually solved them...

Cheers.

amzzz2020 commented 1 year ago

Hi @geek0075! I saw all your issues. I used deep3d_pytorch to generate lots of .mat files; how do I use them to replace the example video_preprocess data? Your guide did not say how to change this.

geek0075 commented 1 year ago

Hi @amzzz2020. Let me get back to you ASAP. Cheers.

geek0075 commented 1 year ago

@amzzz2020 Please have a look here: https://colab.research.google.com/drive/1Z1tFPFf-O_HpaxshTqKM24TC_rrjR7Xc?usp=sharing

geek0075 commented 1 year ago

@amzzz2020 I discussed this issue here:

https://github.com/zhangchenxu528/FACIAL/issues/61

If you look at file:

https://github.com/zhangchenxu528/FACIAL/blob/main/face_render/handle_netface.py

You will see that the files (*.mat) are loaded from '/content/FACIAL/video_preprocess/train1_deep3Dface'. So you can either place them there, or place them elsewhere and point to them with the '--param_folder' parameter.
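A minimal sketch of staging the files that way; the source folder name is a placeholder for wherever deep3d_pytorch wrote its output:

```python
import glob
import os
import shutil

SRC = "/content/deep3d_output"  # placeholder: wherever your .mat files landed
DST = "/content/FACIAL/video_preprocess/train1_deep3Dface"  # folder handle_netface.py reads by default

os.makedirs(DST, exist_ok=True)
for mat in glob.glob(os.path.join(SRC, "*.mat")):
    shutil.copy(mat, DST)  # copy each frame's fitted parameters into place
```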

Let me know how this works out for you...

Cheers.

amzzz2020 commented 1 year ago

@geek0075 Thank you for your reply. Following the previous issue, I have solved the problem, but the generated result is very poor.

geek0075 commented 1 year ago

Hi @amzzz2020. Please share your result here. Thanks.

Mxwgreat commented 1 year ago

Hi @geek0075. The file eyemask.npy in the face_render folder is loaded directly by render_netface_fitpose.py and rendering_gaosi.py, for example: mask3 = np.load('eyemask.npy'). I would be interested to know how eyemask.npy is generated, since in this project it is simply provided ready-made in the face_render folder. Have you thought about this? I am looking forward to your reply. Thanks!

geek0075 commented 1 year ago

@Mxwgreat Sorry for my late response. I have not explored as far as the eyemask. I have been up to many different things since exploring this repository. I wish I had the resources to keep on exploring...
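If I were to start, I would first just inspect the file. This assumes nothing beyond it being a NumPy array; what it actually encodes would have to be inferred from the render code that consumes it:

```python
import numpy as np

mask = np.load("eyemask.npy")  # run from the face_render folder
print(mask.shape, mask.dtype, mask.min(), mask.max())
# A length matching the 3DMM vertex count would suggest a per-vertex eye mask;
# an image-like shape would suggest a 2D mask applied to rendered frames.
```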

Mxwgreat commented 1 year ago

Thanks for your reply. I will continue to explore!