We provide PyTorch implementations for our arXiv paper "Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose" (http://arxiv.org/abs/2002.10137) and our IEEE TMM paper "Predicting Personalized Head Movement From Short Video and Speech Signal" (https://ieeexplore.ieee.org/document/9894719).
Note that this code is protected under patent. It is for research purposes only at universities and research institutions. If you are interested in business or for-profit uses, please contact Prof. Liu (the corresponding author, email: liuyongjin@tsinghua.edu.cn).
We provide a demo video here (please search for "Talking Face" on this page and click the "demo video" button).
Install the Python dependencies:
pip install -r requirements.txt
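If you want to keep these packages isolated, a standard virtual environment works; this is general Python practice, not something the repo requires:

# optional: install into an isolated environment (general Python
# practice, not a repo requirement; the env name is arbitrary)
python -m venv talkingface-env
. talkingface-env/bin/activate
pip install -r requirements.txt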
Download 01_MorphableModel.mat and put it into the Deep3DFaceReconstruction/BFM folder.
Download Exp_Pca.bin and put it into the Deep3DFaceReconstruction/BFM folder.
Note: You can convert a video to 25 fps with:
ffmpeg -i xxx.mp4 -r 25 xxx1.mp4
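If you have several clips, a small shell loop can check the frame rate with ffprobe and convert only when needed; this is a hypothetical helper, not part of the repo:

# hypothetical helper: convert each .mp4 in the current folder to
# 25 fps unless ffprobe reports it is already 25/1
for f in *.mp4; do
  fps=$(ffprobe -v error -select_streams v:0 \
    -show_entries stream=r_frame_rate -of default=noprint_wrappers=1:nokey=1 "$f")
  if [ "$fps" != "25/1" ]; then
    ffmpeg -i "$f" -r 25 "${f%.mp4}_25fps.mp4"
  fi
done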
Extract frames from the training video:
cd Data/
python extract_frame1.py [person_id].mp4
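extract_frame1.py does the extraction for you; if you ever need to reproduce the step manually, a roughly equivalent ffmpeg call looks like the sketch below. The output folder and frame naming are assumptions, so the script's actual layout may differ:

# assumed layout: one folder per person, frames numbered from 0;
# check what extract_frame1.py writes before relying on this
mkdir -p [person_id]
ffmpeg -i [person_id].mp4 -start_number 0 [person_id]/%05d.png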
Compile the tf_mesh_renderer kernels under Deep3DFaceReconstruction/tf_mesh_renderer/mesh_renderer/kernels to a .so library, following its README, and modify line 28 in rasterize_triangles.py to point to your directory (a typical build command is sketched at the end of this step). Then run
cd Deep3DFaceReconstruction/
CUDA_VISIBLE_DEVICES=0 python demo_19news.py ../Data/[person_id]
This process takes about 2 minutes on a Titan Xp.
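Returning to the kernel compilation referenced above: building a TensorFlow custom op into a .so usually follows the pattern below. The source file names here are assumptions, so defer to the tf_mesh_renderer README for the authoritative steps:

# sketch of a typical TensorFlow custom-op build (file names are
# assumptions; follow the tf_mesh_renderer README for the real ones)
TF_CFLAGS=$(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))')
TF_LFLAGS=$(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))')
g++ -std=c++11 -shared rasterize_triangles_*.cc \
  -o rasterize_triangles_kernel.so -fPIC $TF_CFLAGS $TF_LFLAGS -O2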
Fine-tune the audio network for the target person:
cd Audio/code/
python train_19news_1.py [person_id] [gpu_id]
The saved models are in Audio/model/atcnet_pose0_con3/[person_id].
This process takes about 5 minutes on a Titan Xp.
Fine-tune the render-to-video network for the target person:
cd render-to-video/
python train_19news_1.py [person_id] [gpu_id]
The saved models are in render-to-video/checkpoints/memory_seq_p2p/[person_id].
This process takes about 40 minutes on a Titan Xp.
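When fine-tuning several subjects, the two training stages above can be chained in a small wrapper script; this is a hypothetical convenience that only repeats the commands already shown:

# hypothetical wrapper, run from the repo root: fine-tune both
# networks for one subject using the commands above
pid=$1
gpu=$2
(cd Audio/code/ && python train_19news_1.py "$pid" "$gpu")
(cd render-to-video/ && python train_19news_1.py "$pid" "$gpu")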
Place the test audio file (.wav or .mp3) under Audio/audio/.
Run [with generated poses]:
cd Audio/code/
python test_personalized.py [audio] [person_id] [gpu_id]
or run [with poses from a short video]:
cd Audio/code/
python test_personalized2.py [audio] [person_id] [gpu_id]
The program prints 'saved to xxx.mov' when the videos are generated successfully. It outputs two .mov files: one with the face only (_full9.mov) and one composited onto the background (_transbigbg.mov).
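To confirm that both files were written, you can print their durations with ffprobe (which ships with ffmpeg); "xxx" stands for the result name printed above:

# sanity check: print the duration (in seconds) of both outputs
for v in xxx_full9.mov xxx_transbigbg.mov; do
  echo "$v: $(ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$v")"
done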
A Colab demo is available here.
If you use this code for your research, please cite our papers:
@article{yi2020audio,
  title = {Audio-driven talking face video generation with learning-based personalized head pose},
  author = {Yi, Ran and Ye, Zipeng and Zhang, Juyong and Bao, Hujun and Liu, Yong-Jin},
  journal = {arXiv preprint arXiv:2002.10137},
  year = {2020}
}
@article{YiYSZZWBL22,
  title = {Predicting Personalized Head Movement From Short Video and Speech Signal},
  author = {Yi, Ran and Ye, Zipeng and Sun, Zhiyao and Zhang, Juyong and Zhang, Guoxin and Wan, Pengfei and Bao, Hujun and Liu, Yong-Jin},
  journal = {IEEE Transactions on Multimedia},
  pages = {1-13},
  year = {2022},
  doi = {10.1109/TMM.2022.3207606}
}
The face reconstruction code is from Deep3DFaceReconstruction, the ArcFace code is from insightface, and the GAN code is developed based on pytorch-CycleGAN-and-pix2pix.