A repository for generating stylized talking 3D faces and 2D videos. This is the repository for the paper Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis (ACM MM 2021). The demo video can be viewed at this link: https://hcsi.cs.tsinghua.edu.cn/demo/MM21-HAOZHEWU.mp4.
conda create -n python36 python=3.6
conda activate python36
pip install -r requirements.txt
Place the pretrained models in the following folders:
./deepspeech
./audio2motion/model
./render/model
To run our demo, you need at least one GPU with 11 GB of memory.
python demo.py --in_img [*.png] --in_audio [*.wav] --output_path [path]
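For example (the image and audio paths below are hypothetical placeholders):

```bash
# paths are hypothetical; substitute your own portrait image and speech audio
python demo.py --in_img ./example/portrait.png --in_audio ./example/speech.wav --output_path ./results
```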
We provide 10 example talking styles in style.npy. You can also calculate your own style codes with the following code, where exp is the sequence of 3DMM expression parameters and pose is the pose matrix reconstructed by Deep 3D Face Reconstruction. We usually calculate style codes from videos of 5-20 seconds.
```python
import pickle as pkl
import numpy as np

def get_style_code(exp, pose):
    # Per-dimension mean/std statistics pre-computed on the Ted-HD dataset.
    exp_mean_std = pkl.load(open("./data/ted_hd/exp_mean_std.pkl", 'rb'))
    exp_std_mean = exp_mean_std['s_m']
    exp_std_std = exp_mean_std['s_s']
    exp_diff_std_mean = exp_mean_std['d_s_m']
    exp_diff_std_std = exp_mean_std['d_s_s']
    pose_mean_std = pkl.load(open("./data/ted_hd/pose_mean_std.pkl", 'rb'))
    pose_diff_std_mean = pose_mean_std['d_s_m']
    pose_diff_std_std = pose_mean_std['d_s_s']

    # Frame-to-frame differences capture the motion dynamics.
    diff_exp = exp[:-1, :] - exp[1:, :]
    exp_std = (np.std(exp, axis=0) - exp_std_mean) / exp_std_std
    diff_exp_std = (np.std(diff_exp, axis=0) - exp_diff_std_mean) / exp_diff_std_std
    diff_pose = pose[:-1, :] - pose[1:, :]
    diff_pose_std = (np.std(diff_pose, axis=0) - pose_diff_std_mean) / pose_diff_std_std

    # The style code concatenates the normalized expression variance,
    # expression dynamics variance, and pose dynamics variance.
    return np.concatenate((exp_std, diff_exp_std, diff_pose_std))
```
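A hypothetical usage sketch (the .npy file names are placeholders; we assume exp and pose are per-frame arrays reconstructed by Deep 3D Face Reconstruction for one 5-20 second clip):

```python
import numpy as np

# Hypothetical inputs reconstructed by Deep 3D Face Reconstruction for one clip.
exp = np.load("my_clip_exp.npy")    # e.g. shape (N, 64) expression coefficients
pose = np.load("my_clip_pose.npy")  # e.g. shape (N, pose_dim) pose parameters
style_code = get_style_code(exp, pose)
np.save("my_style.npy", style_code)  # save for later use, alongside the provided style.npy
```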
Note that the pose of each talking face is static in the current demo; you can control the pose of the face by modifying the coeff_array at line 93 of demo.py. The coeff_array has shape $N \times 257$, where $N$ is the number of frames. Each 257-dimensional vector follows the same definition as Deep 3D Face Reconstruction: dimensions 254-257 control the translation and dimensions 224-227 control the Euler angles of the pose.
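As a minimal sketch of how the pose could be animated this way (add_head_sway is a hypothetical helper, not part of demo.py):

```python
import numpy as np

def add_head_sway(coeff_array, amplitude=0.05, period=50):
    """Add a slow sinusoidal head motion to an (N, 257) coefficient array.

    Dims 224-227 hold the Euler angles and dims 254-257 the translation,
    following the Deep 3D Face Reconstruction layout described above.
    """
    coeff_array = coeff_array.copy()
    t = np.arange(coeff_array.shape[0])
    # Perturb the first Euler-angle dimension with a sine wave (in radians).
    coeff_array[:, 224] += amplitude * np.sin(2 * np.pi * t / period)
    return coeff_array
```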
Our project organizes the files as follows:
├── README.md
├── data_process
├── deepspeech
├── face_alignment
├── deep_3drecon
├── render
├── audio2motion
The data_process folder contains the processing code for several datasets.
We leverage the DeepSpeech project to extract audio-related features. Please download the pretrained DeepSpeech model from the Link. In deepspeech/evaluate.py, we implement the function get_prob, which extracts the latent DeepSpeech features given an input audio path. The latent DeepSpeech features are produced at 50 frames per second, so they must be aligned to the 25 fps videos in the subsequent steps.
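A minimal sketch of one possible alignment (an assumption, not necessarily the repository's exact implementation): average each pair of consecutive 50 fps feature frames to obtain one feature per 25 fps video frame.

```python
import numpy as np

def align_to_25fps(ds_features):
    """Downsample (T, D) DeepSpeech features from 50 fps to 25 fps by
    averaging consecutive pairs of frames; a trailing odd frame is dropped."""
    T = ds_features.shape[0] // 2 * 2
    paired = ds_features[:T].reshape(-1, 2, ds_features.shape[1])
    return paired.mean(axis=1)  # shape (T // 2, D), one row per video frame
```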
We modify Face Alignment for data preprocessing. Unlike the original project, we restrict face alignment to detect only the largest face in each frame to speed up processing.
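Our change lives inside the bundled face_alignment code; purely as an illustration, a similar "largest face only" behavior can be approximated from user code like this (LandmarksType._2D is the enum name in older face_alignment releases):

```python
import face_alignment

fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device='cuda')

def largest_face_landmarks(image):
    """Return the 68-point landmarks of the largest detected face, or None."""
    all_landmarks = fa.get_landmarks(image)  # one (68, 2) array per detected face
    if not all_landmarks:
        return None
    def box_area(lm):
        return (lm[:, 0].max() - lm[:, 0].min()) * (lm[:, 1].max() - lm[:, 1].min())
    return max(all_landmarks, key=box_area)
```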
We modify Deep 3D Face Reconstruction for data preprocessing. We add a batch API, a UV-texture unwrapping API, and a UV coordinate image generation API in deep_3drecon/utils.py.
We implement our texture encoder and rendering model in the render folder. We also implement some other renderers, such as Neural Voice Puppetry.
We implement our stylized audio-to-facial-motion model in the audio2motion folder.
We leverage lmdb to store the fragmented data. Download the data from the link, then run cat xa* > data.mdb to merge the fragments. You can obtain the train/test videos with the code below. We use the Ted-HD data to train the audio2motion model. We also provide the reconstructed 3D parameters and landmarks in the lmdb.
```python
import lmdb

def test():
    lmdb_path = "./lmdb"
    env = lmdb.open(lmdb_path, map_size=1099511627776, max_dbs=64)

    # Named sub-databases for the train/test splits.
    train_video = env.open_db("train_video".encode())
    train_audio = env.open_db("train_audio".encode())
    train_lm5 = env.open_db("train_lm5".encode())
    test_video = env.open_db("test_video".encode())
    test_audio = env.open_db("test_audio".encode())
    test_lm5 = env.open_db("test_lm5".encode())

    with env.begin(write=False) as txn:
        # Fetch the first test clip; keys are stringified integer indices.
        video = txn.get(str(0).encode(), db=test_video)
        audio = txn.get(str(0).encode(), db=test_audio)
        video_file = open("test.mp4", "wb")
        audio_file = open("test.wav", "wb")
        video_file.write(video)
        audio_file.write(audio)
        video_file.close()
        audio_file.close()
        print(txn.stat(db=train_video))
        print(txn.stat(db=test_video))  # we can obtain the database size here
```
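As a possible extension of the snippet above (assuming the keys are the stringified integers 0..N-1, as with key "0" above), every test clip could be dumped like this:

```python
import lmdb

# Assumption: keys are stringified integers 0..N-1, as in the snippet above.
env = lmdb.open("./lmdb", map_size=1099511627776, max_dbs=64)
test_video = env.open_db("test_video".encode())
test_audio = env.open_db("test_audio".encode())

with env.begin(write=False) as txn:
    num_test = txn.stat(db=test_video)["entries"]
    for i in range(num_test):
        with open("test_{}.mp4".format(i), "wb") as vf:
            vf.write(txn.get(str(i).encode(), db=test_video))
        with open("test_{}.wav".format(i), "wb") as af:
            af.write(txn.get(str(i).encode(), db=test_audio))
```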
For the training of the renderer, we do not provide the processed dataset due to the license of LRW.
@inproceedings{wu2021imitating,
title={Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis},
author={Wu, Haozhe and Jia, Jia and Wang, Haoyu and Dou, Yishun and Duan, Chao and Deng, Qingshan},
booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
pages={1478--1486},
year={2021}
}