wuhaozhe / style_avatar

style_avatar

A repository for generating stylized talking 3D faces and 2D videos. This is the repository for the paper Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis (MM 2021). The demo video is available at: https://hcsi.cs.tsinghua.edu.cn/demo/MM21-HAOZHEWU.mp4

framework


Quick start

Installation

conda create -n python36 python=3.6 
conda activate python36

Run

To run our demo, you need at least one GPU with 11 GB of memory.

python demo.py --in_img [*.png] --in_audio [*.wav] --output_path [path]

We provide 10 example talking styles in style.npy. You can also calculate your own style codes with the following code, where exp is the 3DMM expression series and pose is the pose matrix reconstructed from Deep 3D Face Reconstruction. We usually calculate style codes from videos of 5-20 seconds.

import pickle as pkl

import numpy as np

def get_style_code(exp, pose):
  # Dataset-level statistics used to normalize the per-video standard deviations
  with open("./data/ted_hd/exp_mean_std.pkl", 'rb') as f:
    exp_mean_std = pkl.load(f)
  exp_std_mean = exp_mean_std['s_m']
  exp_std_std = exp_mean_std['s_s']
  exp_diff_std_mean = exp_mean_std['d_s_m']
  exp_diff_std_std = exp_mean_std['d_s_s']

  with open("./data/ted_hd/pose_mean_std.pkl", 'rb') as f:
    pose_mean_std = pkl.load(f)
  pose_diff_std_mean = pose_mean_std['d_s_m']
  pose_diff_std_std = pose_mean_std['d_s_s']

  # Frame-to-frame differences capture how fast the expression changes
  diff_exp = exp[:-1, :] - exp[1:, :]
  exp_std = (np.std(exp, axis=0) - exp_std_mean) / exp_std_std
  diff_exp_std = (np.std(diff_exp, axis=0) - exp_diff_std_mean) / exp_diff_std_std

  # Only the frame-to-frame pose differences are used for the pose part
  diff_pose = pose[:-1, :] - pose[1:, :]
  diff_pose_std = (np.std(diff_pose, axis=0) - pose_diff_std_mean) / pose_diff_std_std

  # Style code = [expression std, expression-difference std, pose-difference std]
  return np.concatenate((exp_std, diff_exp_std, diff_pose_std))
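
To make the shape of the resulting style code concrete, here is a minimal sketch of the same computation that takes the normalization statistics as arguments instead of loading them from the .pkl files. The function name style_code_from_stats, the synthetic inputs, the identity statistics, and the pose dimension of 6 are all assumptions made for illustration; only the 64-dimensional expression vector follows Deep 3D Face Reconstruction.

```python
import numpy as np

def style_code_from_stats(exp, pose, exp_stats, pose_stats):
    """Same arithmetic as get_style_code, with the statistics passed in.

    exp:  (T, exp_dim) expression coefficients over T frames
    pose: (T, pose_dim) pose parameters over T frames
    """
    diff_exp = exp[:-1] - exp[1:]
    diff_pose = pose[:-1] - pose[1:]
    exp_std = (np.std(exp, axis=0) - exp_stats["s_m"]) / exp_stats["s_s"]
    diff_exp_std = (np.std(diff_exp, axis=0) - exp_stats["d_s_m"]) / exp_stats["d_s_s"]
    diff_pose_std = (np.std(diff_pose, axis=0) - pose_stats["d_s_m"]) / pose_stats["d_s_s"]
    return np.concatenate((exp_std, diff_exp_std, diff_pose_std))

# Synthetic stand-ins: 10 seconds at 25 fps. pose_dim is a placeholder value.
rng = np.random.default_rng(0)
num_frames, exp_dim, pose_dim = 250, 64, 6
exp = rng.normal(size=(num_frames, exp_dim))
pose = rng.normal(size=(num_frames, pose_dim))

# Identity statistics (mean 0, std 1) so the example needs no data files.
exp_stats = {"s_m": np.zeros(exp_dim), "s_s": np.ones(exp_dim),
             "d_s_m": np.zeros(exp_dim), "d_s_s": np.ones(exp_dim)}
pose_stats = {"d_s_m": np.zeros(pose_dim), "d_s_s": np.ones(pose_dim)}

style_code = style_code_from_stats(exp, pose, exp_stats, pose_stats)
print(style_code.shape)  # length is 2 * exp_dim + pose_dim
```

The style code is a single vector per video: its length depends only on the coefficient dimensions, not on the number of frames.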

Note that the pose of each talking face is static in the current demo. You can control the face pose by modifying coeff_array at line 93 of demo.py. coeff_array has shape $N \times 257$, where $N$ is the number of frames. Each $257$-dimensional vector follows the same definition as in Deep 3D Face Reconstruction: dimensions $254$-$257$ control the translation, and dimensions $224$-$227$ control the Euler angles of the pose.
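
As a rough illustration of editing the pose, the sketch below overwrites the Euler-angle and translation entries of a coefficient array. It assumes the ranges above are 0-based, half-open slices (224:227 and 254:257), that the angles are in radians, and that the second angle is yaw; none of this is confirmed by this README. set_head_pose is a hypothetical helper, not a function in demo.py.

```python
import numpy as np

# Assumed layout: 0-based, half-open slices into the 257-dim coefficient vector.
ANGLE_SLICE = slice(224, 227)        # Euler angles
TRANSLATION_SLICE = slice(254, 257)  # translation

def set_head_pose(coeff_array, angles, translation=None):
    """Return a copy of coeff_array (N, 257) with the pose entries replaced.

    angles:      (N, 3) Euler angles, one row per frame
    translation: optional (N, 3) translation, one row per frame
    """
    out = np.array(coeff_array, dtype=np.float32, copy=True)
    out[:, ANGLE_SLICE] = angles
    if translation is not None:
        out[:, TRANSLATION_SLICE] = translation
    return out

# Stand-in for the coeff_array built in demo.py.
num_frames = 100
coeff_array = np.zeros((num_frames, 257), dtype=np.float32)

# Slowly sweep one angle over the clip (column 1 is assumed to be yaw).
yaw = np.linspace(-0.2, 0.2, num_frames).astype(np.float32)
angles = np.zeros((num_frames, 3), dtype=np.float32)
angles[:, 1] = yaw

posed = set_head_pose(coeff_array, angles)
```

Working on a copy keeps the original coefficients available, which makes it easy to compare the static and the animated pose.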


Project Overview

Our project organizes the files as follows:

├── README.md
├── data_process
├── deepspeech
├── face_alignment
├── deep_3drecon
├── render
└── audio2motion

Data process

The data_process folder contains the processing code for several datasets.

DeepSpeech

We leverage the DeepSpeech project to extract audio-related features. Please download the pretrained DeepSpeech model from the link. In deepspeech/evaluate.py, we implement the function get_prob, which takes an input audio path and returns the latent DeepSpeech features. These features have 50 frames per second, so they must be aligned to the 25 fps videos in subsequent steps.
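
One simple way to align 50 fps features to 25 fps video is to group every two consecutive audio frames into one video frame. The sketch below does this by concatenation; it is an illustration only and not necessarily how this repository performs the alignment. align_to_video is a hypothetical helper, and the feature dimension of 29 is a placeholder.

```python
import numpy as np

def align_to_video(features, num_video_frames, ratio=2):
    """Group `ratio` consecutive audio frames per video frame.

    features: (T_audio, D) array sampled at ratio * video fps
    returns:  (num_video_frames, ratio * D) array
    """
    features = np.asarray(features)
    if features.ndim != 2 or len(features) == 0:
        raise ValueError("features must be a non-empty 2D array")
    needed = num_video_frames * ratio
    if len(features) < needed:
        # Pad by repeating the last audio frame when the audio is slightly short.
        pad = np.repeat(features[-1:], needed - len(features), axis=0)
        features = np.concatenate([features, pad], axis=0)
    features = features[:needed]  # drop surplus audio frames at the end
    return features.reshape(num_video_frames, ratio * features.shape[1])

# 7 audio frames at 50 fps for a 4-frame clip at 25 fps (one frame is padded).
features = np.arange(7 * 29, dtype=np.float32).reshape(7, 29)
aligned = align_to_video(features, num_video_frames=4)
print(aligned.shape)  # (4, 58)
```

Concatenation keeps all of the audio information for each video frame; averaging the two frames instead would halve the feature size at the cost of some temporal detail.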

Face Alignment

We modify Face Alignment for data preprocessing. Unlike the original project, we restrict face alignment to detect only the largest face in each frame, which speeds up processing.

Deep 3D Face Reconstruction

We modify Deep 3D Face Reconstruction for data preprocessing. We add a batch API, a UV-texture unwrapping API, and a UV coordinate image generation API in deep_3drecon/utils.py.

Render

We implement our texture encoder and rendering model in the render folder. We also implement some other renderers, such as Neural Voice Puppetry.

Audio to Motion

We implement our stylized audio-to-facial-motion model in the audio2motion folder.


Data

Ted-HD data

We use lmdb to store the fragmented data. Download the data from the link, then run cat xa* > data.mdb to merge the fragments. We use the Ted-HD data to train the audio2motion model. We also provide the reconstructed 3D parameters and landmarks in the lmdb. You can obtain the train/test videos with the code below.

import lmdb

def test():
    lmdb_path = "./lmdb"
    # map_size is the maximum size the database may grow to (1 TiB here)
    env = lmdb.open(lmdb_path, map_size=1099511627776, max_dbs=64)

    # Named sub-databases for videos, audio, and 5-point landmarks
    train_video = env.open_db("train_video".encode())
    train_audio = env.open_db("train_audio".encode())
    train_lm5 = env.open_db("train_lm5".encode())
    test_video = env.open_db("test_video".encode())
    test_audio = env.open_db("test_audio".encode())
    test_lm5 = env.open_db("test_lm5".encode())

    with env.begin(write=False) as txn:
        # Fetch the test sample stored under key "0" and write it to disk
        video = txn.get(str(0).encode(), db=test_video)
        audio = txn.get(str(0).encode(), db=test_audio)
        with open("test.mp4", "wb") as video_file:
            video_file.write(video)
        with open("test.wav", "wb") as audio_file:
            audio_file.write(audio)
        print(txn.stat(db=train_video))
        print(txn.stat(db=test_video))  # the "entries" field gives the database size

We do not provide the processed dataset used to train the renderer, due to the LRW license.


Citation

@inproceedings{wu2021imitating,
  title={Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis},
  author={Wu, Haozhe and Jia, Jia and Wang, Haoyu and Dou, Yishun and Duan, Chao and Deng, Qingshan},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  pages={1478--1486},
  year={2021}
}

Further works