This is the code for SAiD: Blendshape-based Audio-Driven Speech Animation with Diffusion.
Run the following command to install it as a pip module:
pip install .
If you are developing this repo or want to run the scripts, run instead:
pip install -e .[dev]
If there is an error related to pyrender, install additional packages as follows:
apt-get install libboost-dev libglfw3-dev libgles2-mesa-dev freeglut3-dev libosmesa6-dev libgl1-mesa-glx
- data: Data used for preprocessing and training.
- model: The weights of the VAE, which is used for the evaluation.
- blender-addon: A Blender add-on that visualizes the blendshape coefficients.
- script: Python scripts for preprocessing, training, inference, and evaluation.
- static: Resources for the project page.

You can download the pretrained weights of SAiD from the Hugging Face Repo.
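If you prefer to fetch the weights programmatically, here is a minimal sketch using huggingface_hub; the repository ID and filename are placeholders, not the actual values, so replace them with the ones from the Hugging Face Repo.

```python
# Minimal sketch (assumed workflow): download the pretrained SAiD weights
# from the Hugging Face Hub. The repo_id and filename are placeholders.
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="<huggingface_repo_id>",  # placeholder: the SAiD repo on Hugging Face
    filename="<SAiD_weights>.pth",    # placeholder: the weights file name
)
print(weights_path)  # local path; pass it as --weights_path below
```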
python script/inference.py \
--weights_path "<SAiD_weights>.pth" \
--audio_path "<input_audio>.wav" \
--output_path "<output_coeffs>.csv" \
[--init_sample_path "<input_init_sample>.csv"] \ # Required for editing
[--mask_path "<input_mask>.csv"] # Required for editing
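The output CSV stores the generated blendshape coefficients. As a rough sketch for inspecting it, assuming the file has a header row of blendshape names (as in data/coeffs_std.csv) and one row per animation frame:

```python
# Sketch: inspect the generated blendshape coefficients.
# Assumption: header row of blendshape names, one row per animation frame.
import pandas as pd

coeffs = pd.read_csv("<output_coeffs>.csv")
print(coeffs.shape)          # (num_frames, num_blendshapes)
print(list(coeffs.columns))  # blendshape names, e.g., 'jawForward'
print(coeffs.iloc[0])        # coefficients of the first frame
```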
Due to the license issue of VOCASET, we cannot distribute BlendVOCA directly. Instead, you can preprocess data/blendshape_residuals.pickle after constructing the BlendVOCA directory as follows for simple execution of the script.
├─ audio-driven-speech-animation-with-diffusion
│ ├─ ...
│ └─ script
└─ BlendVOCA
└─ templates
├─ ...
└─ FaceTalk_170915_00223_TA.ply
- templates: Download the template meshes from VOCASET.

Then run the following command:

python script/preprocess_blendvoca.py \
--blendshapes_out_dir "<output_blendshapes_dir>"
If you want to generate the blendshapes yourself, follow the instructions below.

- Reference blendshape meshes: data/ARKit_reference_blendshapes.zip.
- Head vertex indices used for cropping: data/FLAME_head_idx.txt. You can crop more indices and then restore them after finishing the construction process.
- Use data/ARKit_landmarks.txt and data/FLAME_head_landmarks.txt as marker vertices.
- Create blendshape_residuals.pickle, which contains the blendshape residuals in the following Python dictionary format. Refer to data/blendshape_residuals.pickle.
{
'FaceTalk_170731_00024_TA': {
'jawForward': <np.ndarray object with shape (V, 3)>,
...
},
...
}
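For reference, a minimal sketch for loading and inspecting this pickle file (the subject ID below is taken from the example above):

```python
# Sketch: load data/blendshape_residuals.pickle and inspect its contents.
# Each subject ID maps to a dict of {blendshape name: (V, 3) residual array}.
import pickle

with open("data/blendshape_residuals.pickle", "rb") as f:
    residuals = pickle.load(f)

subject = "FaceTalk_170731_00024_TA"
for name, delta in residuals[subject].items():
    print(name, delta.shape)  # e.g., jawForward (V, 3)
```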
You can simply unzip data/blendshape_coeffcients.zip.
If you want to generate the coefficients yourself, we recommend constructing the BlendVOCA directory as follows for simple execution of the script.
├─ audio-driven-speech-animation-with-diffusion
│ ├─ ...
│ └─ script
└─ BlendVOCA
├─ blendshapes_head
│ ├─ ...
│ └─ FaceTalk_170915_00223_TA
│ ├─ ...
│ └─ noseSneerRight.obj
├─ templates_head
│ ├─ ...
│ └─ FaceTalk_170915_00223_TA.obj
└─ unposedcleaneddata
├─ ...
└─ FaceTalk_170915_00223_TA
├─ ...
└─ sentence40
- blendshapes_head: Place the constructed blendshape meshes (head).
- templates_head: Place the template meshes (head).
- unposedcleaneddata: Download the mesh sequences (unposed cleaned data) from VOCASET.

Then, run the following command:
python script/optimize_blendshape_coeffs.py \
--blendshapes_coeffs_out_dir "<output_coeffs_dir>"
After generating the blendshape coefficients, create coeffs_std.csv, which contains the standard deviation of each coefficient. Refer to data/coeffs_std.csv.
jawForward,...
<std_jawForward>,...
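A minimal sketch for producing such a file from the generated coefficient CSVs; the glob path is an assumption based on the command above, so adapt it to your setup:

```python
# Sketch: compute the per-blendshape standard deviation over the generated
# coefficient CSVs and write it in the coeffs_std.csv format shown above.
# "<output_coeffs_dir>" is the directory passed to optimize_blendshape_coeffs.py.
import glob
import pandas as pd

files = glob.glob("<output_coeffs_dir>/**/*.csv", recursive=True)
all_coeffs = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
all_coeffs.std().to_frame().T.to_csv("coeffs_std.csv", index=False)
```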
We recommend constructing the BlendVOCA directory as follows for simple execution of the scripts.
├─ audio-driven-speech-animation-with-diffusion
│ ├─ ...
│ └─ script
└─ BlendVOCA
├─ audio
│ ├─ ...
│ └─ FaceTalk_170915_00223_TA
│ ├─ ...
│ └─ sentence40.wav
├─ blendshape_coeffs
│ ├─ ...
│ └─ FaceTalk_170915_00223_TA
│ ├─ ...
│ └─ sentence40.csv
├─ blendshapes_head
│ ├─ ...
│ └─ FaceTalk_170915_00223_TA
│ ├─ ...
│ └─ noseSneerRight.obj
└─ templates_head
├─ ...
└─ FaceTalk_170915_00223_TA.obj
- audio: Download the audio from VOCASET.
- blendshape_coeffs: Place the constructed blendshape coefficients.
- blendshapes_head: Place the constructed blendshape meshes (head).
- templates_head: Place the template meshes (head).

Train VAE
python script/train_vae.py \
--output_dir "<output_logs_dir>" \
[--coeffs_std_path "<coeffs_std>.txt"]
Train SAiD
python script/train.py \
--output_dir "<output_logs_dir>"
Generate SAiD outputs on the test speech data
python script/test_inference.py \
--weights_path "<SAiD_weights>.pth" \
--output_dir "<output_coeffs_dir>"
Remove the FaceTalk_170809_00138_TA/sentence32-xx.csv files from the output directory; the ground-truth data does not contain motion data for FaceTalk_170809_00138_TA/sentence32.
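For example, a small sketch that deletes those files (the filename pattern mirrors the naming above):

```python
# Sketch: remove the sentence32 outputs of FaceTalk_170809_00138_TA, since
# there is no ground-truth motion for that sequence.
import glob
import os

for path in glob.glob("<output_coeffs_dir>/FaceTalk_170809_00138_TA/sentence32-*.csv"):
    os.remove(path)
```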
Evaluate SAiD outputs: FD, WInD, and Multimodality.
python script/test_evaluate.py \
--coeffs_dir "<input_coeffs_dir>" \
[--vae_weights_path "<VAE_weights>.pth"] \
[--blendshape_residuals_path "<blendshape_residuals>.pickle"]
We have to generate videos to compute the AV offset/confidence.
To avoid the memory-leak issue of the pyrender module, we use a shell script.
After updating COEFFS_DIR and OUTPUT_DIR, run the script:
# Fix 1: COEFFS_DIR="<input_coeffs_dir>"
# Fix 2: OUTPUT_DIR="<output_video_dir>"
bash script/test_render.sh
Use SyncNet to compute the AV offset/confidence.
If you use this code as part of any research, please cite the following paper.
@misc{park2023said,
title={SAiD: Speech-driven Blendshape Facial Animation with Diffusion},
author={Inkyu Park and Jaewoong Cho},
year={2023},
eprint={2401.08655},
archivePrefix={arXiv},
primaryClass={cs.CV}
}