SAiD: Blendshape-based Audio-Driven Speech Animation with Diffusion

Project page: https://yunik1004.github.io/SAiD/
License: Apache License 2.0

This is the code for SAiD: Blendshape-based Audio-Driven Speech Animation with Diffusion.

Installation

Run the following command to install SAiD as a pip package:

pip install .

If you are developing this repo or want to run the scripts, run instead:

pip install -e .[dev]

If there is an error related to pyrender, install additional packages as follows:

apt-get install libboost-dev libglfw3-dev libgles2-mesa-dev freeglut3-dev libosmesa6-dev libgl1-mesa-glx
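
To verify the setup, a quick smoke test such as the following should run without errors once the packages above are installed (a minimal sketch, not part of the repo; it assumes headless rendering via OSMesa):

import os

# Headless rendering backend (assumption: no display is available;
# drop this line if you run with a display).
os.environ.setdefault("PYOPENGL_PLATFORM", "osmesa")

import pyrender

renderer = pyrender.OffscreenRenderer(viewport_width=64, viewport_height=64)
renderer.delete()
print("pyrender offscreen rendering is available")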

Directories

Inference

You can download the pretrained weights of SAiD from Hugging Face Repo.
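
For example, the weights can be fetched programmatically with huggingface_hub (a sketch; the repository id and file name below are placeholders to be taken from the Hugging Face page linked above):

from huggingface_hub import hf_hub_download

# Placeholders: use the repository id and weight file name shown on the
# Hugging Face page linked above.
weights_path = hf_hub_download(repo_id="<huggingface_repo_id>", filename="<SAiD_weights>.pth")
print(weights_path)  # pass this path as --weights_path below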

python script/inference.py \
        --weights_path "<SAiD_weights>.pth" \
        --audio_path "<input_audio>.wav" \
        --output_path "<output_coeffs>.csv" \
        [--init_sample_path "<input_init_sample>.csv"] \  # Required for editing
        [--mask_path "<input_mask>.csv"]  # Required for editing
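
The generated CSV can be inspected with pandas, for instance (a minimal sketch; the layout of one blendshape per column and one frame per row is an assumption based on the coefficient files described below):

import pandas as pd

coeffs = pd.read_csv("<output_coeffs>.csv")
print(coeffs.shape)                 # (num_frames, num_blendshapes)
print(coeffs.columns.tolist()[:3])  # blendshape names, e.g. ['jawForward', ...]
print(coeffs.values.min(), coeffs.values.max())  # coefficients typically lie in [0, 1]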

BlendVOCA

Construct Blendshape Facial Model

Due to the license issue of VOCASET, we cannot distribute BlendVOCA directly. Instead, you can preprocess data/blendshape_residuals.pickle yourself. For the simple execution of the script, we recommend constructing the BlendVOCA directory as follows:

├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   └─ templates
      ├─ ...
      └─ FaceTalk_170915_00223_TA.ply

Then, run the following command:

python script/preprocess_blendvoca.py \
        --blendshapes_out_dir "<output_blendshapes_dir>"

If you want to generate the blendshapes yourself, follow the instructions below.

  1. Unzip data/ARKit_reference_blendshapes.zip.
  2. Download the template meshes from VOCASET.
  3. Crop template meshes using data/FLAME_head_idx.txt. You can crop more indices and then restore them after finishing the construction process.
  4. Use Deformation-Transfer-for-Triangle-Meshes to construct the blendshape meshes.
    • Use data/ARKit_landmarks.txt and data/FLAME_head_landmarks.txt as marker vertices.
    • Find the correspondence map between the neutral meshes, and use it to transfer the deformation of arbitrary meshes.
  5. Create blendshape_residuals.pickle, which contains the blendshape residuals in the following Python dictionary format. Refer to data/blendshape_residuals.pickle.

    {
        'FaceTalk_170731_00024_TA': {
            'jawForward': <np.ndarray object with shape (V, 3)>,
            ...
        },
        ...
    }
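
A pickle in this format could be written as follows (a minimal sketch; the vertex count is a placeholder, and the assumption that a residual is the blendshape mesh's vertices minus the neutral template's vertices follows the term "blendshape residuals"):

import pickle
import numpy as np

num_vertices = 100  # placeholder: use the vertex count of your cropped head mesh

residuals = {
    "FaceTalk_170731_00024_TA": {
        # Residual of the 'jawForward' blendshape: its vertices minus the
        # vertices of the neutral template mesh, shape (V, 3).
        "jawForward": np.zeros((num_vertices, 3)),
        # ... one entry per ARKit blendshape
    },
    # ... one entry per subject
}

with open("blendshape_residuals.pickle", "wb") as f:
    pickle.dump(residuals, f)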

Generate Blendshape Coefficients

You can simply unzip data/blendshape_coeffcients.zip.

If you want to generate coefficients by yourself, we recommend constructing the BlendVOCA directory as follows for the simple execution of the script.

├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   ├─ blendshapes_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ noseSneerRight.obj
   ├─ templates_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA.obj
   └─ unposedcleaneddata
      ├─ ...
      └─ FaceTalk_170915_00223_TA
         ├─ ...
         └─ sentence40

And then, run the following command:

python script/optimize_blendshape_coeffs.py \
        --blendshapes_coeffs_out_dir "<output_coeffs_dir>"

After generating the blendshape coefficients, create coeffs_std.csv, which contains the standard deviation of each coefficient. Refer to data/coeffs_std.csv.

jawForward,...
<std_jawForward>,...
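
One way to produce this file (a sketch under the assumption that each generated coefficient CSV has one blendshape per column and one frame per row):

from pathlib import Path
import pandas as pd

coeffs_dir = Path("<output_coeffs_dir>")  # directory produced by the script above
frames = pd.concat(
    [pd.read_csv(p) for p in sorted(coeffs_dir.glob("**/*.csv"))],
    ignore_index=True,
)

# One header row with the blendshape names, one value row with the stds.
frames.std().to_frame().T.to_csv("coeffs_std.csv", index=False)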

Training / Evaluation on BlendVOCA

Dataset Directory Setting

We recommend constructing the BlendVOCA directory as follows for the simple execution of scripts.

├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   ├─ audio
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ sentence40.wav
   ├─ blendshape_coeffs
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ sentence40.csv
   ├─ blendshapes_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ noseSneerRight.obj
   └─ templates_head
      ├─ ...
      └─ FaceTalk_170915_00223_TA.obj
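
A quick check of the layout (a sketch; it assumes BlendVOCA sits next to the repository directory as in the tree above):

from pathlib import Path

root = Path("../BlendVOCA")
for name in ("audio", "blendshape_coeffs", "blendshapes_head", "templates_head"):
    path = root / name
    print(f"{path}: {'ok' if path.is_dir() else 'MISSING'}")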

Training VAE, SAiD

Evaluation

  1. Generate SAiD outputs on the test speech data

    python script/test_inference.py \
            --weights_path "<SAiD_weights>.pth" \
            --output_dir "<output_coeffs_dir>"
  2. Remove the FaceTalk_170809_00138_TA/sentence32-xx.csv files from the output directory. The ground-truth data does not contain motion data for FaceTalk_170809_00138_TA/sentence32.
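
     For example, a small helper like this can do the cleanup (not part of the repo; the file-name pattern is an assumption based on the naming above):

     from pathlib import Path

     out_dir = Path("<output_coeffs_dir>")
     for p in out_dir.glob("FaceTalk_170809_00138_TA/sentence32-*.csv"):
         p.unlink()
         print(f"removed {p}")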

  3. Evaluate SAiD outputs: FD, WInD, and Multimodality.

    python script/test_evaluate.py \
            --coeffs_dir "<input_coeffs_dir>" \
            [--vae_weights_path "<VAE_weights>.pth"] \
            [--blendshape_residuals_path "<blendshape_residuals>.pickle"]
  4. We have to generate videos to compute the AV offset/confidence. To avoid the memory leak issue of the pyrender module, we use a shell script. After updating COEFFS_DIR and OUTPUT_DIR in script/test_render.sh, run it:

     # In script/test_render.sh, set:
     #   COEFFS_DIR="<input_coeffs_dir>"
     #   OUTPUT_DIR="<output_video_dir>"
     bash script/test_render.sh
  5. Use SyncNet to compute the AV offset/confidence.

Reference

If you use this code as part of any research, please cite the following paper.

@misc{park2023said,
      title={SAiD: Speech-driven Blendshape Facial Animation with Diffusion},
      author={Inkyu Park and Jaewoong Cho},
      year={2023},
      eprint={2401.08655},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}