
LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

[Paper](https://openreview.net/pdf?id=ZZCPSC5OgD) | [Demo Page](https://lipvoicer.github.io/) | [Introduction](#introduction) | [Test Files](#benchmarks-test-audio-signals-generated-by-lipvoicer) | [Pretrained Models](#pretrained-models) | [Inference](#inference) | [Training](#training)

Authors

Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot and Ethan Fetaya

Demo Page

The demo page includes many sample videos and comparisons to other baselines.

Introduction

Official implementation of LipVoicer, a lip-to-speech method. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism where a pre-trained ASR serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test set and an unrestricted vocabulary.

The lip reading network used in LipVoicer is taken from the Visual Speech Recognition for Multiple Languages repository. The ASR system is adapted from Audio-Visual Efficient Conformer for Robust Speech Recognition.
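
To make the guidance mechanism concrete, the sketch below shows a single reverse-diffusion step that combines classifier-free guidance on the video with classifier guidance from the ASR towards the lip-read text. The names (melgen, asr_log_prob, w_video, w_asr) are illustrative placeholders, not the actual interfaces of this repository, and the exact weighting used in the paper may differ.

    import torch

    # Schematic double guidance at one reverse-diffusion step (illustration only;
    # names and weighting are assumptions, not the repository's actual API).
    def guided_noise_estimate(melgen, asr_log_prob, x_t, t, video, text,
                              w_video, w_asr, sqrt_one_minus_abar_t):
        # Classifier-free guidance on the video condition
        eps_cond = melgen(x_t, t, video)
        eps_uncond = melgen(x_t, t, None)
        eps = (1 + w_video) * eps_cond - w_video * eps_uncond

        # Classifier guidance from the pre-trained ASR: push x_t towards
        # mel-spectrograms whose transcription matches the lip-read text
        with torch.enable_grad():
            x_in = x_t.detach().requires_grad_(True)
            log_p = asr_log_prob(x_in, t, text)              # log p(text | x_t, t)
            grad = torch.autograd.grad(log_p.sum(), x_in)[0]
        return eps - w_asr * sqrt_one_minus_abar_t * grad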

Installation

  1. Clone the repository:
    git clone https://github.com/yochaiye/LipVoicer.git
    cd LipVoicer
  2. Install the required packages and ffmpeg
    pip install -r requirements.txt
    conda install -c conda-forge ffmpeg
    cd ..
  3. Install ibug.face_detection
    git clone https://github.com/hhj1897/face_detection.git
    cd face_detection
    git lfs pull
    pip install -e .
    cd ..
  4. Install ibug.face_alignment
    git clone https://github.com/hhj1897/face_alignment.git
    cd face_alignment
    pip install -e .
    cd ..
  5. Install RetinaFace or MediaPipe face tracker
  6. Install ctcdecode for the ASR beam search
    git clone --recursive https://github.com/WayenVan/ctcdecode.git
    cd ctcdecode
    pip install .
    cd ..

Benchmarks: Test Audio Signals Generated by LipVoicer

We provide the audio signals generated by LipVoicer for the test videos of LRS2 and LRS3. They were used to compute the metrics reported in the paper, so they should hopefully facilitate future comparisons.

The links are given below:

Pretrained Models

We provide pretrained checkpoints for LipVoicer so you can kick-start generating speech for silent videos. You can download checkpoints for the following models:

The simplest and fastest way to download the models is to run

python download_checkpoints.py

which will download all the pretrained checkpoints and put them in the right place in the repository.

Alternatively, you can download individual checkpoints from Google Drive.

Inference

In-the-Wild Videos

To generate a speech signal for your video, you first need to edit the following arguments in the Hydra config file.

You can also play with the values of w_video, w_asr and asr_start. Then run the following command:

python inference_real_video.py

which will also take care of converting the video frame rate to 25 fps if necessary, cropping the mouth region and running the lip-reader.
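
For reference, the frame-rate conversion performed by the script could look roughly like the sketch below (an illustration using ffmpeg; the actual implementation in inference_real_video.py may differ):

    import subprocess

    # Illustration only: resample a video to 25 fps with ffmpeg.
    # inference_real_video.py performs this step automatically when needed.
    def resample_to_25fps(src_path: str, dst_path: str) -> None:
        subprocess.run(
            ["ffmpeg", "-y", "-i", src_path, "-filter:v", "fps=25", dst_path],
            check=True,
        )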

Test Videos for LRS2/LRS3

If you wish to generate audio files for all of the test videos of LRS2/LRS3, first download the predicted lip-readings (LRS2, LRS3), and then use the following

python inference_full_test_split.py generate.ckpt_path=<path_to_MelGen_ckpt> \
                                    generate.save_dir=<save_dir> \
                                    generate.lipread_text_dir=<lipread_text_dir> \
                                    dataset.videos_dir=<videos_dir> \
                                    dataset.audios_dir=<audio_dir> \
                                    dataset.mouthrois_dir=<mouthrois_dir>

Training

For training LipVoicer on the benchmark datasets, please download LRS2 or LRS3.

Data Preparation

The purpose of the data preparation step is to compute the ground-truth mel-spectrograms of the benchmark videos and to extract the lip-region videos. At the end of the process, you should have the following directory trees for LRS2 and LRS3:

LRS2
├── videos          (the videos in .mp4)
│   ├── main
│   └── pretrain
├── audios          (the audio files in .wav and .wav.spec)
│   ├── main
│   └── pretrain
└── mouth_rois      (the mouth ROIs in .npz)
    ├── main
    └── pretrain

LRS3
├── videos          (the videos in .mp4)
│   ├── pretrain
│   ├── trainval
│   └── test
├── audios          (the audio files in .wav and .wav.spec)
│   ├── pretrain
│   ├── trainval
│   └── test
└── mouth_rois      (the mouth ROIs in .npz)
    ├── pretrain
    ├── trainval
    └── test

To this end, perform the following steps inside the LipVoicer directory:

  1. Extract the audio files from the videos (audio files will be saved in a WAV format)

    python dataloaders/extract_audio_from_video.py --ds_dir <path_to_video_dir> \
                                               --split <trainval/test/...>  \
                                               --out_dir <output_directory>

    The wav files will be saved to <output_directory>/<split>

  2. Compute the log mel-spectrograms and save them

    python dataloaders/wav2mel.py dataset.audios_dir=<path_to_directory_with_extracted_wav_files>

    It will save the mel-spectrograms with extension .wav.spec (a rough sketch of this computation appears after these steps).

  3. Crop the mouth regions of the videos, convert them to greyscale and save them to <mouthrois_dir>. The easiest way is to:

    1. Clone Visual Speech Recognition for Multiple Languages.
    2. Download the landmarks for LRS2/LRS3.
    3. Copy dataloaders/extract_mouthcrops.py from the LipVoicer repository and run it from the command line. It saves the greyscale mouth crops as numpy arrays in .npz files.
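
For reference, the log mel-spectrogram computation of step 2 is conceptually similar to the sketch below. The STFT/mel parameters (sample rate, n_fft, hop length, number of mel bins) and the on-disk format are illustrative assumptions; the authoritative values are in dataloaders/wav2mel.py.

    import librosa
    import numpy as np
    import torch

    # Illustrative sketch of step 2: compute a log mel-spectrogram and save it
    # next to the wav file as <name>.wav.spec. Parameter values are assumptions.
    def wav_to_log_mel(wav_path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
        y, _ = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
        )
        log_mel = np.log(np.clip(mel, 1e-5, None))
        torch.save(torch.from_numpy(log_mel), wav_path + ".spec")
        return log_mel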

Train

  1. Train MelGen

    CUDA_VISIBLE_DEVICES=0,1 python train_melgen.py train.save_dir=<save_dir> \
                                                dataset.videos_dir=<videos_path> \
                                                dataset.audios_dir=<audios_dir> \
                                                dataset.mouthrois_dir=<mouthrois_dir>

    The progress of the training stage can be monitored with TensorBoard (a schematic sketch of the MelGen diffusion training step appears after this list).

  2. Finetune the modified ASR, which now includes the diffusion time-step embedding. For further details on how to carry out this step, please refer to Audio-Visual Efficient Conformer for Robust Speech Recognition.
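
For orientation, a single MelGen training step (step 1 above) follows the standard denoising-diffusion recipe sketched below: noise the ground-truth mel-spectrogram and train the network to predict that noise given the mouth-ROI video. The model interface (noisy_mel, t, video) and the noise-prediction parameterisation are assumptions for illustration; see train_melgen.py for the actual implementation.

    import torch
    import torch.nn.functional as F

    # Schematic DDPM-style training step for a video-conditioned mel diffusion
    # model. `model` is assumed to predict the added noise; the real MelGen
    # interface may differ.
    def diffusion_training_step(model, mel, video, alphas_cumprod):
        B = mel.shape[0]
        t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=mel.device)
        noise = torch.randn_like(mel)
        a_bar = alphas_cumprod[t].view(B, 1, 1)          # \bar{alpha}_t per sample
        noisy_mel = a_bar.sqrt() * mel + (1 - a_bar).sqrt() * noise
        pred_noise = model(noisy_mel, t, video)          # video-conditioned denoiser
        return F.mse_loss(pred_noise, noise)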

Citation

@inproceedings{yemini2024lipvoicer,
  title={LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading},
  author={Yochai Yemini and Aviv Shamsian and Lior Bracha and Sharon Gannot and Ethan Fetaya},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}