Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot and Ethan Fetaya
The demo page includes many sample videos and comparisons to other baselines.
Official implementation of LipVoicer, a lip-to-speech method. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism where a pre-trained ASR serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test set and an unrestricted vocabulary.
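To make the guidance scheme concrete, here is a minimal conceptual sketch of one denoising step, written for illustration only; it is not the code in this repository, and the interfaces, argument names and guidance weights (melgen, asr_logp, w_video, w_asr) are placeholders.

import torch

def guided_denoising_step(melgen, asr_logp, mel_t, t, video, text, w_video, w_asr):
    """One illustrative denoising step combining video conditioning and ASR guidance.

    Hypothetical interfaces (placeholders, not this repository's API):
      melgen(mel_t, t, video)  -> predicted noise of the video-conditioned diffusion model
      asr_logp(mel_t, t, text) -> log p(text | mel_t, t) from the time-step-aware ASR
    """
    # Classifier-free guidance on the video condition.
    eps_cond = melgen(mel_t, t, video)
    eps_uncond = melgen(mel_t, t, None)
    eps = (1 + w_video) * eps_cond - w_video * eps_uncond

    # Classifier guidance: gradient of the ASR log-likelihood of the lip-read text.
    with torch.enable_grad():
        mel_in = mel_t.detach().requires_grad_(True)
        grad = torch.autograd.grad(asr_logp(mel_in, t, text).sum(), mel_in)[0]

    # Steer the noise estimate towards samples the ASR transcribes as `text`
    # (exact sign and scaling conventions vary between formulations).
    return eps - w_asr * grad   # plug into the usual DDPM/DDIM update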
The lip reading network used in LipVoicer is taken from the Visual Speech Recognition for Multiple Languages repository. The ASR system is adapted from Audio-Visual Efficient Conformer for Robust Speech Recognition.
git clone https://github.com/yochaiye/LipVoicer.git
cd LipVoicer
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
cd ..
Install ibug.face_detection
git clone https://github.com/hhj1897/face_detection.git
cd face_detection
git lfs pull
pip install -e .
cd ..
Install ibug.face_alignment
git clone https://github.com/hhj1897/face_alignment.git
cd face_alignment
pip install -e .
cd ..
Install ctcdecode
git clone --recursive https://github.com/WayenVan/ctcdecode.git
cd ctcdecode
pip install .
cd ..
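A quick way to verify that these extra dependencies installed correctly is to import them from Python. The module names below are the ones the repositories above are expected to install (ibug.face_detection, ibug.face_alignment, ctcdecode):

# Sanity check for the dependencies installed above.
import ibug.face_detection   # from hhj1897/face_detection
import ibug.face_alignment   # from hhj1897/face_alignment
import ctcdecode             # from WayenVan/ctcdecode

print("All LipVoicer dependencies imported successfully")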
We provide the audio generated by LipVoicer for the test videos of LRS2 and LRS3. These files were used to compute the metrics reported in the paper, so we hope they will facilitate future comparisons.
The links are given below:
We provide pretrained checkpoints for LipVoicer so you can kick-start generating speech for silent videos. You can download checkpoints for the following models.
The simplest and fastest way to download the models is to run
python download_checkpoints.py
which will download all the pretrained checkpoints and put them in the right place in the repository.
Alternatively, you can download individual checkpoints from Google Drive
To generate a speech signal for your video, you first need to edit the following arguments in the hydra config file
generate.ckpt_path
generate.video_path
generate.save_dir
You can also experiment with the values of w_video, w_asr and ast_start. Then run the following command
python inference_real_video.py
which will also take care of converting the video frame rate to 25 fps if necessary, cropping the mouth region and running the lip-reading network.
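If you want to check the frame rate of your video beforehand, a quick check with OpenCV (not part of this repository; assumes opencv-python is installed and uses a placeholder path) looks like this:

import cv2

cap = cv2.VideoCapture("path/to/your_video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()
print(f"video frame rate: {fps:.2f} fps")   # LipVoicer expects 25 fps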
If you wish to generate audio files for all of the test videos of LRS2/LRS3, first download the predicted lip-readings (LRS2, LRS3), and then use the following
python inference_full_test_split.py generate.ckpt_path=<path_to_MelGen_ckpt> \
    generate.save_dir=<save_dir> \
    generate.lipread_text_dir=<lipread_text_dir> \
    dataset.videos_dir=<videos_dir> \
    dataset.audios_dir=<audio_dir> \
    dataset.mouthrois_dir=<mouthrois_dir>
For training LipVoicer on the benchmark datasets, please download LRS2 or LRS3.
The purpose of the data preparation step is to compute the groundtruth mel-spectrograms of the benchmark videos and extract the lip region videos. At the end of the process, you should have the following directory trees for LRS2 and LRS3:
├── LRS2
│   ├── [videos] (contains the videos in .mp4)
│   │   ├── [main]
│   │   └── [pretrain]
│   ├── [audios] (contains the audio files in .wav and .wav.spec)
│   │   ├── [main]
│   │   └── [pretrain]
│   └── [mouth_rois] (contains the mouth ROIs in .npz)
│       ├── [main]
│       └── [pretrain]
├── LRS3
│   ├── [videos] (contains the videos in .mp4)
│   │   ├── [pretrain]
│   │   ├── [trainval]
│   │   └── [test]
│   ├── [audios] (contains the audio files in .wav and .wav.spec)
│   │   ├── [pretrain]
│   │   ├── [trainval]
│   │   └── [test]
│   └── [mouth_rois] (contains the mouth ROIs in .npz)
│       ├── [pretrain]
│       ├── [trainval]
│       └── [test]
To this end, perform the following steps inside the LipVoicer directory:
Extract the audio files from the videos (the audio files will be saved in WAV format)
python dataloaders/extract_audio_from_video.py --ds_dir <path_to_video_dir> \
--split <trainval/test/...> \
--out_dir <output_directory>
The .wav files will be saved to output_directory/split.
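For reference, extracting the audio track of a single video boils down to an ffmpeg call like the one sketched below. This is only an illustration of the kind of operation the script performs; the 16 kHz mono output is an assumption (it is the usual sampling rate for LRS2/LRS3), and the function name is a placeholder.

import subprocess
from pathlib import Path

def extract_wav(video_path: str, out_dir: str, sr: int = 16000) -> str:
    """Extract the audio track of one .mp4 as a mono .wav at the given sampling rate."""
    out_path = Path(out_dir) / (Path(video_path).stem + ".wav")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", str(sr), str(out_path)],
        check=True,
    )
    return str(out_path)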
Compute the log mel-spectrograms and save them
python dataloaders/wav2mel.py dataset.audios_dir=<path_to_directory_with_extracted_wav_files>
It will save the mel-spectrograms with the extension .wav.spec.
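For orientation, a log mel-spectrogram of the kind wav2mel.py produces can be computed with librosa as sketched below. The parameters here (sampling rate, FFT size, hop length, number of mel bins) are assumptions for illustration only; use dataloaders/wav2mel.py itself to generate the training data.

import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Compute a log mel-spectrogram (illustrative parameters, not the repo's exact ones)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))   # shape: (n_mels, frames)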
Crop the mouth regions of the videos, convert them to greyscale and save them to <mouthrois_dir>. The easiest way is to take dataloaders/extract_mouthcrops.py from the LipVoicer repository and run it from the command line. It saves the greyscale mouth crops as NumPy arrays in .npz files.
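To verify a saved mouth-crop file you can load it with NumPy and inspect the array shape; the snippet below uses a placeholder path and does not assume a particular key name inside the .npz archive.

import numpy as np

# Quick inspection of a saved mouth-crop file (path is a placeholder).
with np.load("mouth_rois/main/example.npz") as archive:
    key = list(archive.keys())[0]      # use whatever key the archive stores
    rois = archive[key]

print(key, rois.shape, rois.dtype)     # expected roughly: (num_frames, height, width)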
Train MelGen
CUDA_VISIBLE_DEVICES=0,1 python train_melgen.py train.save_dir=<save_dir> \
    dataset.videos_dir=<videos_path> \
    dataset.audios_dir=<audios_dir> \
    dataset.mouthrois_dir=<mouthrois_dir>
The progress of the training stage is monitored with TensorBoard.
Finetune the modified ASR, which now includes the diffusion time-step embedding. For further details on how to carry out this step, please refer to Audio-Visual Efficient Conformer for Robust Speech Recognition.
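As a rough illustration of what "includes the diffusion time-step embedding" means, the sketch below adds a sinusoidal time-step embedding to the acoustic features before an encoder. It is a generic PyTorch sketch, not the actual Audio-Visual Efficient Conformer code; the module names and dimensions are placeholders, and a GRU stands in for the Conformer encoder.

import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of the diffusion time step t (shape: [batch])."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeConditionedEncoder(nn.Module):
    """Placeholder encoder that injects the time-step embedding into its input features."""
    def __init__(self, feat_dim: int, time_dim: int = 128):
        super().__init__()
        self.time_dim = time_dim
        self.time_proj = nn.Sequential(nn.Linear(time_dim, feat_dim), nn.SiLU())
        self.encoder = nn.GRU(feat_dim, feat_dim, batch_first=True)  # stand-in for the Conformer

    def forward(self, feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim); t: (batch,)
        t_emb = self.time_proj(sinusoidal_embedding(t, self.time_dim))  # (batch, feat_dim)
        out, _ = self.encoder(feats + t_emb[:, None, :])
        return out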
@inproceedings{yemini2024lipvoicer,
    title={LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading},
    author={Yochai Yemini and Aviv Shamsian and Lior Bracha and Sharon Gannot and Ethan Fetaya},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
}