This repository contains a PyTorch implementation of LVC-VC, a zero-shot voice conversion model described in our Interspeech 2023 paper, End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions.
Additionally, it includes code for a larger, improved version of our model (not described in the paper), which we call LVC-VC XL. This version uses a larger channel size of 32 (rather than 16) in its LVC layers, uses embeddings from XLSR-53 as content features, and applies information perturbation to extract only linguistic information from them (as done in NANSY). It also uses speaker embeddings from ECAPA-TDNN rather than Fast ResNet-34. LVC-VC XL achieves significantly better performance than the base version of our model in terms of both intelligibility and voice style transfer, and we encourage you to use it over the base version if memory and compute allow.
Audio samples are available on our demo page.
If you find this work or our code useful, please consider citing our paper:
@inproceedings{kang23b_interspeech,
author={Wonjune Kang and Mark Hasegawa-Johnson and Deb Roy},
title={{End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
pages={2303--2307},
doi={10.21437/Interspeech.2023-2298}
}
You can install all dependencies by running
pip install -r requirements.txt
Create a directory called weights in the working directory, and save the pretrained weights from the Google Drive link there. We include pre-trained weights for LVC-VC, Fast ResNet-34, LVC-VC XL, and ECAPA-TDNN.
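After downloading, the weights directory should contain the checkpoint files referenced by the inference commands below (the model-to-file mapping here is inferred from those commands):

weights/
    lvc_vc_vctk.pt              (LVC-VC base model)
    resnet34sel_pretrained.pt   (Fast ResNet-34 speaker encoder, used with the base model)
    lvc_vc_xl_vctk.pt           (LVC-VC XL)
    ecapa_tdnn_pretrained.pt    (ECAPA-TDNN speaker encoder, used with LVC-VC XL)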
If you want to train a model from scratch, you will need to download the VCTK dataset. Then, run
./preprocess_data.sh
to preprocess all the data. Among other things, this script will save:

- per-speaker statistics as {speaker_id: {'median': --, 'std': --}} in pickle format
- per-speaker GMMs as {'speaker_id': sklearn.mixture.GaussianMixture object} in pickle format

Note that the preprocessing scripts have directories and file paths hardcoded in, so you will need to change them as needed if running on your own machine. The script also extracts and preprocesses the data needed for both the base and XL versions of LVC-VC; if you are only interested in training one or the other, comment out the corresponding parts of the code as needed.
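As a quick sanity check after preprocessing, you can load one of the generated pickle files and inspect its contents. The filename below is purely a placeholder; use whatever path your run of the preprocessing scripts actually wrote to.

# Hypothetical sanity check; adjust the pickle path to wherever your
# preprocessing run saved its metadata.
python3 -c "
import pickle
with open('metadata/speaker_stats.pkl', 'rb') as f:   # placeholder filename
    stats = pickle.load(f)
spk = next(iter(stats))
print(spk, stats[spk])   # expect something like {'median': ..., 'std': ...}
"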
You can train a model by specifying a config file, base GPU index, and run name. The base GPU index specifies the first GPU to use on your machine; training then uses the next consecutive num_gpus GPUs as specified in the config file (e.g. if you specify -g 0 and num_gpus: 4, you will train using GPUs [0,1,2,3]). You can also continue training from a checkpoint using the -p flag.
python3 trainer.py \
-c config/config_wav2vec_ecapa_c32.yaml \
-g 0 \
-n lvc_vc_wav2vec_ecapa_c32
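To resume this run from a saved checkpoint, pass the checkpoint path with the -p flag (the path below is just a placeholder):

# Resume training from a previously saved checkpoint.
# path/to/saved_checkpoint.pt is a placeholder; point it at your own checkpoint.
python3 trainer.py \
-c config/config_wav2vec_ecapa_c32.yaml \
-g 0 \
-n lvc_vc_wav2vec_ecapa_c32 \
-p path/to/saved_checkpoint.pt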
If you are training the base version of LVC-VC using spectrograms as content features, you will also need to supplement self-reconstructive training with the speaker similarity criterion (SSC). To do this, first train a model to convergence using config/config_spect_c16.yaml, and then continue training from the last checkpoint with config/config_spect_c16_ssc.yaml; a sketch of this two-stage procedure is shown below. Training with the SSC loss saves model checkpoints every 400 iterations; you may need to test a few checkpoints to find the best trade-off between audio quality and voice style transfer performance.
(This is one of the reasons we encourage you to use LVC-VC XL; it achieves better performance without needing the additional SSC training step.)
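Concretely, the two stages look something like the following; the run names and checkpoint path here are placeholders, so substitute your own.

# Stage 1: train the base spectrogram model to convergence.
python3 trainer.py \
-c config/config_spect_c16.yaml \
-g 0 \
-n lvc_vc_spect_c16

# Stage 2: continue from the last checkpoint of stage 1 with the SSC config.
python3 trainer.py \
-c config/config_spect_c16_ssc.yaml \
-g 0 \
-n lvc_vc_spect_c16_ssc \
-p path/to/last_stage1_checkpoint.pt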
Depending on which version of the model you are using, run either inference_wav2vec.py (for LVC-VC XL, which uses wav2vec/XLSR-53 content features) or inference_spect.py (for the base model, which uses spectrogram content features). If you are running inference_wav2vec.py without having run the data preprocessing first, you can use the metadata pickle files in the metadata directory of this repository.
python3 inference_wav2vec.py \
-c config/config_wav2vec_ecapa_c32.yaml \
-p weights/lvc_vc_xl_vctk.pt \
-e weights/ecapa_tdnn_pretrained.pt \
-g 0 \
-s source_utterance_file \
-t target_utterance_file \
-o output_file_name
python3 inference_spect.py \
-c config/config_spect_c16.yaml \
-p weights/lvc_vc_vctk.pt \
-e weights/resnet34sel_pretrained.pt \
-g 0 \
-s source_utterance_file \
-t target_utterance_file \
-o output_file_name
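To convert several source utterances to the same target speaker, you can loop over the same interface; the directory and file names below are placeholders.

# Hypothetical batch conversion: convert every .wav file in source_utterances/
# to the voice of target_utterance_file. All paths here are placeholders.
for src in source_utterances/*.wav; do
    python3 inference_wav2vec.py \
        -c config/config_wav2vec_ecapa_c32.yaml \
        -p weights/lvc_vc_xl_vctk.pt \
        -e weights/ecapa_tdnn_pretrained.pt \
        -g 0 \
        -s "$src" \
        -t target_utterance_file \
        -o "converted_$(basename "$src")"
done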
We referred to the following repositories and resources in our code:

- utils/perturbations.py (used for LVC-VC XL)