yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Better LJSpeech or LibriTTS for finetuning a single speaker voice? Or training from scratch with not so much data? #226

Open Sweetapocalyps3 opened 3 months ago

Sweetapocalyps3 commented 3 months ago

Hi everyone,

I'm wondering whether LJSpeech or LibriTTS is the proper candidate for fine-tuning a single speaker's voice. I've seen that there is a multispeaker boolean field in the configuration, which in my case should presumably be set to false, but I don't know whether this implies I have to use LJSpeech, since LibriTTS is a multi-speaker dataset.

Or would it be even better to train the model from scratch? I'm considering it, but I suspect I have too few samples (126 clean audio files, totaling almost 19 minutes).

Thank you in advance.

meng2468 commented 2 months ago

LibriTTS is by far the better choice: that model has seen multiple speakers, so it can adapt far better to a small dataset from a single new speaker.

You can leave all of the settings in config_ft.yml the same (changing only the dataset paths, then the batch size and window size depending on your hardware). Multi-speaker should be kept true; just make sure that in your dataset metafiles the speaker_id is set to the same ID for every file.
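For illustration, the train/val metafiles pair each audio path with its transcription and a speaker ID, one entry per line. A minimal sketch of a single-speaker metafile (the filenames and transcriptions here are hypothetical; only the "same speaker_id on every line" point is from the advice above):

```
Data/my_speaker/0001.wav|This is the first sample.|0
Data/my_speaker/0002.wav|This is another sample.|0
Data/my_speaker/0003.wav|And one more for good measure.|0
```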

Training the model from scratch with 19 minutes of data will most likely yield bad results, although I haven't tried it myself.

Helpful details on fine-tuning: https://github.com/yl4579/StyleTTS2/discussions/81

GUUser91 commented 1 month ago

You can use vokan. https://huggingface.co/ShoukanLabs/Vokan

traderpedroso commented 1 month ago

> You can use vokan. https://huggingface.co/ShoukanLabs/Vokan

The expressions and emphasis in the voices sound really natural, but there are always noises at the beginning and especially at the end. I believe padding the clips with silence at the start and end was missing during training.
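If that diagnosis is right, one simple preprocessing fix is to pad every training clip with a short stretch of silence before fine-tuning. A minimal sketch, assuming 24 kHz mono audio loaded as a NumPy float array; the 50 ms pad length and the function name `pad_silence` are my own choices, not from this thread:

```python
import numpy as np

def pad_silence(audio: np.ndarray, sr: int = 24000, pad_ms: int = 50) -> np.ndarray:
    """Prepend and append pad_ms milliseconds of digital silence."""
    pad = np.zeros(int(sr * pad_ms / 1000), dtype=audio.dtype)
    return np.concatenate([pad, audio, pad])

# Example: a 1-second dummy clip grows by 2 * 1200 samples at 24 kHz.
clip = np.random.randn(24000).astype(np.float32)
padded = pad_silence(clip)
print(padded.shape[0])  # 26400
```

You would run this over each file (e.g. via soundfile or librosa for I/O) before generating the metafiles, so the model learns to start and end utterances cleanly.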