yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Finetuning and dataset preparation #16

Closed by AWAS666 9 months ago

AWAS666 commented 9 months ago

First of all, is it already possible to fine-tune a single-speaker model? If so, what should one pay attention to?

Second: how do you prepare a dataset? The train and val lists are pretty clear, but the OOD_text confuses me a little. How do I obtain those?

yl4579 commented 9 months ago

I think you can't really fine-tune a single-speaker model, because it does not have enough inductive bias to adapt to other speakers from a small amount of data. Please wait for the multispeaker model trained on LibriTTS to finish; that one will be suitable for fine-tuning.

As for how to prepare the data, you can build OOD_text from any text that does not appear in the train or val data. It serves as an out-of-distribution dataset that improves robustness to texts drastically different from those in the train and val sets.
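A minimal sketch of that preparation step, assuming the train/val lists use pipe-delimited lines ("wav_path|transcription|speaker") and that the OOD list simply needs texts absent from both; check the exact OOD_texts file format against the examples shipped in the repo's Data directory before using this:

```python
def build_ood_texts(corpus_lines, train_lines, val_lines):
    """Keep only corpus sentences that appear in neither train nor val.

    train_lines/val_lines: iterables of "wav_path|transcription|speaker".
    corpus_lines: iterable of raw sentences from any external text source.
    """
    seen = set()
    for line in list(train_lines) + list(val_lines):
        parts = line.strip().split("|")
        if len(parts) >= 2:
            # The transcription is the second pipe-delimited field.
            seen.add(parts[1])
    # Drop blanks and anything already covered by train/val.
    return [s.strip() for s in corpus_lines if s.strip() and s.strip() not in seen]

train = ["a.wav|hello world|0"]
val = ["b.wav|good morning|0"]
corpus = ["hello world", "an entirely unrelated sentence", "good morning"]
ood = build_ood_texts(corpus, train, val)
# Only the sentence absent from train/val survives.
```

Any large public text corpus works as the source, as long as its domain differs from the training transcriptions.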