StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation.
The abstract seems to suggest that the LJSpeech model is better. Should I fine-tune using the LJSpeech model or the LibriTTS model?