yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Issue with improper pauses and random bursts of noise #233

Open king-dahmanus opened 6 months ago

king-dahmanus commented 6 months ago

Hello there, devs of StyleTTS 2. It's a great model, you really did a good job. I mainly use it through the Hugging Face demo, but there are some issues. Firstly, it pauses after the dash (-) symbol: for example, it reads "white-clothed" as "White. Clothed". Please fix this. Secondly, it sometimes produces random bursts of distorted noise and skips words. Can you find a way to fix this? Is this an issue with the pretrained model or with the architecture itself? Thanks and regards

martinambrus commented 2 months ago

The model is trained on LJSpeech and LibriTTS data, and both datasets are flawed: they contain errors in their transcriptions, one result of which is that incorrect pause on a dash. To mitigate this, you'll need to train your own model using your own data. You can also find more useful information about training and fine-tuning in this discussion thread.
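For anyone who only runs inference and cannot retrain, a possible input-side workaround is to rewrite intra-word hyphens before the text reaches the model. The sketch below is an assumption, not something confirmed in this thread: `soften_hyphens` is a hypothetical helper, and whether a space really avoids the pause depends on how the model and its phonemizer treat the result.

```python
import re

# Matches a hyphen (or Unicode hyphen) that sits directly between two word
# characters, e.g. "white-clothed". Dashes surrounded by spaces are left alone.
_INTRA_WORD_HYPHEN = re.compile(r"(?<=\w)[-\u2010\u2011](?=\w)")


def soften_hyphens(text: str) -> str:
    """Replace intra-word hyphens with a single space (hypothetical helper)."""
    return _INTRA_WORD_HYPHEN.sub(" ", text)


print(soften_hyphens("a white-clothed figure"))
```

Note that `\w` also matches digits, so a range such as "1990-1995" would be rewritten too; tighten the pattern if that matters for your text.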

dnlzsy commented 2 weeks ago

Can I download the data from "https://keithito.com/LJ-Speech-Dataset/" and train it?

martinambrus commented 2 weeks ago

Yes, you could. However, bear in mind that the dataset is flawed, so you'd end up with exactly the same problems as the sample StyleTTS2 model has.
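If you do start from LJSpeech, it may help to list the transcripts that contain dashes so they can be reviewed or corrected before training. This is a minimal sketch that assumes the dataset's `metadata.csv` layout (pipe-delimited `id|transcription|normalized transcription`); `rows_with_dashes` is a hypothetical helper, and the two sample rows are made up for illustration, not taken from the dataset.

```python
import csv
import io
import re

# Hyphen, en dash, or em dash anywhere in the transcript.
DASH = re.compile(r"[-\u2013\u2014]")


def rows_with_dashes(metadata_file):
    """Yield (clip_id, text) for transcripts that contain a dash.

    Assumes pipe-delimited rows without quoting, where the last field is
    the normalized transcription.
    """
    reader = csv.reader(metadata_file, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        clip_id, text = row[0], row[-1]
        if DASH.search(text):
            yield clip_id, text


# Made-up rows in the assumed layout; replace with open("metadata.csv", encoding="utf-8").
sample = io.StringIO(
    "demo-0001|a white-clothed man|a white-clothed man\n"
    "demo-0002|nothing to see here|nothing to see here\n"
)
flagged = list(rows_with_dashes(sample))
print(flagged)
```

Only the transcript field is searched, so dashes inside clip IDs do not cause a row to be flagged.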