yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Contextual learning #34

Closed martinambrus closed 10 months ago

martinambrus commented 10 months ago

Sorry if this question is rather generic, but I'm fairly new to the TTS field and I'm not sure whether what I want to ask is a project-specific thing or a TTS-specific thing.

I'd like to ask how important the context of sentences in the WAV files is for training.

I'm working on my own set of WAV files using the LJSpeech structure. While testing this project's demo output with the model you generously provided (thank you!), I noticed that the output sometimes sounds "wrong" at the end of a sentence.

In some sentences, the voice goes up at the end, as if there were a comma or a question mark there instead of a full stop. In other sentences this does not occur.

When I played back some of the LJSpeech sentences used for training, I found exactly the same behaviour there.

What I'm not sure about is whether the model learns to make the same mistake from the context of the sentence itself, or whether it simply repeats the pattern because the same or similar words appear towards the end of the sentence.

I'm trying to understand how best to create my WAV files so that the model is trained well with regard to the emotional context of each sentence.

Example: "Oh my god! How did that happen?!", exclaimed Anna with a tone of irritation in her voice.

If I use this sentence as a whole, would the TTS learn to use an irritated, surprised emotion wherever a context of irritation is present? Or does this not matter, and the model would only learn the irritated tone from the audio of the quote itself, regardless of the context that follows it?

Thanks for reading and sorry again for a super-long question!

yl4579 commented 10 months ago

Yes, this is one major problem with current TTS models, which treat each sentence independently. In reality, audio clips always depend on their context (the previous or next sentence), so you may also want to use that information in training. For example, you could feed the style of the previous sentence along with the style of the current sentence when synthesizing the current sentence. This is one way to achieve context-aware TTS.
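For illustration, here is a minimal sketch of what fusing the previous sentence's style with the current one could look like, assuming each clip has already been encoded into a fixed-size style vector. The class name, dimensions, and fusion network below are hypothetical, not the actual StyleTTS 2 modules:

```python
import torch
import torch.nn as nn

class ContextualStyleConditioner(nn.Module):
    """Hypothetical sketch: fuse the previous sentence's style vector with the
    current one before it conditions the decoder (not the actual StyleTTS 2 API)."""

    def __init__(self, style_dim: int = 128):
        super().__init__()
        # Project the concatenated [previous_style ; current_style] back down to style_dim.
        self.fuse = nn.Sequential(
            nn.Linear(2 * style_dim, style_dim),
            nn.ReLU(),
            nn.Linear(style_dim, style_dim),
        )

    def forward(self, style_prev: torch.Tensor, style_curr: torch.Tensor) -> torch.Tensor:
        # style_prev / style_curr: (batch, style_dim) vectors from a style encoder,
        # computed on the previous and current audio clip respectively.
        return self.fuse(torch.cat([style_prev, style_curr], dim=-1))


# Usage sketch: during training, pair each clip with its preceding clip and
# condition the decoder on the fused style instead of the single-sentence style.
if __name__ == "__main__":
    style_dim = 128
    conditioner = ContextualStyleConditioner(style_dim)
    style_prev = torch.randn(4, style_dim)   # style of the previous sentence
    style_curr = torch.randn(4, style_dim)   # style of the current sentence
    fused = conditioner(style_prev, style_curr)
    print(fused.shape)  # torch.Size([4, 128])
```

At inference time, the previous sentence's style could come from the previously synthesized (or a reference) clip, or simply be zeroed out for the first sentence of a passage.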

As for your example, if you feed the sentence as a whole, it can of course learn the emotion this way. One problem with the LJSpeech dataset is that it cuts the audio arbitrarily at silences, which is why you see some seemingly unnatural intonation, and also why our model can sound even better than the ground-truth recordings.

Quentin1168 commented 8 months ago

Hello, I have been thinking about your suggestion for achieving context-aware TTS. To employ this method, would it be done during training or during inference? I'm pretty new to this, but it has piqued my interest. Thank you.