yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Training a Japanese model, pitch accent and IPA #186

Closed: NielsVandenEynde closed this issue 8 months ago

NielsVandenEynde commented 10 months ago

First of all, thanks for this awesome research. Voice cloning desperately needs to be open source.

I'm interested in training a Japanese model, and I have over a thousand hours of speech data.

However, I'm a bit concerned about having to convert my transcriptions to IPA. Japanese has pitch accent, and the pitch can change within a word. For example, 橋 and 箸 are both pronounced "hashi", but their pitch patterns differ. When text is converted to IPA, such as in this topic, this information is lost. Is there a way to train a model on just the "raw" text? Besides that, I just need to train or find a Japanese BERT model, right? Is there anything else I should be aware of?

Thanks in advance

yl4579 commented 8 months ago

Sorry for the late reply; I have been very busy recently. For pitch in Japanese, you may refer to https://github.com/yl4579/StyleTTS/issues/10#issuecomment-1407789937. The pitch can easily be extracted from OpenJTalk's output: https://github.com/yl4579/PL-BERT/issues/6#issuecomment-1797869275
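
For readers following along, here is a rough sketch of one way pitch could be read out of OpenJTalk's full-context labels. It is not necessarily what the linked comments do. It assumes the HTS-style label layout, where the `/A:` field carries the mora position inside the accent phrase and the `/F:` field carries the phrase's mora count and accent type. The helper names `pitch_pattern` and `phones_with_pitch` are made up for this example, and the sample labels are hand-written, abbreviated fragments rather than real OpenJTalk output.

```python
import re

# Hypothetical helpers. The regexes assume HTS-style full-context labels of
# the form "p1^p2-p3+p4=p5/A:a1+a2+a3/.../F:f1_f2#...".
_PHONE = re.compile(r"-(.+?)\+")                    # p3: current phoneme
_MORA_POS = re.compile(r"/A:(-?\d+)\+(\d+)\+(\d+)/")  # a2: mora index in phrase
_PHRASE = re.compile(r"/F:(\d+)_(\d+)#")            # f1: mora count, f2: accent type


def pitch_pattern(num_moras, accent_type):
    """Tokyo-style high/low pattern, one letter per mora."""
    pattern = []
    for mora in range(1, num_moras + 1):
        if accent_type == 0:        # heiban: low start, then high
            high = mora > 1
        elif accent_type == 1:      # atamadaka: high start, then low
            high = mora == 1
        else:                       # rise after mora 1, fall after the nucleus
            high = 1 < mora <= accent_type
        pattern.append("H" if high else "L")
    return "".join(pattern)


def phones_with_pitch(labels):
    """Return (phoneme, 'H' or 'L') pairs, skipping silences and pauses."""
    out = []
    for label in labels:
        phone = _PHONE.search(label)
        pos = _MORA_POS.search(label)
        phrase = _PHRASE.search(label)
        if not (phone and pos and phrase):
            continue  # sil/pau labels carry "xx" instead of numbers
        mora_index = int(pos.group(2))
        num_moras, accent_type = int(phrase.group(1)), int(phrase.group(2))
        level = pitch_pattern(num_moras, accent_type)[mora_index - 1]
        out.append((phone.group(1), level))
    return out


# Hand-written, abbreviated labels for 箸 (2 moras, accent type 1).
# With pyopenjtalk installed, real labels would come from
# pyopenjtalk.extract_fullcontext(text).
sample_labels = [
    "xx^xx-sil+h=a/A:xx+xx+xx/F:xx_xx#xx_xx",
    "xx^sil-h+a=sh/A:0+1+2/F:2_1#0_xx",
    "sil^h-a+sh=i/A:0+1+2/F:2_1#0_xx",
    "h^a-sh+i=sil/A:1+2+1/F:2_1#0_xx",
    "a^sh-i+sil=xx/A:1+2+1/F:2_1#0_xx",
]
print(phones_with_pitch(sample_labels))
```

With accent type 1 the first mora of 箸 comes out high and the second low, while `pitch_pattern(2, 2)` gives the low-high shape of 橋. Appending such per-phoneme pitch marks to the phoneme sequence is one way to keep the distinction that plain IPA drops.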