Training a Japanese model, pitch accent and IPA

yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

MIT License

First of all, thanks for this awesome research. Voice cloning desperately needs to be open-sourced.

I'm interested in training a Japanese model, and I have over a thousand hours of speech data.

However, I'm a bit concerned about having to convert my transcriptions to IPA. Japanese has pitch accent, and the pitch can change partway through a word. For example, 橋 (bridge) and 箸 (chopsticks) are both pronounced "hashi", but their pitch patterns differ. When text is converted to IPA, as in this topic, that information is lost. Is there a way to train a model on just the "raw" text? Besides that, I just need to train or find a Japanese BERT model, right? Is there anything else I should be aware of?
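To make the concern concrete, here is a minimal Python sketch of what I mean. It is a toy of my own, not anything from StyleTTS2: the `LEXICON` entries are hand-written, `pitch_pattern` and `annotate` are hypothetical helpers, and the "↓" marker is just an illustrative symbol for where the pitch falls. Whether the StyleTTS2 text pipeline would accept an extra symbol like this is an assumption I haven't verified.

```python
# Hand-written toy lexicon (hypothetical): word -> (morae, accent nucleus).
# Nucleus 0 means the pitch never falls; nucleus n means it falls after mora n.
LEXICON = {
    "箸": (["ha", "shi"], 1),  # chopsticks
    "橋": (["ha", "shi"], 2),  # bridge
    "端": (["ha", "shi"], 0),  # edge
}


def pitch_pattern(n_morae, nucleus):
    """Return H/L for each mora plus one slot for a following particle.

    Uses the usual Tokyo-style description: the first mora is low unless the
    nucleus is on it, and everything after the nucleus is low.
    """
    pattern = []
    for i in range(1, n_morae + 2):  # +1 slot for the particle
        if nucleus == 0:
            high = i > 1
        elif nucleus == 1:
            high = i == 1
        else:
            high = 1 < i <= nucleus
        pattern.append("H" if high else "L")
    return pattern


def annotate(word, fall_marker="↓"):
    """Return the romanized word with a marker after the nucleus mora."""
    morae, nucleus = LEXICON[word]
    out = []
    for i, mora in enumerate(morae, start=1):
        out.append(mora)
        if i == nucleus:
            out.append(fall_marker)
    return "".join(out)


for word in LEXICON:
    morae, nucleus = LEXICON[word]
    plain = "".join(morae)  # what a pitch-blind conversion would keep
    print(word, plain, annotate(word), pitch_pattern(len(morae), nucleus))
```

All three words collapse to the same plain string "hashi", while the annotated strings and H/L patterns stay distinct. That is the information I would like the model to see, either through a marker like this in the phoneme sequence or through the raw text itself.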

Thanks in advance

Training a Japanese model, pitch accent and IPA #186