yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Use XPhoneBERT instead of the provided PL-BERT checkpoints. #140

Closed: rumbleFTW closed this issue 11 months ago

rumbleFTW commented 11 months ago

Since it is mentioned that we can use XPhoneBERT instead of the provided PL-BERT checkpoints for better multilingual inference, could you shed some light on how to load the XPhoneBERT checkpoints and run inference with them? Thanks!

yl4579 commented 11 months ago

See #28. The quality will be much worse because XPhoneBERT uses CharsiuG2P and is trained solely on phonemes, unlike PL-BERT, so you should keep that in mind. I would suggest waiting for #41 instead.
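
If you still want to try it, the loading itself is straightforward; below is a minimal sketch following the XPhoneBERT README (the `vinai/xphonebert-base` checkpoint and the `text2phonemesequence` package come from that project). Wiring the resulting features into StyleTTS2 in place of PL-BERT is the hard part and would require retraining, since the embedding spaces differ:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from text2phonemesequence import Text2PhonemeSequence

# XPhoneBERT checkpoint and tokenizer from the HuggingFace Hub
xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")

# CharsiuG2P-based phonemizer shipped with XPhoneBERT;
# pick the language code for your target language
text2phone = Text2PhonemeSequence(language="eng-us", is_cuda=False)

# XPhoneBERT expects word-segmented text converted to phoneme sequences
sentence = "Hello world ."
phonemes = text2phone.infer_sentence(sentence)

inputs = tokenizer(phonemes, return_tensors="pt")
with torch.no_grad():
    # (1, seq_len, 768) phoneme-level features
    features = xphonebert(**inputs).last_hidden_state

# These features would stand in for the PL-BERT output inside StyleTTS2,
# which is why a drop-in swap without retraining will not work well.
```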

rumbleFTW commented 11 months ago

Alright, thanks for the clarification, @yl4579!

Actually, I wanted StyleTTS2 to work in my local language. I thought it was a drop-in replacement from the readme :sweat_smile: What do you think would be my best bet? Should I train PL-BERT from scratch on my own data? If so, how much data and how much training time would be sufficient to get good results?

Thanks!

yl4579 commented 11 months ago

I think you could either skip the pre-trained PL-BERT and initialize a BERT model from scratch (like https://github.com/yl4579/StyleTTS2/issues/139#issuecomment-1849280509), or, to maximize quality, train your own PL-BERT. You can probably just use Wikipedia in your language as the training corpus. I am currently collecting data and training a multilingual PL-BERT; see #41.
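
If you go the from-scratch route, something like the sketch below is what I mean. PL-BERT is an ALBERT variant, and the hyperparameters here are meant to mirror the PL-BERT repo's defaults; treat them as assumptions and check them against that repo's config before using them:

```python
from transformers import AlbertConfig, AlbertModel

# Hyperparameters roughly mirroring PL-BERT's config (verify against the
# PL-BERT repo; these exact values are assumptions)
config = AlbertConfig(
    vocab_size=178,              # size of the phoneme token set
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=2048,
    num_hidden_layers=12,
    max_position_embeddings=512,
)

# Randomly initialized, with no phoneme-level pre-training; StyleTTS2
# would then have to learn useful phoneme representations during its
# own training, which is why quality suffers compared to a trained PL-BERT
bert = AlbertModel(config)
```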