Went through a few of your answers in the issues and would like to know:
1) Do your suggested modifications for the VITS and FastSpeech 2 models apply only to inference, or to finetuning too?
2) In the README you mention some changes to be made in train_second.py of StyleTTS if one wishes to include PL-BERT. Does that mean train_first.py can be skipped altogether, or does stage one have to be done without PL-BERT? I am only interested in finetuning the pre-trained model.
3) The first modification, https://github.com/yl4579/StyleTTS/blob/main/models.py#L683, points to the line where the discriminator is instantiated, which then becomes an argument to Munch(), but the replacement code doesn't instantiate the discriminator anywhere.
I suppose either the line pointed to there has to be kept as-is, or the replacement Munch instantiation should not include the discriminator. Which is the correct implementation, if you'd like to tell?
4) Any final comments on comparing StyleTTS directly with Tortoise in terms of inference quality and speed?
You have to train VITS and FastSpeech 2 (or any TTS model) from scratch if you want to use a different text encoder than the original one (including PL-BERT). So the answer to your question is neither; you need to train the TTS model from scratch.
The first stage is independent of PL-BERT, as it only trains the acoustic model. PL-BERT is only used for prosody and duration prediction in the second stage. This does not work for VITS and FastSpeech 2, though, as both models are end-to-end and do not train an acoustic module first and then a predictor module the way StyleTTS does. A toy illustration of the two-stage split is sketched below.
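To make the split concrete, here is a toy sketch; all the modules below are dummy stand-ins, not the actual code from train_first.py / train_second.py:

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the real StyleTTS modules (hypothetical shapes).
acoustic = nn.Linear(8, 8)    # stage 1: acoustic model
plbert = nn.Linear(8, 16)     # PL-BERT feature extractor (stage 2 only)
predictor = nn.Linear(16, 2)  # stage 2: duration/prosody predictor

x = torch.randn(4, 8)  # pretend phoneme features

# Stage 1 (train_first.py): only the acoustic model is trained;
# PL-BERT never enters the graph, so this stage is independent of it.
stage1_loss = (acoustic(x) - x).pow(2).mean()
stage1_loss.backward()

# Stage 2 (train_second.py): PL-BERT features drive the duration and
# prosody predictor; this is the only stage the README changes touch.
stage2_loss = predictor(plbert(x)).pow(2).mean()
stage2_loss.backward()
```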
The original intention was to copy-paste the provided code snippet after the discriminator line, keeping that line as-is. If you are unsure how to modify the code, you can refer to the zipped file with all the modified code: https://drive.google.com/file/d/18DU4JrW1rhySrIk-XSxZkXt2MuznxoM-/view
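For concreteness, a minimal sketch of what the result should look like; the module classes here are placeholders, not the actual StyleTTS code (the real snippet is in the README and the zipped file above):

```python
from munch import Munch
import torch.nn as nn

# Placeholder stand-ins for the real StyleTTS sub-modules.
decoder = nn.Identity()
text_encoder = nn.Identity()
discriminator = nn.Identity()  # the original models.py#L683 line stays as-is
bert = nn.Identity()           # the new PL-BERT encoder added by the snippet

# The pasted snippet goes AFTER the discriminator instantiation, and the
# resulting Munch still receives the discriminator alongside the new module.
nets = Munch(
    decoder=decoder,
    text_encoder=text_encoder,
    discriminator=discriminator,
    bert=bert,
)
```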
If you are referring to https://nonint.com/static/tortoise_v2_examples.html, I believe StyleTTS is better, but it also depends on the dataset. If you are interested, you can also refer to our latest work, StyleTTS 2, here: https://styletts2.github.io/. The code will be made publicly available by the end of this month.