yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

slmadv using differentiable duration modeling may not be helpful and even bad #146

Closed: liuhuang31 closed this issue 11 months ago

liuhuang31 commented 11 months ago

I am training on Chinese data with the PL-BERT module removed. Training is normal until the stage 2 joint training, which trains slmadv using differentiable duration modeling. At that point the model collapses and the synthesized audio has pronunciation problems.

yl4579 commented 11 months ago

Which SLM model are you using?

liuhuang31 commented 11 months ago

microsoft/wavlm-base-plus

yl4579 commented 11 months ago

Yes, it won't work because it's an English model. See #70.

jarred1989 commented 9 months ago

@liuhuang31 Hi, how did you handle this SLM issue? Also, the StyleTTS 2 audio you shared in #139 sounds good. Is anything different there compared with the setup here?

liuhuang31 commented 9 months ago

@jarred1989 Hi, jarred1989. With the original code's slmadv using differentiable duration modeling, I couldn't get a good result; it doesn't seem to help in my case. So I don't use it, and I changed back to the normal duration modeling as before.
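
For readers unfamiliar with the distinction mentioned above, here is a minimal sketch contrasting "normal" (hard) duration expansion with a soft, differentiable alignment. It is not the StyleTTS 2 code: the function names `hard_expand` and `soft_alignment` are hypothetical, and the soft version uses simple Gaussian upsampling as a stand-in technique to show why a soft alignment lets gradients reach the predicted durations.

```python
import math

def hard_expand(durations):
    """Hard expansion: round each duration and repeat the token index.

    Rounding is not differentiable, so no gradient reaches the durations.
    """
    frames = []
    for token, d in enumerate(durations):
        frames.extend([token] * int(round(d)))
    return frames

def soft_alignment(durations, n_frames, sigma=1.0):
    """Soft alignment (Gaussian upsampling, an illustrative stand-in).

    Each frame gets normalized Gaussian weights over all tokens, centered
    at each token's midpoint. The weights are smooth in the durations.
    """
    centers, end = [], 0.0
    for d in durations:
        centers.append(end + d / 2.0)  # midpoint of this token's span
        end += d
    rows = []
    for t in range(n_frames):
        pos = t + 0.5  # center of frame t
        w = [math.exp(-((pos - c) ** 2) / (2 * sigma ** 2)) for c in centers]
        total = sum(w)
        rows.append([x / total for x in w])  # each row sums to 1
    return rows

durations = [2.0, 3.0, 1.0]  # hypothetical predicted durations, in frames
hard = hard_expand(durations)
soft = soft_alignment(durations, n_frames=len(hard))
```

Here `hard` is a list of token indices per frame, while `soft` is a frame-by-token weight matrix whose rows each sum to 1.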

jarred1989 commented 9 months ago

Got it, thanks!