Open Liujingxiu23 opened 4 years ago
You could check whether the alignment of the Chinese corpus is correct, or whether the audio is processed in the same way.
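A minimal sketch of such an alignment check, assuming each utterance has one duration (in mel frames) per phone. The function name `check_alignment` and its arguments are hypothetical and are not part of the repository's code.

```python
import math


def check_alignment(durations, num_phones, num_mel_frames):
    """Return a list of problems found in one utterance's phone durations."""
    problems = []

    # One duration per input phone is expected.
    if len(durations) != num_phones:
        problems.append(
            f"expected {num_phones} durations, got {len(durations)}"
        )

    # NaN, infinite, or negative values indicate a broken alignment.
    bad = [i for i, d in enumerate(durations) if not math.isfinite(d) or d < 0]
    if bad:
        problems.append(f"non-finite or negative durations at positions {bad}")
    else:
        # The durations should cover the mel spectrogram exactly.
        total = sum(durations)
        if total != num_mel_frames:
            problems.append(
                f"durations sum to {total}, but the mel has {num_mel_frames} frames"
            )
    return problems


# Toy values, made up for illustration only.
print(check_alignment([3, 5, 2], num_phones=3, num_mel_frames=10))
print(check_alignment([3, float("nan"), 2], num_phones=3, num_mel_frames=10))
```

Running this over every utterance before training should show quickly whether the alignment itself is at fault.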
The text features I used before differed from those in the original English version. After switching back to the original text features (one ID per phone), the pronunciation improved a lot. But I still have two questions I need your help with:
- The phone right after sp or sil is not pronounced well. What can I do to solve this problem?
- How can I add other text features, for example prosody level? Should I add another embedding table, fed in the same way as src_seq and src_pos?
I am sorry, but I haven't used phones as input, so I can't answer your first question. As for the second, I know there are many different methods for adding more information to a TTS model, such as gst-tacotron and mellotron. Maybe you can read them for inspiration.
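For the "another embedding table" idea in the question, here is a minimal plain-Python sketch of how it might look, assuming one prosody label per phone. The table sizes, IDs, and the names `make_table`, `embed`, and `prosody_ids` are made up for illustration. In a real model the tables would be learned layers (for example `torch.nn.Embedding`), and the position information from src_pos would still be added as before.

```python
import random


def make_table(num_ids, dim, seed):
    """Build a small random embedding table; row 0 is reserved for padding."""
    rng = random.Random(seed)
    table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_ids)]
    table[0] = [0.0] * dim  # the padding ID maps to the zero vector
    return table


def embed(phone_ids, prosody_ids, phone_table, prosody_table):
    """Look up both tables at each position and add the vectors element-wise."""
    if len(phone_ids) != len(prosody_ids):
        raise ValueError("phone_ids and prosody_ids must have the same length")
    return [
        [p + q for p, q in zip(phone_table[pid], prosody_table[rid])]
        for pid, rid in zip(phone_ids, prosody_ids)
    ]


DIM = 4
phone_table = make_table(num_ids=50, dim=DIM, seed=0)   # hypothetical phone inventory
prosody_table = make_table(num_ids=5, dim=DIM, seed=1)  # 0 = pad, 1..4 = hypothetical levels

phone_ids = [12, 7, 33, 0]    # plays the role of src_seq; last position is padding
prosody_ids = [1, 1, 3, 0]    # the extra input: one prosody label per phone
encoder_input = embed(phone_ids, prosody_ids, phone_table, prosody_table)
print(len(encoder_input), len(encoder_input[0]))
```

Summing keeps the encoder input dimension unchanged, so the rest of the model would not need to change. Concatenating the two vectors is another option, but then the encoder input size has to change too.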
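@Liujingxiu23 Was the alignment information for your Chinese model generated by the PyTorch version of Tacotron? The alignment information I generated with the PyTorch version of Tacotron is all NaN. Do you know why?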
I have the same problem, using my own English data.
@yangchunyong @fnyhy I have interval labels for each wav, which give definite phone time boundaries. I found that this phenomenon does not occur on every TTS dataset: on some datasets the result is OK and the pronunciation is completely right, but on another dataset the pronunciation is sometimes bad. I do not know why, since the processing of the linguistic information as well as the wav is exactly the same.
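For anyone who wants to try interval labels as duration targets, here is a rough sketch of the conversion from time boundaries to per-phone frame counts. The function name `intervals_to_durations`, the default `sample_rate` and `hop_length`, and the example intervals are all assumptions for illustration; they should be replaced with whatever your own mel extraction uses.

```python
def intervals_to_durations(intervals, sample_rate=22050, hop_length=256):
    """Convert (phone, start_sec, end_sec) intervals to (phone, num_frames).

    Each boundary is rounded to a frame index first and the indices are then
    differenced, so for contiguous intervals the durations sum to the frame
    index of the last boundary.
    """
    frames_per_sec = sample_rate / hop_length
    durations = []
    for phone, start, end in intervals:
        if end < start:
            raise ValueError(f"interval for {phone!r} ends before it starts")
        start_frame = round(start * frames_per_sec)
        end_frame = round(end * frames_per_sec)
        durations.append((phone, end_frame - start_frame))
    return durations


# Made-up labels for a single utterance.
intervals = [
    ("sil", 0.00, 0.25),
    ("zh", 0.25, 0.33),
    ("ang1", 0.33, 0.60),
    ("sp", 0.60, 0.70),
]
durations = intervals_to_durations(intervals)
for phone, frames in durations:
    print(phone, frames)

# Phones that end up with zero frames are worth inspecting by hand.
zero_length = [phone for phone, frames in durations if frames == 0]
print("zero-length phones:", zero_length)
```

Rounding the boundaries rather than each duration separately avoids the sum drifting away from the mel length over a long utterance.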
@xcmyz I tried your latest code, and the acoustic quality improved a lot; it is nearly the same as tacotron2, I think. The TTS corpus I use is Chinese, and I kept the default hparams settings. My loss does not seem as good as yours: the postnet-mel-loss converges to about 0.5 and the duration loss to about 0.8. I do not know why. The pronunciation and the tone are also not that good. For example, in some wavs "zhang" is read like "zhan" and "tao3" is read like "tao2". Why does this happen? Do you have any suggestions for solving this problem?