As a neural vocoder, LPCNet gives good quality when inverting ground-truth (GT) mel-spectrograms. In the TTS case, however, my experiments show some noise in the output.
I think the noise comes from a mismatch between the GT acoustic features and the features predicted by the acoustic model. So I tested jointly fine-tuning JDI-T (https://arxiv.org/pdf/2005.07799.pdf), or FastSpeech2, together with end-to-end (e2e) LPCNet, the same way as "joint-ft" in the ESPnet2-TTS paper (https://arxiv.org/pdf/2110.07840.pdf).
However, the results were not good.
Does anyone have experience with jointly training e2e LPCNet?
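
To make the setup described above concrete, here is a toy sketch of the joint fine-tuning idea. It is not LPCNet, JDI-T, or ESPnet code: two scalar weights stand in for the acoustic model and the vocoder, the "ground truth" mappings (`feat = 2 * text`, `wav = 3 * feat`) are invented for illustration, and the names `joint_finetune`, `waveform_error`, and `feat_weight` are hypothetical. The vocoder weight starts at the value that is correct for GT features, the acoustic model starts slightly off (the feature mismatch), and the waveform loss is back-propagated by hand through both.

```python
import random


def joint_finetune(steps=200, lr=0.05, feat_weight=0.0, seed=0):
    """Toy joint fine-tuning with hand-written gradients.

    acoustic model: feat_hat = a * text      (a is imperfect: 1.7 instead of 2.0)
    vocoder:        wav_hat  = v * feat_hat  (v = 3.0, i.e. ideal for GT features)
    """
    rng = random.Random(seed)
    a, v = 1.7, 3.0
    for _ in range(steps):
        text = rng.uniform(-1.0, 1.0)
        feat, wav = 2.0 * text, 6.0 * text  # invented ground truth
        feat_hat = a * text
        wav_hat = v * feat_hat
        # loss = (wav_hat - wav)^2 + feat_weight * (feat_hat - feat)^2
        d_wav = 2.0 * (wav_hat - wav)
        d_feat = 2.0 * feat_weight * (feat_hat - feat)
        grad_v = d_wav * feat_hat
        # the waveform loss reaches the acoustic model through the vocoder
        grad_a = (d_wav * v + d_feat) * text
        a -= lr * grad_a
        v -= lr * grad_v
    return a, v


def waveform_error(a, v, text=1.0):
    """Absolute waveform error of the cascaded toy system for one input."""
    return abs(v * a * text - 6.0 * text)


if __name__ == "__main__":
    print("before:", waveform_error(1.7, 3.0))
    a, v = joint_finetune()
    print("after: ", waveform_error(a, v), "a =", a, "v =", v)
```

With `feat_weight=0.0` only the waveform loss is used, so the intermediate feature is free to drift away from the GT feature as long as the final output matches. Setting `feat_weight` above zero adds an acoustic feature loss that keeps it anchored. Whether that extra term is needed in a real JDI-T or FastSpeech2 plus LPCNet setup is an open question here, not something this toy can answer.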