Open ArnaudWald opened 5 years ago
I am using Tacotron2 to predict 20-dim features for LPCNet, but there is noise in the synthesized audio.
Is there any way to improve the sound quality?
@superhg2012 I get the same problem, did you solve it?
I've tried with current master of tacotron2 and LPCTron but failed.
With an adaptation of my fork using the correct hparams I'm generating high-quality speech: audios.zip
My fork with spanish branch + MlWoo adaption of LPCNet, you need to change your path and symbols, see the commit history: https://github.com/carlfm01/Tacotron-2/tree/spanish
@carlfm01 in your fork, could you let me know how to generate a wav from the f32 features? And is it the same speed as the original LPCNet?
how to generate a wav from the f32 features? And is it the same speed as the original LPCNet?
The tacotron repo predicts the features, not the wav. To generate the wav from the features predicted by tacotron, you need to use the https://github.com/mlwoo/LPCNet fork
And for me, using a sparsity of 200, it is 3x faster than real time with AVX enabled.
@carlfm01 I tried the https://github.com/mlwoo/LPCNet fork already, but the generated wav has too much noise, as I described in https://github.com/MlWoo/LPCNet/issues/6. How did you solve this problem? Any suggestions, please?
Noise using predicted features by tacotron or using the real features?
@carlfm01 using the real features. So I converted a real wav -> (by ./dump_data) s16 -> (by ./test_lpcnet) f32 -> (by ffmpeg) wav, as explained in MlWoo's repo. It is supposed to convert the f32 back to the original wav, but the noise is severe (it contains the original voice, though). Have you experienced this? When you used MlWoo's fork, were speed and audio quality both perfect? If yes, what did you modify from MlWoo's code? Thank you so much for the help.
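For reference, the round-trip described above can be sketched roughly like this; the file names are illustrative, and the dump_data/test_lpcnet invocations follow the LPCNet README conventions (your build may differ):

```shell
# Strip the wav header to raw 16 kHz, 16-bit, mono PCM (illustrative ffmpeg flags)
ffmpeg -i input.wav -f s16le -ar 16000 -ac 1 input.s16

# Extract features from the raw PCM
./dump_data -test input.s16 features.f32

# Re-synthesize raw PCM from the features
./test_lpcnet features.f32 output.s16

# Wrap the raw PCM back into a wav container
ffmpeg -f s16le -ar 16000 -ac 1 -i output.s16 output.wav
```

If the feature extraction and synthesis binaries were built with different taco settings, this round-trip is exactly where the noise would appear.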
were speed and audio quality both perfect
Yes.
What did you modify from MlWoo's code?
Nothing.
My only guess is that maybe you made a mistake compiling your exported weights?
https://github.com/mozilla/LPCNet/issues/58#issuecomment-533470433
Using MlWoo's fork: feature.zip
@carlfm01 Thanks. Let me explain what I did so far in detail.
So now I have two repositories: LPCNet (the original LPCNet repo) and LPCNet_MlWoo.
I trained LPCNet and got the nnet_data* files in the LPCNet/src directory, and I moved all of them to LPCNet_MlWoo/src, because when I tried './dump_lpcnet.py lpcnet15_384_10_G16_64.h5' (in the LPCNet_MlWoo repo) it didn't work (because of some weird model-shape error). (The lpcnet15_384_10_G16_64.h5 model was generated in the original LPCNet repo.)
And then I just ran 'make dump_data taco=1' and 'make test_lpcnet taco=1'.
Do you think this makes sense? (I didn't change any parameters of LPCNet or LPCNet_MlWoo.)
model was generated in original LPCNet repo
That's the issue. I'm afraid you need to retrain using the MlWoo fork; I did not train with LPCNet (this repo).
@carlfm01 but is there any difference between MlWoo's LPCNet training code and the original LPCNet's training code? Aren't they exactly the same?
@carlfm01 so you did everything (training LPCNet, running inference on the audio, etc.) in MlWoo's repo, right? Which hyperparameters/options did you change?
but is there any difference between MlWoo's LPCNet training code and the original LPCNet's training code? Aren't they exactly the same?
No, otherwise you would be able to load models in both. I also tried, and it threw an error about a missing layer or an extra layer, I can't recall. The inference code is also different.
so you did everything (training LPCNet, running inference on the audio, etc.) in MlWoo's repo, right?
Yes, default.
The only thing that I changed was the training code to load checkpoints and adapt on new data.
This is missing in LPCNet_MlWoo:
https://github.com/mozilla/LPCNet/blob/master/src/train_lpcnet.py#L106-L125
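The linked lines add checkpoint loading so training can resume, or adapt an existing model on new data. A minimal sketch of that pattern; the helper name and file pattern here are my own, illustrative only, not the fork's actual code:

```python
import glob
import os

def latest_checkpoint(ckpt_dir, pattern="lpcnet*.h5"):
    """Return the most recently modified checkpoint file in ckpt_dir, or None."""
    files = glob.glob(os.path.join(ckpt_dir, pattern))
    return max(files, key=os.path.getmtime) if files else None

# Typical use before training (model construction elided):
# ckpt = latest_checkpoint("checkpoints")
# if ckpt is not None:
#     model.load_weights(ckpt)  # resume, or adapt on new data
# model.fit(...)
```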
@carlfm01 okay, thank you so much. I will try. And you're saying that when merging Tacotron2 + LPCNet, I'd better use your spanish fork for tacotron2, right?
Yes, just change your paths and symbols; see the commit history to understand better. I've tried LPCTron and the tacotron master, but both failed, generating noisy speech.
@carlfm01 thank you so much 🙏. Wish you all the best. I will write to you again when I have other questions.
And share your results! 👍
@carlfm01 Hi, I followed all your instructions (retrained from MlWoo's repo) and have now trained 6 epochs as a test. The original wav is about 3 seconds long, but the generated audio is about 8 seconds long. Have you experienced this problem?
Hello, no, I'm getting the same duration. Is it from real features?
@carlfm01 yes, real features. Also, './test_lpcnet ~.h5' ran fine. This issue is strange... I'll look into it more. Thanks!
@carlfm01 Are the sample rate, precision, and sample encoding of your training wav files 16000 Hz, 16-bit, signed integer PCM?
Please make sure you use 'make test_lpcnet taco=1' if you extracted the features with taco enabled on ./dump_data, or disable taco for both.
Yes, 16000, 16bit, mono
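The format check above can be automated with only the Python standard library; the helper name here is made up for illustration:

```python
import wave

def is_lpcnet_ready(path):
    """True if the wav file is 16 kHz, 16-bit, mono, as discussed in this thread."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getsampwidth() == 2   # 2 bytes per sample = 16-bit
                and w.getnchannels() == 1)  # mono
```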
@carlfm01 I just ran both 'make dump_data taco=1' and 'make test_lpcnet taco=1', so they are both up-to-date.
What about the quality? Do you get the same result cleaning and testing without taco? Please also make sure you do 'make clean'.
@carlfm01 If I want to do them without taco, should I run 'make dump_data' and 'make test_lpcnet' instead of 'make dump_data taco=1' and 'make test_lpcnet taco=1'?
And yes, I think I did make clean.
If I want to do them without taco, should I run 'make dump_data' and 'make test_lpcnet' instead of 'make dump_data taco=1' and 'make test_lpcnet taco=1'?
Yes.
@carlfm01 It works now, incredible. The problem was that I didn't do make clean at the very first step. The generated audio samples are extremely clean and inference speed is much faster than real time. I will upload test results here in a few minutes. The only suspicious thing is that this works perfectly even with just 6 epochs of training... Thank you so much.
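So the fix boils down to rebuilding both tools from scratch with a consistent taco flag, roughly:

```shell
make clean
make dump_data taco=1
make test_lpcnet taco=1
```

Without the clean step, object files built with a different taco setting can linger and silently mismatch the feature layout.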
@carlfm01 Here are the result samples. If there is something strange, please let me know! samples.zip
@carlfm01 As you told me yesterday, I have to use https://github.com/carlfm01/Tacotron-2/tree/spanish for tacotron2 training.
But when I save the f32 features into an npy file, do I really need to resize? In here:

mel_target = np.fromfile(os.path.join(self._mel_dir, meta[0]), dtype='float32')
mel_target = np.resize(mel_target, (-1, self._hparams.num_mels))

why can't it just be

mel_target = np.reshape(mel_target, (-1, self._hparams.num_mels)) ?
And which one did you use?
@carlfm01 Here are the result samples. If there is something strange, please let me know! samples.zip
Sounds really good.
It was the recommendation from MlWoo's Readme
https://github.com/mlwoo/LPCNet
Since reshape and resize have different behavior, I don't know the implications of changing to reshape.
mel_target = np.fromfile(os.path.join(self._mel_dir, meta[0]), dtype='float32')
mel_target = np.resize(mel_target, (-1, self._hparams.num_mels))
And which one did you use?
resize.
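For anyone curious, the behavioral difference between the two is easy to demonstrate; num_mels is shrunk to 4 here purely for illustration:

```python
import numpy as np

num_mels = 4  # stand-in for hparams.num_mels (20 in this thread)
buf = np.arange(10, dtype=np.float32)  # length is NOT a multiple of num_mels

# np.reshape demands that the element count match the target shape exactly:
try:
    np.reshape(buf, (-1, num_mels))
except ValueError:
    print("reshape refuses a buffer whose length is not a multiple of num_mels")

# np.resize silently repeats (or truncates) the data to fill the target shape,
# so a trailing partial frame is padded with samples wrapped from the start:
r = np.resize(buf, (3, num_mels))
print(r[-1])  # [8. 9. 0. 1.] -- the last row wraps around to the buffer's start
```

So with resize a truncated feature file still loads, at the cost of a possibly garbage final frame; reshape would instead fail loudly.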
@carlfm01 thank you. I have one last question... Is it not possible to generate the 2 pitch parameters from the spectrogram purely through signal processing (not machine learning)? If it is possible, we could just train tacotron2 to output a spectrogram (the setting of the original paper), convert it to 18 BFCC + 2 pitch parameters through signal processing, and then generate the wav. (I know the current objective is to train tacotron2 to output the f32 features for now.)
Sorry, applying changes by signal-processing is out of my knowledge.
@carlfm01 okay, thanks!
@carlfm01 Hi, I'm back. I'm trying TTS. I use taco2 to predict 20-dim features and then convert the 20-dim to 55-dim .f32 features with zero padding. But the audio synthesized with LPCNet is all silence. Do you know why, and how to calculate the 55-dim features from the 20-dim predicted features? Thank you.
to 55-dim .f32 features with zero padding
Hi, why do you want to do that? If you enabled taco=1 you don't need 55d but 20d; if you are using my fork, the generated 20d are ready to feed into LPCNet with taco=1 enabled.
Ohhh I see @dalvlv, you are not the same person; please read the conversation about my fork.
@carlfm01 OK, let me have a try. Could you give a link of your forked repo?
I've tried with current master of tacotron2 and LPCTron but failed.
With an adaptation of my fork using the correct hparams I'm generating high-quality speech: audios.zip
My fork with spanish branch + MlWoo adaption of LPCNet, you need to change your path and symbols, see the commit history: https://github.com/carlfm01/Tacotron-2/tree/spanish
@dalvlv please read the whole thread
@carlfm01 I tried the MlWoo repo to synthesize audio with taco2-predicted features, but it has too much noise. I use my own trained LPCNet model. Do I need to retrain LPCNet using this repo? Or maybe my taco2 training has some problem? test-out.zip
You need to retrain both
@carlfm01 Hi, I have solved my problem; it was caused by a data-type conversion issue. Thank you all the same for your kind help.
Hello @dalvlv, any news?
Hello @carlfm01 , how are you? hope you are doing fine. I have a simple quick question: How many epochs did you train your fork of tacotron2 (maybe this one? https://github.com/carlfm01/Tacotron-2/tree/spanish) ? LPCNet is okay now, but sound quality of trained Tacotron2+LPCNet is very poor. :(
Hello @carlfm01 , how are you?
Really good thanks,
How many epochs did you train your fork of tacotron2
About 47k steps.
but sound quality of trained Tacotron2+LPCNet is very poor
Did you make sure you used 16 kHz, 16-bit mono while removing the headers and extracting the features?
@carlfm01 yes, we are using 16 kHz 16-bit mono for training. Seems strange... The num_mels of the features (for LPCNet input) is 20, right?
num-mel of features (for LPCNet input) are all 20 right
Yes, all the hparams are correct. Can you share an example of the audio?
Can you test this audio file for inference? https://gist.github.com/carlfm01/5d6ad719810412934d57bdbe1ce8b5b6
1010.zip @carlfm01 here are the samples. Also, all hyperparameters should be exactly the same as in your code, right? Do you mean to try this code https://gist.github.com/carlfm01/5d6ad719810412934d57bdbe1ce8b5b6 for inference?
Hello,
I would like to connect a Tacotron2 model to LPCNet. Is there a way to convert the 80 mel coefficients (the output of Taco2) into the 18 Bark-scale + 2 pitch parameters (the input of LPCNet)?
And somewhat related: when reading about the Bark scale, like here on Wikipedia, there are usually 24 coefficients, and I don't understand how only 18 are computed here. Even taking the 16 kHz sampling into account, that would leave 22 of them, right?
Thanks a lot :)