xiph / LPCNet

Efficient neural speech synthesis
BSD 3-Clause "New" or "Revised" License
1.13k stars · 295 forks

Using with Tacotron2 #52

Open ArnaudWald opened 5 years ago

ArnaudWald commented 5 years ago

Hello,

I would like to connect a Tacotron2 model to LPCNet. Is there a way to convert the 80 mel coefficients (the output of Taco2) into the 18 Bark-scale coefficients + 2 pitch parameters (the input of LPCNet)?

And somewhat related: when reading about the Bark scale, for example on Wikipedia, there are usually 24 coefficients, and I don't understand why only 18 are computed here. Even taking into account the 16 kHz sampling, that would still leave 22 of them, right?

Thanks a lot :)
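(For reference on the band count: a quick sketch using Zwicker's Bark approximation, not code from LPCNet, showing how many critical bands fall below the 8 kHz Nyquist frequency of 16 kHz audio. As far as I can tell, LPCNet's 18 bands are its own coarser band layout rather than the literal Bark count.)

```python
import math

def hz_to_bark(f_hz):
    """Zwicker's approximation of the Bark critical-band scale."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

# Nyquist frequency for 16 kHz audio is 8 kHz:
print(hz_to_bark(8000.0))  # roughly 21.3, i.e. about 21 to 22 critical bands
```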

superhg2012 commented 5 years ago

I am using Tacotron2 to predict 20-dim features for LPCNet, but there is noise in the synthesized audio.

lyz04551 commented 5 years ago

I am using Tacotron2 to predict 20-dim features for LPCNet, but there is noise in the synthesized audio.

Is there any way to improve the sound quality?

byuns9334 commented 5 years ago

@superhg2012 I get the same problem, did you solve it?

carlfm01 commented 5 years ago

I've tried the current master of Tacotron-2 and LPCTron, but both failed.

With an adaptation of my fork using the correct hparams, I'm generating high-quality speech: audios.zip

My fork (spanish branch) plus MlWoo's adaptation of LPCNet; you need to change your paths and symbols, see the commit history: https://github.com/carlfm01/Tacotron-2/tree/spanish

byuns9334 commented 5 years ago

@carlfm01 In your fork, could you let me know how to generate a wav from the f32 features? And is it the same speed as the original LPCNet?

carlfm01 commented 5 years ago

how to generate a wav from the f32 features? And is it the same speed as the original LPCNet?

The Tacotron repo predicts the features, not the wav. To generate the wav from the features predicted by Tacotron, you need to use the https://github.com/mlwoo/LPCNet fork.

And for me, with a sparsity of 200, it is 3x faster than real time with AVX enabled.

byuns9334 commented 5 years ago

@carlfm01 I tried https://github.com/mlwoo/LPCNet fork already, but it generates wav too much noise, as I described in https://github.com/MlWoo/LPCNet/issues/6. How did you solve this problem? any suggestions please?

carlfm01 commented 5 years ago

Noise using the features predicted by Tacotron, or using the real features?

byuns9334 commented 5 years ago

@carlfm01 Using the real features. I converted a real wav to raw s16, ran ./dump_data to get the f32 features, ran ./test_lpcnet to synthesize audio, and converted back to wav with ffmpeg, as explained in MlWoo's repo. This should reconstruct the original wav, but the noise is severe (the original voice is audible underneath). Have you experienced this? When you used MlWoo's fork, were speed and audio quality both perfect? If yes, what did you modify in MlWoo's code? Thank you so much for the help.
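(Since the raw-sample step above is a common stumbling block: a minimal stdlib sketch, with illustrative file names, of stripping the RIFF header from a 16 kHz / 16-bit / mono WAV to get the raw .s16 stream dump_data expects, and wrapping a header back on to listen to test_lpcnet output, as an alternative to ffmpeg.)

```python
import wave

def wav_to_s16(wav_path, s16_path):
    """Strip the WAV header; write raw 16-bit little-endian PCM."""
    with wave.open(wav_path, "rb") as w:
        assert w.getframerate() == 16000 and w.getsampwidth() == 2 and w.getnchannels() == 1
        pcm = w.readframes(w.getnframes())
    with open(s16_path, "wb") as f:
        f.write(pcm)

def s16_to_wav(s16_path, wav_path):
    """Wrap raw 16 kHz / 16-bit / mono PCM back into a WAV container."""
    with open(s16_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(16000)
        w.writeframes(pcm)
```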

carlfm01 commented 5 years ago

were speed and audio quality both perfect

Yes.

What did you modify from MlWoo's code?

Nothing.

My only guess is that you may have made a mistake compiling your exported weights?

https://github.com/mozilla/LPCNet/issues/58#issuecomment-533470433

carlfm01 commented 5 years ago

Using MlWoo's fork: feature.zip

byuns9334 commented 5 years ago

@carlfm01 Thanks. Let me explain what I did so far in detail.

So now I have two repositories: LPCNet (the original LPCNet repo) and LPCNet_MlWoo.

I trained LPCNet and got the nnet_data* files in the LPCNet/src directory, and I moved them all to LPCNet_MlWoo/src, because when I tried './dump_lpcnet.py lpcnet15_384_10_G16_64.h5' in the LPCNet_MlWoo repo, it failed with a model-shape error (the lpcnet15_384_10_G16_64.h5 model was generated in the original LPCNet repo).

Then I just ran 'make dump_data taco=1' and 'make test_lpcnet taco=1'.

Does this make sense? (I didn't change any parameters of LPCNet or LPCNet_MlWoo.)

carlfm01 commented 5 years ago

model was generated in original LPCNet repo

That's the issue; I'm afraid you need to retrain using the MlWoo fork. I did not train with LPCNet (this repo).

byuns9334 commented 5 years ago

@carlfm01 But is there any difference between MlWoo's LPCNet training code and the original LPCNet's training code? Aren't they exactly the same?

byuns9334 commented 5 years ago

@carlfm01 So you did everything (training LPCNet, running inference, etc.) in MlWoo's repo, right? Which hyperparameters/options did you change?

carlfm01 commented 5 years ago

But is there any difference between MlWoo's LPCNet training code and the original LPCNet's training code? Aren't they exactly the same?

No, otherwise you would be able to load models with both. I also tried, and it threw an error about a missing layer or an extra layer, I can't recall. The inference code is also different.

So you did everything (training LPCNet, running inference, etc.) in MlWoo's repo, right?

Yes, all defaults.

The only thing I changed was the training code, to load checkpoints and adapt on new data.

This is missing in LPCNet_MlWoo:

https://github.com/mozilla/LPCNet/blob/master/src/train_lpcnet.py#L106-L125

byuns9334 commented 5 years ago

@carlfm01 Okay, thank you so much, I will try. And you're suggesting that when combining Tacotron2 + LPCNet, I'd better use your spanish fork for Tacotron2, right?

carlfm01 commented 5 years ago

@carlfm01 Okay, thank you so much, I will try. And you're suggesting that when combining Tacotron2 + LPCNet, I'd better use your spanish fork for Tacotron2, right?

Yes, just change your paths and symbols; see the commit history to understand it better. I've tried LPCTron and the Tacotron master, but both failed, generating noisy speech.

byuns9334 commented 5 years ago

@carlfm01 Thank you so much 🙏. Wish you all the best. I'll write again when I have other questions.

carlfm01 commented 5 years ago

And share your results! 👍

byuns9334 commented 5 years ago

@carlfm01 Hi, I followed all your instructions (retraining from MlWoo's repo) and have now trained 6 epochs as a test. The original wav is about 3 seconds long, but the generated audio is about 8 seconds long. Have you experienced this problem?

carlfm01 commented 5 years ago

Hello, no, I'm getting the same duration. Is it from real features?

byuns9334 commented 5 years ago

@carlfm01 Yes, real features. I also ran './test_lpcnet ~.h5' fine. This issue is strange... I'll look into it more. Thanks!

byuns9334 commented 5 years ago

@carlfm01 Are the sample rate, precision, and sample encoding of your training wav files 16000 Hz, 16-bit signed integer PCM?

carlfm01 commented 5 years ago

Please make sure you use 'make test_lpcnet taco=1' if you extracted the features with taco enabled on ./dump_data, or disable taco for both.

Yes, 16000 Hz, 16-bit, mono.

byuns9334 commented 5 years ago

@carlfm01 I just ran both 'make dump_data taco=1' and 'make test_lpcnet taco=1', so they are both up-to-date.

carlfm01 commented 5 years ago

What about the quality? Do you get the same result cleaning and testing without taco? Please also make sure you run make clean.

byuns9334 commented 5 years ago

@carlfm01 If I want to run them without taco, should I run 'make dump_data' and 'make test_lpcnet' instead of 'make dump_data taco=1' and 'make test_lpcnet taco=1'?

And yes, I think I did run make clean.

carlfm01 commented 5 years ago

If I want to run them without taco, should I run 'make dump_data' and 'make test_lpcnet' instead of 'make dump_data taco=1' and 'make test_lpcnet taco=1'?

Yes.

byuns9334 commented 5 years ago

@carlfm01 It works now. Incredible. The problem was that I didn't run make clean at the very first step. The generated audio samples are extremely clean and inference is much faster than real time. I'll upload test results here in a few minutes. The only suspicious thing is that it works perfectly even with only 6 epochs of training... Thank you so much.
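(To summarize the fix as a sketch: rebuild from clean whenever the taco flag changes, and keep dump_data and test_lpcnet built with the same flag. The make targets are the ones from this thread; the dump_data/test_lpcnet invocations and file names follow the README's usual pattern and are illustrative, not exact.)

```python
import subprocess

def build_cmds(taco=True):
    """Command sequence for a clean rebuild plus a feature round-trip."""
    flag = ["taco=1"] if taco else []
    return [
        ["make", "clean"],                 # skipping this left stale objects behind
        ["make", "dump_data"] + flag,      # feature extractor
        ["make", "test_lpcnet"] + flag,    # synthesis, built with the SAME flag
        ["./dump_data", "-test", "input.s16", "features.f32"],
        ["./test_lpcnet", "features.f32", "out.s16"],
    ]

def run_all(taco=True):
    for cmd in build_cmds(taco):
        subprocess.run(cmd, check=True)
```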

byuns9334 commented 5 years ago

@carlfm01 Here are the result samples. If there is something strange, please let me know! samples.zip

byuns9334 commented 5 years ago

@carlfm01 As you told me yesterday, I have to use https://github.com/carlfm01/Tacotron-2/tree/spanish for Tacotron2 training.

But when I save the f32 features into an npy file, do I really need to resize them? Why can't it just be a reshape here?

mel_target = np.fromfile(os.path.join(self._mel_dir, meta[0]), dtype='float32')
mel_target = np.resize(mel_target, (-1, self._hparams.num_mels))

Why can't it just be:
mel_target = np.fromfile(os.path.join(self._mel_dir, meta[0]), dtype='float32')
mel_target = np.reshape(mel_target, (-1, self._hparams.num_mels))
?

And which one did you use?

carlfm01 commented 5 years ago

@carlfm01 Here are the result samples. If there is something strange, please let me know! samples.zip

Sounds really good.

It was a recommendation from MlWoo's Readme:

https://github.com/mlwoo/LPCNet

Since resize and reshape behave differently, I don't know the implications of changing to reshape.

mel_target = np.fromfile(os.path.join(self._mel_dir, meta[0]), dtype='float32')
mel_target = np.resize(mel_target, (-1, self._hparams.num_mels))

And which one did you use?

resize.
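(A generic NumPy illustration of the difference, unrelated to either repo's code: np.resize silently truncates, or repeats, the data to fit the requested shape, while np.reshape raises if the sizes don't match exactly, so the two only agree when the file length is an exact multiple of num_mels.)

```python
import numpy as np

num_mels = 20
a = np.arange(45, dtype='float32')   # 45 floats: not a multiple of 20

# reshape demands an exact fit and raises ValueError otherwise
try:
    a.reshape(-1, num_mels)
except ValueError:
    print("reshape refuses a partial frame")

# resize to an explicit shape silently truncates...
b = np.resize(a, (2, num_mels))
print(b[-1, -1])   # 39.0: the last 5 values were dropped

# ...or repeats values from the start to fill extra space
c = np.resize(a, (3, num_mels))
print(c[-1, -1])   # 14.0: flat index 59 wrapped around (59 % 45 == 14)
```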

byuns9334 commented 5 years ago

@carlfm01 Thank you. I have one last question... Is it possible to generate the 2 pitch parameters from the spectrogram through signal processing alone (no machine learning)? If so, we could just train Tacotron2 to output a spectrogram (the original paper's setting), convert it to 18 BFCC + 2 pitch parameters through signal processing, and then generate the wav. (I know the current objective is to train Tacotron2 to output the f32 features directly.)

carlfm01 commented 5 years ago

Sorry, deriving those parameters by signal processing is outside my knowledge.

byuns9334 commented 5 years ago

@carlfm01 okay, thanks!

dalvlv commented 5 years ago

@carlfm01 Hi, I'm back. I'm trying TTS. I use Taco2 to predict 20-dim features and then convert them to 55-dim .f32 features with zero padding. But the audio synthesized with LPCNet is all silence. Do you know why, and how to calculate the 55-dim features from the 20-dim predicted features? Thank you.

carlfm01 commented 5 years ago

to 55-dim .f32 features with zero padding

Hi, why do you want to do that? If you enable taco=1 you don't need 55d, just 20d; if you are using my fork, the generated 20d features are ready to feed into LPCNet with taco=1 enabled.
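(A sketch of the frame-size mismatch, with the sizes taken from this thread and the slot layout assumed: in the taco=1 build each .f32 frame is 20 floats, 18 BFCC plus 2 pitch parameters, while the default build expects the full 55-float frame, whose remaining slots carry real values in genuine features. Zero-padding therefore produces shape-compatible but meaningless frames; writing the 20-d frames as-is and building with taco=1 is the working path.)

```python
import numpy as np
import os, tempfile

NB_TACO = 20   # taco=1 frame: 18 BFCC + pitch period + pitch correlation
NB_FULL = 55   # default dump_data frame width

# Stand-in for Tacotron2-predicted features: 100 frames of 20 floats.
pred = np.random.rand(100, NB_TACO).astype('float32')

# The zero-padding attempt: right shape for the default build, but the
# 35 padded slots are zeros where real features carry nonzero values.
padded = np.pad(pred, ((0, 0), (0, NB_FULL - NB_TACO)))
assert padded.shape == (100, NB_FULL)

# Working path: keep the 20-d frames and build LPCNet with taco=1.
path = os.path.join(tempfile.mkdtemp(), 'features.f32')
pred.tofile(path)
```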

carlfm01 commented 5 years ago

Ohhh I see @dalvlv, you are not the same person; please read the conversation about my fork.

dalvlv commented 5 years ago

@carlfm01 OK, let me have a try. Could you give me a link to your forked repo?

carlfm01 commented 5 years ago

I've tried the current master of Tacotron-2 and LPCTron, but both failed.

With an adaptation of my fork using the correct hparams, I'm generating high-quality speech: audios.zip

My fork (spanish branch) plus MlWoo's adaptation of LPCNet; you need to change your paths and symbols, see the commit history: https://github.com/carlfm01/Tacotron-2/tree/spanish

@dalvlv please read the whole thread

dalvlv commented 5 years ago

@carlfm01 I tried the MlWoo repo to synthesize audio with Taco2-predicted features, but there is too much noise. I used my own trained LPCNet model. Do I need to retrain LPCNet using that repo? Or maybe my Taco2 training has a problem? test-out.zip

carlfm01 commented 5 years ago

You need to retrain both.

dalvlv commented 5 years ago

@carlfm01 Hi, I have solved my problem; it was caused by a data-type conversion. Thank you all the same for your kind help.

carlfm01 commented 4 years ago

Hello @dalvlv, any news?

byuns9334 commented 4 years ago

Hello @carlfm01, how are you? I hope you are doing fine. I have a simple, quick question: how many epochs did you train your fork of Tacotron2 (maybe this one? https://github.com/carlfm01/Tacotron-2/tree/spanish)? LPCNet is okay now, but the sound quality of the trained Tacotron2+LPCNet is very poor. :(

carlfm01 commented 4 years ago

Hello @carlfm01 , how are you?

Really good thanks,

How many epochs did you train your fork of tacotron2

About 47k steps.

but sound quality of trained Tacotron2+LPCNet is very poor

Did you make sure to use 16 kHz 16-bit mono while removing the headers and extracting the features?

byuns9334 commented 4 years ago

@carlfm01 Yes, we are using 16 kHz 16-bit mono for training. Seems strange... The num_mels of the features (the LPCNet input) is 20, right?

carlfm01 commented 4 years ago

The num_mels of the features (the LPCNet input) is 20, right?

Yes, all the hparams are correct. Can you share an example of the audio?

Can you test this audio file for inference? https://gist.github.com/carlfm01/5d6ad719810412934d57bdbe1ce8b5b6

byuns9334 commented 4 years ago

1010.zip @carlfm01 Here are the samples. Also, all hyperparameters should be exactly the same as in your code, right? And do you mean I should try this for inference: https://gist.github.com/carlfm01/5d6ad719810412934d57bdbe1ce8b5b6 ?