xiph / LPCNet

Efficient neural speech synthesis
BSD 3-Clause "New" or "Revised" License

Using with Tacotron2 #52

Open ArnaudWald opened 5 years ago

ArnaudWald commented 5 years ago

Hello,

I would like to connect a Tacotron2 model to LPCNet. Is there a way to convert the 80-mel coefficients (output of Taco2) into the 18 Bark scale + 2 pitch parameters (input of LPCNet) ?

And somewhat related: when reading about the Bark scale, for example on Wikipedia, there are usually 24 bands, and I don't understand why only 18 are computed here. Even taking into account the 16 kHz sampling, that would still leave 22 of them, right?

Thanks a lot :)

carlfm01 commented 5 years ago

Do you mean to try this code https://gist.github.com/carlfm01/5d6ad719810412934d57bdbe1ce8b5b6 for inference?

Yes, for some voices it gets better

Your audio sounds like an attention issue from Tacotron 2; can you share the attention plot?

carlfm01 commented 5 years ago

[attention plot: step-32500-align]

Here's my alignment plot as reference for 32k steps

byuns9334 commented 5 years ago

@carlfm01 we are not outputting attention alignments for now, so I will let you know again when we finish re-training tacotron2.

byuns9334 commented 5 years ago

@carlfm01 thank you so much for help again !

byuns9334 commented 5 years ago

@carlfm01 everything worked fine. Thank you so much. Here is the attention plot: step-30000-align

and here are the samples from our tacotron2 + lpcnet:

1011.zip

hope you have a great day !

carlfm01 commented 5 years ago

Nice! Sounds good. I think it can be better by trimming silence and playing around with your lr and lr decay, or scheduled mode (I want to try it soon).

Here's a new speaker adapted from the old model. voice adapt.zip
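
Not part of either repo, but if it helps, silence trimming can be done with a small librosa script before feature extraction (a minimal sketch; the 30 dB threshold is an arbitrary choice):

```python
# Sketch: trim leading/trailing silence from a training wav before dump_data.
# librosa/soundfile are assumptions here, not part of the LPCNet pipeline.
import librosa
import soundfile as sf

def trim_silence(in_path, out_path, top_db=30):
    y, sr = librosa.load(in_path, sr=16000)            # LPCNet expects 16 kHz mono
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    sf.write(out_path, y_trimmed, sr)
```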

byuns9334 commented 5 years ago

@carlfm01 Thanks for sharing samples! But what do you mean by 'new speaker' and 'voice adapt'? Could you please explain in more detail?

Also, have you tried multi-speaker Tacotron2 + LPCNet? (I've been trying single-speaker Tacotron2 + LPCNet so far, which eventually works fine.) Did it work well?

Maxxiey commented 5 years ago

@byuns9334 Hi, in hparams.py (carlfm01's Tacotron2 repo) you may find this:

tacotron_fine_tuning = False

and according to its comment:

Set to True to freeze encoder and only keep training pretrained decoder. Used for speaker adaptation with small data.

I don't have time to give it a try, but I think fine-tuning just a few layers of the network could save you a lot of time when you already have a good model.

hope it helps :D

carlfm01 commented 5 years ago

but what do you mean by 'new speaker' and 'voice adapt'? Could you please explain in more detail?

Means fine-tuning a trained model on new speaker voice.

Also, Have you tried multi-speaker Tacotron2 + LPCNet?

No

tacotron_fine_tuning = False

The code to stop the gradient is broken: if you try to fine-tune a model saved with fine_tuning=True it will fail. It needs a review :)
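
For anyone who wants to work around it in the meantime: this is not the repo's code, just a generic TF1-style sketch of freezing the encoder by keeping its variables out of the optimizer's var_list (the scope-name filter is an assumption):

```python
# Hypothetical sketch, not the repo's fine-tuning path: update only non-encoder
# variables instead of relying on the broken stop-gradient flag.
import tensorflow as tf

def build_finetune_op(loss, learning_rate=1e-4):
    """Return a train op that leaves encoder weights untouched."""
    all_vars = tf.trainable_variables()
    # Assumes encoder variables have "encoder" somewhere in their scope name.
    decoder_vars = [v for v in all_vars if "encoder" not in v.name]
    optimizer = tf.train.AdamOptimizer(learning_rate)
    return optimizer.minimize(loss, var_list=decoder_vars)
```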

dalvlv commented 5 years ago

@carlfm01 Hi, I was away on holiday and have now come back.

synthesis_sample.zip

byuns9334 commented 5 years ago

@Maxxiey thanks, I will give it a try and tell you about the result

@carlfm01 okay I understand. I have some questions:

  1. How much time does it take to adapt to new voice? (time for training and inference)

  2. Is adapting to several people's voices possible? (like 2~3 people, or 100 people)

Thanks!

carlfm01 commented 5 years ago

Hello, sorry for the delay.

* Sounds very good.

synthesis_sample.zip

Yeah, sounds good.

* Do you have any idea to optimize that?

Write your own code to load the model once and reuse the session?
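
A minimal sketch of what that could look like (TF1-style; the checkpoint path and tensor names here are placeholders, not the actual names in the Tacotron-2 graph):

```python
# Hypothetical sketch: restore the Tacotron-2 checkpoint once and keep the
# session alive, instead of rebuilding the graph for every sentence.
import tensorflow as tf

sess = tf.Session()
saver = tf.train.import_meta_graph("taco2_model.ckpt.meta")   # assumed checkpoint path
saver.restore(sess, "taco2_model.ckpt")

graph = tf.get_default_graph()
inputs = graph.get_tensor_by_name("inputs:0")                 # assumed tensor name
mel_outputs = graph.get_tensor_by_name("mel_outputs:0")       # assumed tensor name

def synthesize(text_ids):
    # Reuses the already-restored session for every call.
    return sess.run(mel_outputs, feed_dict={inputs: [text_ids]})
```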

1. Taco2 synthesis speed. For now, I think the Taco2 synthesis is slower than LPCNet and I want to make it faster.

I think if you want to make it fast you need to use C++ and take advantage of the optimizations.

1. How much time does it take to adapt to new voice? (time for training and inference)

Inference will be the same as for a single speaker, and adaptation takes less than about 10k steps, depending on your data.

2. Is adapting to several people's voices possible? (like 2~3 people, or 100 people)

Adapting with 2-3 speakers works fine. I don't know about adapting with more speakers.

byuns9334 commented 5 years ago

@carlfm01 thanks!

Why do you think LPCNet is so fast and memory-light? I've been thinking about this and would like to know your opinion too. Thanks!

carlfm01 commented 5 years ago

Why do you think LPCNet is so fast and memory-light?

Read Section 3.5, Sparse Matrices: https://jmvalin.ca/papers/lpcnet_icassp2019.pdf
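
In short, that section says the large recurrent weight matrices of the main GRU are trained to be block-sparse (16x1 blocks, with the diagonal always kept), so inference only has to compute the few surviving blocks. A toy numpy illustration of block sparsification, not LPCNet's actual training code:

```python
# Toy sketch: keep only the highest-magnitude 16x1 blocks of a weight matrix.
# (The real LPCNet sparsification also always keeps the diagonal blocks.)
import numpy as np

def block_sparsify(W, density=0.1, block=(16, 1)):
    rows, cols = W.shape
    bh, bw = block
    # Mean |weight| of every bh x bw block.
    scores = np.abs(W).reshape(rows // bh, bh, cols // bw, bw).mean(axis=(1, 3))
    k = max(1, int(density * scores.size))
    thresh = np.sort(scores, axis=None)[-k]
    mask = (scores >= thresh).repeat(bh, axis=0).repeat(bw, axis=1)
    return W * mask  # ~90% of the multiply-accumulates can now be skipped
```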

Maxxiey commented 5 years ago

I found something strange and interesting: for the same sentence, the generated wav files sound slightly different. You may not understand Mandarin Chinese, but the durations of some syllables vary quite a lot. Attached are some samples my model generated.

My understanding is that during the inference phase of Tacotron-2 and LPCNet there shouldn't be any random elements involved. So how does one explain the differences between these wav files?

samples.zip

Has anyone run into the same situation? Does that mean the model I trained is unstable? Oh, Tacotron has now run for 200k steps (I use this repo https://github.com/carlfm01/Tacotron-2/tree/spanish) and LPCNet for just 10 epochs (https://github.com/MlWoo/LPCNet).

m-toman commented 5 years ago

There are at least two reasons. For one, there are usually sampling processes involved in the models. And probably the bigger issue is that Tacotron uses dropout at inference time. In the paper this is touted as an advantage to create more variation, but in fact it seems to not work really well without it. You can find some workarounds floating around, but they come with their own disadvantages.

Edit: I don't know if your implementation uses one of those workarounds. But see for example here https://github.com/Rayhane-mamah/Tacotron-2/blob/ab5cb08a931fc842d3892ebeb27c8b8734ddd4b8/tacotron/models/modules.py#L247
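
For context, the linked prenet applies dropout even at synthesis time; paraphrased (not a verbatim copy of that file), it is roughly:

```python
# Paraphrased sketch of a Tacotron-2 prenet: dropout stays on at inference,
# which is the sampling step that makes two runs of the same text differ.
import tensorflow as tf

def prenet(x, layer_sizes=(256, 256), drop_rate=0.5):
    for size in layer_sizes:
        x = tf.layers.dense(x, units=size, activation=tf.nn.relu)
        x = tf.layers.dropout(x, rate=drop_rate, training=True)  # always active
    return x
```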

snakers4 commented 5 years ago

Hi guys,

I tried to read the majority of the above posts, and on the surface it seems it is just a matter of using dump_data features with Tacotron, with or without re-training the vocoder.

But do you know if there exists a fully Python version of dump_data? Our dataset is very big, and it usually works much better when you sample data on the fly.

byuns9334 commented 5 years ago

@carlfm01 Hi, how are you? :)

Have you tried training LPCNet with data containing audio from multiple speakers rather than a single speaker? I've tried this for 120 epochs and the output quality is very bad.

Also... when merging with Tacotron2, do you have any idea how to train LPCNet with several feature files (.f32), not just one? We want to train LPCNet on features generated from Tacotron2, and the .f32 of a concatenated wav file is not just the concatenation of each wav's .f32, so we are trying to figure out how to train with multiple .f32 files.

byuns9334 commented 4 years ago

@carlfm01 When I tried to train LPCNet with much larger data than before (around 20 GB), it fails with a "cannot reshape into (x, 55)" error.

The reason is in src/train_lpcnet.py: when the data is small enough, the features are truncated by 'features = features[:nb_frames*feature_chunk_size*nb_features]', where 'nb_frames = len(data)//(4*pcm_chunk_size)', so the features can be reshaped into shape '(nb_frames*feature_chunk_size, nb_features)', as written around line 89 of src/train_lpcnet.py.

However, when the data is extremely long, 'nb_frames*feature_chunk_size*nb_features' is much larger than 'len(features)', so the features never get truncated and the reshape of the .f32 file always fails.

I think you must have run into this problem when you trained LPCNet with a very large audio set. Could you please share how you solved it?
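
(For reference, one possible fix is to derive nb_frames from whichever file is shorter, so both arrays are truncated consistently before the reshape; a sketch, with constants following src/train_lpcnet.py as far as I can tell, so adjust if your version differs.)

```python
# Hypothetical patch sketch for the reshape failure in src/train_lpcnet.py.
import numpy as np

nb_features = 55
feature_chunk_size = 15
frame_size = 160
pcm_chunk_size = frame_size * feature_chunk_size

# memmap avoids loading a 20 GB file into RAM just to measure it.
data = np.memmap("data.u8", dtype="uint8", mode="r")
features = np.memmap("features.f32", dtype="float32", mode="r")

# Use the shorter of the two files to decide how many frames to keep.
nb_frames = min(len(data) // (4 * pcm_chunk_size),
                len(features) // (feature_chunk_size * nb_features))

data = data[: nb_frames * 4 * pcm_chunk_size]
features = features[: nb_frames * feature_chunk_size * nb_features]
features = np.reshape(features, (nb_frames * feature_chunk_size, nb_features))
```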

I guess you are so busy these days so it's okay not to reply if you don't have time.

Thanks!

byuns9334 commented 4 years ago

@zshakeri I used Ubuntu 16.04.

Maxxiey commented 4 years ago

@byuns9334 Hi, I noticed that you have successfully trained a female voice. Did you change any parameters in either Tacotron or LPCNet? I kept them unchanged but got a voice with poor quality; you can check the samples I uploaded. Any suggestion will be appreciated, thanks in advance. samples.zip

ysujiang commented 4 years ago

I'd be very thankful if anyone can help me; looking forward to your reply. I tried https://github.com/carlfm01/Tacotron-2/tree/spanish (100k steps) + https://github.com/MlWoo/LPCNet (10 epochs). No parameters were changed, but the generated samples do not sound good. Here are my samples and training steps.

For LPCNet:

  1. used concat.sh to obtain input.s16
  2. make clean && make dump_data, then ./dump_data -train input.s16 features.f32 data.u8
  3. python train_lpcnet.py features.f32 data.u8

For Tacotron2:

  1. make clean && make dump_data taco=1
  2. ./header_removal.sh and ./feature_extract.sh
  3. convert the generated .f32 data to something numpy can load, and replace the .npy files for the audio
  4. python train.py

For synthesis: Tacotron 2 produces the .f32 file f32_for_lptnet.f32; for LPCNet, change test_lpcnet.py to model.load_weights('model_loss-2.847120.hdf5'), then make clean && make test_lpcnet taco=1, ./test_lpcnet f32_for_lptnet.f32 test.s16, and ffmpeg -f s16le -ar 16k -ac 1 -i test.s16 test-out.wav.

sample.zip

Is there anything wrong with my process, and why do the samples not sound good? How can I adjust my process?

cyxomo commented 4 years ago

The core of the issue is that LPCNet uses a Bark-scale spectrum, while Tacotron generates a mel spectrogram. So how do you get the Bark spectrum from the mel spectrogram, or how do you calculate the LPC from the mel features?

Maxxiey commented 4 years ago

@ysujiang Your steps seem alright. I wonder if the waves in your train set have the same volume; care to share a few?

ysujiang commented 4 years ago

@Maxxiey In my train set the waves have the same volume, but my synthesized samples don't. The synthesized samples also have a trill. Do you know why? And can you give me some advice on adjusting parameters? If so, thank you very much; looking forward to your reply. train_sample.zip synthesized_samples.zip

Maxxiey commented 4 years ago

Hi, @ysujiang .

Ran some tests on your data and it seems fine to me.

Here is what I did: use header_removal.sh and feature_extract.sh to generate features, then use test_lpcnet to turn them back into wavs. During the whole process no warning message appeared, so I guess everything worked fine. The generated samples are attached; the trills are still there because my model is trained on my own dataset, but all wavs have almost the same volume. debug_sample.zip

As for training parameters, I left them untouched and the result came out just fine, so sorry, no advice from me. Maybe you should check your .s16 files to see if they have the same volume as the original wavs.

BTW, if it is only the volume difference that troubles you, pydub should do the trick: https://github.com/jiaaro/pydub/blob/master/API.markdown#audiosegmentmax_dbfs
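
Something like this, for example (a small sketch; the -3 dBFS target is arbitrary):

```python
# Sketch: bring every training wav to the same peak level with pydub
# before feature extraction.
from pydub import AudioSegment

def normalize_peak(in_path, out_path, target_dbfs=-3.0):
    seg = AudioSegment.from_wav(in_path)
    seg = seg.apply_gain(target_dbfs - seg.max_dBFS)
    seg.export(out_path, format="wav")
```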

ysujiang commented 4 years ago

@Maxxiey Hello, I hear that when you test LPCNet with data from the training set there is also some tremolo. Is this a problem with LPCNet itself?

ysujiang commented 4 years ago

@Maxxiey Thanks for your help. I did the same thing as you, but using voices of people not in the training set:

  1. make clean && make dump_data taco=1
  2. ./header_removal.sh and ./feature_extract.sh to get the feature files (*.f32)

Then I used test_lpcnet to turn the feature files back into wavs: make clean && make test_lpcnet taco=1, ./test_lpcnet f32_for_lptnet.f32 test.s16, ffmpeg -f s16le -ar 16k -ac 1 -i test.s16 test-out.wav. It doesn't work well; the tremolo is very obvious. When I use the same method on voices from the training set, it works fine.
Do you know why? Have you changed anything in LPCNet? What should I change? train_data.zip test_result.zip

wizardk commented 4 years ago

@ysujiang You need to train not only on the features extracted directly from wavs but also on the features output from Tacotron2.

LqNoob commented 3 years ago

@ysujiang You need to train not only on the features extracted directly from wavs but also on the features output from Tacotron2.

@wizardk What were your results before and after adopting this method?