Closed — sranjeet81 closed this issue 3 years ago.
LPCNet is basically one half of a TTS system. It takes an acoustic feature vector every 10 ms and outputs speech samples. For TTS, you also need a network that takes in characters and outputs these acoustic feature vectors.
@jmvalin Hi, I have trained a Taco2 model to predict the 18-band Bark-scale cepstrum and 2 pitch parameters. Can you tell me how to compute the LPC from the Bark-scale cepstrum, or which part of denoise.c does this work? Thank you.
@changeforan To compute the LPC coefficients, look for the _celt_lpc() function in denoise.c. The process starts from Ex, computed by compute_band_energy(), so you'd need to invert a few more steps, but that shouldn't be too hard.
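For reference, the final autocorrelation-to-LPC step can be sketched in plain Python with the Levinson-Durbin recursion. This is a simplified illustration of what _celt_lpc() computes, not a port of the C code; recovering the autocorrelation from the band energies is the extra inversion mentioned above.

```python
import numpy as np

def levinson_durbin(r, order):
    """Recover LPC coefficients a[0..order-1] from autocorrelation r[0..order],
    so that x[n] is predicted as sum_j a[j] * x[n-1-j].
    Illustrative sketch of the recursion behind _celt_lpc()."""
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # reflection coefficient for order i+1
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err
        # update the lower-order solution
        new_a = a.copy()
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)
    return a
```

Feeding it the sample autocorrelation of a synthetic AR signal recovers the generating coefficients, which is a quick way to sanity-check any reimplementation.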
@jmvalin Thanks for your quick response, but I am still confused. It seems like you compute the LPC at line 399 and assign them to features[39:55] at line 448, but if features[0:18] are the Bark-scale coefficients, they are computed after line 399. After reading your paper, I think features[39:55] should be computed from features[0:18]: 18-band Bark-frequency cepstrum -> PSD -> auto-correlation -> LPC. Am I right?
The LPC are the same as if they'd been computed on features[0:18]. The spectrum on which they're computed in the C code is the same that's used to compute the cepstrum and the operation is reversible.
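The reversibility can be illustrated with an orthonormal DCT round trip. Assuming (as a simplification, not a claim about the exact C implementation) that the cepstrum is a plain orthonormal DCT-II of the log band energies, the band energies come back exactly from the inverse transform:

```python
import numpy as np

NB_BANDS = 18  # LPCNet's Bark-scale band count

def dct_matrix(n):
    """Orthonormal DCT-II matrix: D @ v transforms v, D.T @ c inverts it."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    d[0, :] /= np.sqrt(2.0)
    return d

D = dct_matrix(NB_BANDS)

def energy_to_cepstrum(band_energy):
    # cepstrum as DCT of log band energies (simplified model)
    return D @ np.log10(band_energy)

def cepstrum_to_energy(cepstrum):
    # exact inverse: orthonormal DCT is invertible by its transpose
    return 10.0 ** (D.T @ cepstrum)
```

Because the transform is orthogonal and the log is invertible, no information is lost going from band energies to cepstrum, which is why the LPC can equally be derived from either.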
SO IT WORKS. Here are my samples: https://yadi.sk/d/mBUJVSCzVVd2fQ I achieved the result with the following steps:

1. Convert the Tacotron 2 output to raw 16-bit PCM: `sox taco2-out.wav -b 16 -s -c 1 -r 16k -t raw - > input.s16`
2. Build with `./compile.sh` and run `./dump_data input.s16 exc.s8 features.f32 pred.s16 pcm.s16` to get features.f32
3. Synthesize with `./test_lpcnet.py features.f32 > pcm.txt` (it works quite slowly...)
4. Convert pcm.txt to PCNet-out.wav: `ffmpeg -f s16le -ar 16k -ac 1 -i pcm.txt PCNet-out.wav`

So, am I right? But why does it work so slowly? P.S. With an RNN vocoder I've got better results...
So if I'm right, I'll try to connect Tacotron 2 and LPCNet. Or would it be a better choice to use something else instead of Tacotron 2?
Well, the way it's normally supposed to work is that you train Tacotron (or whatever network) to directly output features that LPCNet can use. No need to run the synthesis twice (though in this case I guess it was easier for testing purposes).
Thanks for your response. Yes, it works. Of course, I synthesized the sound from Tacotron 2 to demonstrate the result (so to speak, to show progress). I tested LPCNet for Korean and Russian; the results are impressive. I will develop an implementation of Tacotron 2 for a closer connection with LPCNet to make an end-to-end TTS system. If Tacotron 2 runs on the server (without the WaveNet vocoder) and LPCNet runs on the clients, it solves many problems and could reduce server load by up to 10x.
@gosha20777 What acoustic features did you use when you trained the TTS model? I've trained with both 55-dim features and 21-dim features; however, the results are not good.
I've got features from an English multi-speaker dataset, about 8 hours.
With the original 55-dim features, or other features?
Hmm, I'm not sure... But in my opinion it was 20-dim features.
Try to train for a LONG TIME. I trained it for about 5 days on 2x Nvidia 1080 Ti. I used the horovod library to parallelize it.
I can give you a pretrained model if you want.
I can't understand what the 120-dim features are and how you extract them. I'd appreciate some explanation. In my opinion, the paper claims 20-dim features, while the code actually seems to use 55-dim features.
Oh no! Not 120-dim but 20-dim! I'm so sorry :)
In the code, it seems to be 21-dim features rather than 20-dim. I've tried to predict the 21-dim features; however, the results sound unstable. My backbone model is not a Taco-series model but a traditional RNN model.
@attitudechunfeng I have reviewed the code and found that features[18:36] are assigned zero, features[36] and features[37] are about pitch, features[38] is not used at all, and features[39:55] are the LPC.
So that means I only need to predict features[0:18] and features[36:38], 20 dims in total? Did you get good results using these features? @changeforan
with Taco2 model, Yes.
FYI, I don't think features[38] is useful for anything. OTOH, features[18:36] could potentially be useful for TTS.
@attitudechunfeng The 21st dim is not to be predicted.
@hdmjdp What do you mean? Can you explain it in more detail?
@attitudechunfeng It means you need not predict the period, so the net outputs 20 dims.
I tried to predict LPCNet parameters directly using a Tacotron model. The generated voice is not very good, and the attention looks very strange. Here are some attention plots and samples (in Chinese). Has anyone else run into this situation, and can explain it?
More: tacotron_lpcnet.zip
Are you training end-to-end or are you just learning the LPCNet features from text? Also, make sure that the LPC features are not predicted, but rather computed directly from the predicted cepstral features.
@candlewill Maybe you used the wrong features, as jmvalin said. My alignment is very good, and compared to the mel spectrogram it is much easier to get the alignment.
Thanks @jmvalin and @azraelkuan. I predict all of the 55-dim features when doing end-to-end training. I will change which features I predict and try again.
@azraelkuan Looks great! Could you share your synthesized speech from Tacotron + LPCNet?
And did you train Tacotron to predict the 20-dim feature (the 18-dim cepstrum concatenated with the 2 pitch parameters) instead of the 80-dim mel spectrogram? (In that case the decoder LSTM input would be the 20-dim concatenated feature.)
Or is only the 18-dim cepstrum the input of the decoder LSTM, with the 2 pitch parameters predicted by a dense projection, like the stop token?
Could you explain the structure in more detail, or share tips for training (e.g. window_size, hop_size (= frame shift), and feature normalization)?
I would appreciate your reply.
Feature: the 20-dim concatenated feature; I do not split them. I cannot share the samples, sorry.
@azraelkuan what is your repo of tacotron?
I changed the features to predict, and then the attention could be learnt well. Here are some samples in 16k PCM format generated from an end2end+LPCNet model. e2e_lpcnet_samples.zip
@azraelkuan why not use tacotron2?
@candlewill how to convert chinese char to vector?
I changed the features to predict, then the attention could be learnt well. Here is some samples with 16k pcm format generated from an end2end+lpcnet model. e2e_lpcnet_samples.zip
May I know how you changed your features for modeling and prediction?
@candlewill Thanks
@bearlu007 Here is some of my code; you could use it as a reference:

- 55d to 20d:

```python
def reduce_dim(features):
    """Reduce dimension from 55d to 20d.
    Keep features[0:18] and features[36:38] only.
    :param features: 55d
    :return: 20d
    """
    N, D = features.shape
    assert D == 55, "Dimension error. %sx%s" % (N, D)
    features = np.concatenate((features[:, 0:18], features[:, 36:38]), axis=1)
    assert features.shape[1] == 20, "Dimension error. %s" % str(features.shape)
    return features
```

- convert 20d back to 55d at test time:

```python
features = np.zeros((N, 55))
features[:, 0:18] = input[:, 0:18]
features[:, 36:38] = input[:, 18:20]
```
Clear enough. Thanks a lot .
@azraelkuan I have a question about the predicted features. When training with Tacotron, do you only use LPCNet features, or LPCNet features and a linear spectrogram?
@attitudechunfeng Only LPCNet features, 20 dimensions.
Thanks for your quick reply. And after how many steps does the alignment become good?
@attitudechunfeng About 5k steps. I use the real LPCNet features at the decoder input during training.
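Feeding the real features to the decoder during training is standard teacher forcing; a minimal sketch (function and argument names are illustrative, not taken from any repo mentioned here):

```python
import numpy as np

def decoder_input(ground_truth, predictions, t, teacher_forcing):
    """Pick the decoder input for step t: the real (ground-truth) previous
    frame during training, the model's own previous prediction at inference.
    Arrays are (batch, time, feature_dim)."""
    prev = ground_truth if teacher_forcing else predictions
    return prev[:, t - 1, :]
```

At inference there is no ground truth, so the decoder must consume its own outputs; teacher forcing only speeds up and stabilizes training.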
@azraelkuan So this repo cannot tell when to stop synthesis?
@hdmjdp You can add a stop token to predict it.
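A stop token is usually a scalar sigmoid projection of the decoder state, trained with binary cross-entropy targets that are 1 at the final frame; synthesis stops once the probability crosses a threshold. A hypothetical numpy sketch (all names are mine, not from Tacotron code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stop_probability(decoder_state, w, b):
    """Hypothetical stop-token head: scalar sigmoid projection of the state."""
    return sigmoid(decoder_state @ w + b)

def should_stop(decoder_state, w, b, threshold=0.5):
    # end synthesis once the predicted stop probability crosses the threshold
    return stop_probability(decoder_state, w, b) > threshold
```

In practice this head sits next to the frame projection in the decoder cell and is trained jointly with the rest of the model.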
@azraelkuan how to add in decoder cell?
@jmvalin If I want to normalize the cepstral coefficients, how should I choose the normalization range? The magnitude of cepstral coefficients seems to vary a lot.
Why do you want to normalize the cepstral coefficients?
I tried to combine Tacotron with LPCNet, which succeeded on a big dataset but failed on a small one. (Feature extraction over the dataset takes only one pass.) The Tacotron output may have a period greater than 3.1, which I think will cause problems for LPCNet (although training does not report an error). So I plan to normalize the cepstrum and pitch parameters.
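One simple option (a sketch of generic per-dimension min-max normalization, not anything LPCNet itself does) is to normalize each of the 20 dimensions to [-1, 1] over the training set and clip on the way back, so e.g. the pitch period can never exceed the range seen in training:

```python
import numpy as np

def fit_minmax(features):
    """Per-dimension min/max over the training set; features is (frames, dims).
    Assumes every dimension actually varies (no guard for hi == lo)."""
    return features.min(axis=0), features.max(axis=0)

def normalize(features, lo, hi):
    # map each dimension into [-1, 1]
    return 2.0 * (features - lo) / (hi - lo) - 1.0

def denormalize(normed, lo, hi):
    # clip first, so out-of-range network outputs are bounded
    # by the training range before being handed to the vocoder
    return (np.clip(normed, -1.0, 1.0) + 1.0) / 2.0 * (hi - lo) + lo
```

The clip in `denormalize` is what addresses the out-of-range period: even if the acoustic model overshoots, the vocoder only ever sees values it was trained on.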
@jmvalin Hi, in your Makefile you give A53 compile options. Does this mean this repo can run in real time on an A53 chip? We find it runs much slower than real time. Why?
LPCNet is not yet real-time on the A53. That's a pretty slow chip. We've managed real-time performance on an iPhone6 though. So it should run in real-time on most modern smartphones. Just not on RaspberryPi yet. That may eventually be achievable, but that's not what we're working on atm.
@jmvalin Thanks for hosting this interesting project. Among the use cases for LPCNet you mention TTS (Text-To-Speech). How do we synthesize speech from text using test_lpcnet.py?
If this is not the approach for implementing TTS, do you have any recommendations on where to start with LPCNet for implementing an end-to-end TTS system?