xiph / LPCNet

Efficient neural speech synthesis
BSD 3-Clause "New" or "Revised" License

How to perform text to speech #4

Closed: sranjeet81 closed this issue 3 years ago

sranjeet81 commented 5 years ago

@jmvalin Thanks for hosting this interesting project. You mention TTS (text-to-speech) as one of the use cases for LPCNet. How do we synthesize speech from text using test_lpcnet.py?

If this is not the right approach for implementing TTS, do you have any recommendation on where to start with LPCNet for building an end-to-end TTS system?

jmvalin commented 5 years ago

LPCNet is basically one half of a TTS system. It takes an acoustic feature vector every 10 ms and outputs speech samples. For TTS, you also need a network that takes in characters and outputs these acoustic feature vectors.
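
A conceptual sketch of that split, with hypothetical stubs rather than LPCNet's actual API:

import numpy as np

# Conceptual sketch of the two-stage pipeline; both stubs are hypothetical
# placeholders, not LPCNet's actual API.
def acoustic_model(text):
    """Stand-in for Tacotron & co.: characters -> one 20-dim frame per 10 ms."""
    n_frames = 100  # dummy: a real model also predicts the duration
    return np.zeros((n_frames, 20), dtype=np.float32)

def lpcnet_vocoder(frames):
    """Stand-in for LPCNet: feature frames -> 16 kHz samples (160 per frame)."""
    return np.zeros(len(frames) * 160, dtype=np.int16)

samples = lpcnet_vocoder(acoustic_model("hello world"))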

changeforan commented 5 years ago

@jmvalin Hi, I have trained a Taco2 model to predict the 18-band Bark-scale cepstrum and the 2 pitch parameters. Can you tell me how to compute the LPC from the Bark-scale cepstrum, or which part of denoise.c does this work? Thank you.

jmvalin commented 5 years ago

@changeforan To compute the LPC coefficients, look for the _celt_lpc() function in denoise.c. The process starts from Ex, computed by compute_band_energy(), so you'd need to invert a few more steps, but that shouldn't be too hard.

changeforan commented 5 years ago

@jmvalin Thanks for your quick response, but I am still confused. It seems like you compute the LPC at line 399 and assign them to features[39:55] at line 448, but if features[0:18] are the Bark-scale coefficients, they are computed after line 399. After reading your paper, I think features[39:55] should be computed from features[0:18]:

18-band Bark-frequency cepstrum -> PSD -> auto-correlation -> LPC

Am I right?

jmvalin commented 5 years ago

The LPC are the same as if they'd been computed on features[0:18]. The spectrum on which they're computed in the C code is the same one used to compute the cepstrum, and the operation is reversible.
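
To make that chain concrete, here is a rough numpy sketch of the inversion (cepstrum -> band energies -> PSD -> autocorrelation -> LPC). The band edges, log scale, and FFT size are assumptions for illustration; the authoritative version is the C code (_celt_lpc() and the band computation in denoise.c), not this snippet.

import numpy as np
from scipy.fftpack import idct

LPC_ORDER = 16
FREQ_SIZE = 161   # bins of a 320-point FFT at 16 kHz (assumed)
# Assumed Bark-ish band edges in FFT bins (18 bands)
BAND_EDGES = 4 * np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 10,
                           12, 14, 16, 20, 24, 28, 34, 40])

def levinson(ac, order=LPC_ORDER):
    """Levinson-Durbin recursion: autocorrelation -> LPC coefficients."""
    lpc = np.zeros(order)
    err = ac[0] + 1e-9
    for i in range(order):
        k = (ac[i + 1] - np.dot(lpc[:i], ac[i:0:-1])) / err
        lpc[:i] -= k * lpc[:i][::-1]   # update previous coefficients
        lpc[i] = k
        err *= 1 - k * k
    return lpc

def lpc_from_cepstrum(ceps):
    log_e = idct(ceps, norm='ortho')   # cepstrum -> log band energies
    band_e = 10.0 ** log_e             # assumed log10 scale
    # interpolate the 18 band energies to a per-bin PSD
    psd = np.interp(np.arange(FREQ_SIZE), BAND_EDGES, band_e)
    # inverse FFT of the PSD gives the autocorrelation (Wiener-Khinchin)
    ac = np.fft.irfft(psd)[:LPC_ORDER + 1]
    return levinson(ac)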

gosha20777 commented 5 years ago

SO IT WORKS. Here are my samples: https://yadi.sk/d/mBUJVSCzVVd2fQ
I achieved this result with the following steps:

  1. Load the pre-trained model.
  2. Take a WAV sample from a trained Tacotron-2 without a vocoder (01.wav).
  3. Convert it to 16-bit 16 kHz mono raw PCM (sox taco2-out.wav -b 16 -s -c 1 -r 16k -t raw - > input.s16).
  4. Compile the data processing program (./compile.sh) and run it (./dump_data input.s16 exc.s8 features.f32 pred.s16 pcm.s16) to get the features.f32 file.
  5. Synthesize speech with LPCNet (./test_lpcnet.py features.f32 > pcm.txt). It runs quite slowly...
  6. Convert pcm.txt to LPCNet-out.wav (ffmpeg -f s16le -ar 16k -ac 1 -i pcm.txt LPCNet-out.wav).

So, am I right? But why does it run so slowly? P.S. With an RNN vocoder I got better results...

So if I'm right, I'll try to connect Tacotron-2 and LPCNet. Or would it be a better choice to use something else instead of Tacotron-2?

jmvalin commented 5 years ago

Well, the way it's normally supposed to work is that you train Tacotron (or whatever network) to directly output features that LPCNet can use. No need to run the synthesis twice (though in this case I guess it was easier for testing purposes).

gosha20777 commented 5 years ago

Thanks for your response. Yes, it works. Of course, I synthesized the sound from Tacotron-2 to demonstrate the result (that is, to show progress). I tested LPCNet for Korean and Russian. The results are impressive. I will develop an implementation of Tacotron-2 for a closer connection with LPCNet to make an end-to-end TTS system. If Tacotron-2 runs on the server (without the WaveNet vocoder) and LPCNet runs on the clients, it solves many problems and reduces server load up to 10 times.

attitudechunfeng commented 5 years ago

@gosha20777 What acoustic features did you use when training the TTS model? I've trained with both 55-dimension and 21-dimension features; however, the results are not good.

gosha20777 commented 5 years ago

I got the features from an English multi-speaker dataset, about 8 hours.

attitudechunfeng commented 5 years ago

With the original 55-dimension features, or other features?

gosha20777 commented 5 years ago

Hmm, I'm not sure... but in my opinion it was the 20-dim features.

Try training for a LONG TIME. I trained it for about 5 days on 2x Nvidia 1080 Ti. I used the Horovod library to parallelize it.

gosha20777 commented 5 years ago

I can give you a pretrained model if you want.

attitudechunfeng commented 5 years ago

I can't understand what the 120-dim features are and how you extract them. I'd appreciate some explanation. In my opinion, the paper claims 20-dim features, while the code actually seems to use 55-dim features.

gosha20777 commented 5 years ago

Oh no! Not 120-dim, but 20-dim! I'm so sorry :)

attitudechunfeng commented 5 years ago

In the code, it seems to be 21-dim features rather than 20. I've tried to predict the 21-dim features; however, the results sound unstable. My backbone model is not from the Taco series, but a traditional RNN model.

changeforan commented 5 years ago

@attitudechunfeng I have reviewed the code and found that features[18:36] is assigned zero, features[36] and features[37] are the pitch parameters, features[38] is not used at all, and features[39:55] are the LPC.

attitudechunfeng commented 5 years ago

So it means that I only need to predict features[0:18] and features[36:38], 20 dims in total? Do you have good results using these features? @changeforan

changeforan commented 5 years ago

So it means that I only need to predict features[0:18] and features[36:38], 20 dims in total? Do you have good results using these features?

With a Taco2 model, yes.

jmvalin commented 5 years ago

FYI, I don't think features[38] is useful for anything. OTOH, features[18:36] could potentially be useful for TTS.
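
Pulling the thread together, here is the 55-dim layout as a reader's summary in Python slices (bounds taken from the comments above; verify against the current source):

import numpy as np

# Summary of the 55-dim feature frame as discussed in this thread.
CEPSTRUM = slice(0, 18)       # 18-band Bark-scale cepstrum
BANDS_UNUSED = slice(18, 36)  # zeroed in dumps; possibly useful for TTS
PITCH = slice(36, 38)         # pitch period and pitch correlation
UNUSED = slice(38, 39)        # apparently not useful for anything
LPC = slice(39, 55)           # 16 LPC coefficients, derivable from the cepstrum

frame = np.zeros(55, dtype=np.float32)
assert frame[LPC].shape == (16,)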

hdmjdp commented 5 years ago

@attitudechunfeng The 21st dim is not to be predicted.

attitudechunfeng commented 5 years ago

@hdmjdp What do you mean? Can you explain in more detail?

hdmjdp commented 5 years ago

@attitudechunfeng It means you don't need to predict the period, so the net outputs 20 dims.

candlewill commented 5 years ago

I tried to predict the LPCNet parameters directly using a Tacotron model. The generated voice is not very good, and the attention looks very strange. Here are some attention plots and samples (in Chinese). Has anyone else run into this situation and knows how to explain it?

[attention plot image]

More: tacotron_lpcnet.zip

jmvalin commented 5 years ago

Are you training end-to-end, or are you just learning the LPCNet features from text? Also, make sure that the LPC features are not predicted, but rather computed directly from the predicted cepstral features.
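
In code, that advice might look like the following sketch at synthesis time (it reuses the hypothetical lpc_from_cepstrum() sketched earlier in this thread, and the slice bounds discussed above):

import numpy as np

def expand_predicted(pred20):
    """Hypothetical helper: 20-dim network output -> 55-dim LPCNet frames.
    The LPC block is derived from the predicted cepstrum, never predicted."""
    n = len(pred20)
    feats = np.zeros((n, 55), dtype=np.float32)
    feats[:, 0:18] = pred20[:, 0:18]     # predicted cepstrum
    feats[:, 36:38] = pred20[:, 18:20]   # predicted pitch period/correlation
    for i in range(n):
        feats[i, 39:55] = lpc_from_cepstrum(feats[i, 0:18])
    return feats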

azraelkuan commented 5 years ago

@candlewill Maybe you used the wrong features, as jmvalin said; my alignment is very good. And compared to the mel spectrogram, it is much easier to get the alignment. [alignment plot image]

candlewill commented 5 years ago

Thanks @jmvalin and @azraelkuan. I predicted all 55 dims of the features when doing end-to-end training. I will try changing the features to predict.

ohleo commented 5 years ago

@azraelkuan Looks great! Could you share your synthesized speech from Tacotron + LPCNet?


LPCNet acoustic features:
features[:18] : 18-dim Bark-scale cepstrum
features[18:36] : not used
features[36:37] : pitch period (what is this value?)
features[37:38] : pitch correlation (what is this value?)
features[39:55] : LPC (calculated from the cepstrum)
window_size (= n_fft) = 320 (is it right?)
frame_shift (= hop_size) = 160 (is it right?)

And did you train Tacotron to predict the 20-dim feature (concatenating the 18-dim cepstrum and the 2 pitch parameters) instead of the 80-dim mel-spectrogram? (In that case, the decoder LSTM input would be the 20-dim concatenated feature.)

Or is only the 18-dim cepstrum the input of the decoder LSTM, with the 2 pitch parameters predicted by a dense projection, like the stop token?

Could you explain the structure in more detail, or share tips for training (e.g. window_size, hop_size (= frame shift), and feature normalization)?

I would appreciate your reply.

azraelkuan commented 5 years ago

Feature: the 20-dim concatenated feature; I do not split them. I cannot share the samples, sorry.

hdmjdp commented 5 years ago

@azraelkuan Which Tacotron repo did you use?

azraelkuan commented 5 years ago

@hdmjdp https://github.com/keithito/tacotron

candlewill commented 5 years ago

I changed the features to predict, and then the attention could be learned well. Here are some samples in 16 kHz PCM format, generated from an end2end+LPCNet model: e2e_lpcnet_samples.zip

hdmjdp commented 5 years ago

@azraelkuan Why not use Tacotron-2?

hdmjdp commented 5 years ago

@candlewill How do you convert Chinese characters to vectors?

bearlu007 commented 5 years ago

I changed the features to predict, and then the attention could be learned well. Here are some samples in 16 kHz PCM format, generated from an end2end+LPCNet model: e2e_lpcnet_samples.zip

May I know how you changed your features for modeling and prediction?

@candlewill Thanks

candlewill commented 5 years ago

@bearlu007 Here is some of my code; you could use it as a reference:

  1. 55d to 20d:

import numpy as np

def reduce_dim(features):
    """Reduce dimension from 55d to 20d.
    Keep features[0:18] and features[36:38] only.
    :param features: 55d feature matrix (N x 55)
    :return: 20d feature matrix (N x 20)
    """
    N, D = features.shape
    assert D == 55, "Dimension error. %sx%s" % (N, D)
    features = np.concatenate((features[:, 0:18], features[:, 36:38]), axis=1)
    assert features.shape[1] == 20, "Dimension error. %s" % str(features.shape)
    return features

  2. Convert 20d back to 55d at test time (input is the N x 20 predicted matrix; all other dims stay zero):

features = np.zeros((N, 55))
features[:, 0:18] = input[:, 0:18]
features[:, 36:38] = input[:, 18:20]
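
For instance, a hypothetical round trip with the snippets above (assuming features.f32 is a dump produced by dump_data):

import numpy as np

# Hypothetical usage: load a 55d dump, keep the 20 predicted dims as
# training targets, then re-expand at test time (dropped dims stay zero).
feats55 = np.fromfile("features.f32", dtype=np.float32).reshape(-1, 55)
targets20 = reduce_dim(feats55)
restored = np.zeros((len(targets20), 55), dtype=np.float32)
restored[:, 0:18] = targets20[:, 0:18]
restored[:, 36:38] = targets20[:, 18:20]
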
bearlu007 commented 5 years ago


Clear enough. Thanks a lot.

attitudechunfeng commented 5 years ago

@azraelkuan I have a question about the predicted features. When training with Tacotron, do you use only the LPCNet features, or the LPCNet features plus the linear spectrogram?

azraelkuan commented 5 years ago

@attitudechunfeng Only the LPCNet features, 20 dimensions.

attitudechunfeng commented 5 years ago

Thanks for your quick reply. And after how many steps does the alignment become good?

azraelkuan commented 5 years ago

@attitudechunfeng About 5k steps. I use the real LPCNet features in the training decode step.

hdmjdp commented 5 years ago

@azraelkuan Can this repo not tell when to stop?

azraelkuan commented 5 years ago

@hdmjdp You can add a stop token to predict it.
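
As a rough illustration of the idea (this is the Tacotron 2 trick in Keras-style code, not code from this repo; the 256-dim decoder state is an assumption): project each decoder output frame to a scalar "am I done?" probability and train it with binary cross-entropy.

import tensorflow as tf
from tensorflow.keras import layers

decoder_out = layers.Input(shape=(None, 256))  # assumed decoder state size
stop_logit = layers.Dense(1)(decoder_out)      # one logit per frame
stop_prob = layers.Activation("sigmoid")(stop_logit)
model = tf.keras.Model(decoder_out, stop_prob)
model.compile(optimizer="adam", loss="binary_crossentropy")
# Targets: 0 for every frame except the final frame(s) of each utterance (1).
# At inference, stop decoding once stop_prob exceeds a threshold (e.g. 0.5).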

hdmjdp commented 5 years ago

@azraelkuan How do you add it in the decoder cell?

hyzhan commented 5 years ago

@jmvalin If I want to normalize the cepstral coefficients, how should I choose the normalization range? The magnitude of cepstral coefficients seems to vary a lot.

jmvalin commented 5 years ago

Why do you want to normalize the cepstral coefficients?

hyzhan commented 5 years ago

I tried to combine Tacotron with LPCNet, which succeeded on a big dataset but failed on a small dataset. (Feature extraction over the dataset takes only one pass.) The Tacotron output may have a period greater than 3.1, which I think will cause problems when training the LPCNet network (although training does not report an error). So I plan to normalize the cepstrum and pitch parameters.
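
One simple option is per-dimension standardization (a sketch; the dump file name, the 1e-8 floor, and the choice of columns are illustrative). Remember to invert the transform before feeding frames to LPCNet, since it consumes the unnormalized features:

import numpy as np

# Compute statistics once over a training dump, normalize the 20 predicted
# dims for the acoustic model, and denormalize its outputs at synthesis time.
train = np.fromfile("train_features.f32", dtype=np.float32).reshape(-1, 55)
cols = np.r_[0:18, 36:38]               # cepstrum + pitch parameters
mean = train[:, cols].mean(axis=0)
std = train[:, cols].std(axis=0) + 1e-8  # floor to avoid division by zero

def normalize(f20):
    return (f20 - mean) / std

def denormalize(f20):
    return f20 * std + mean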

hdmjdp commented 5 years ago

@jmvalin Hi, in your Makefile you provide compile options for the A53. Does this mean this repo can run in real time on an A53 chip? We find it runs much slower than real time. Why?

jmvalin commented 5 years ago

LPCNet is not yet real-time on the A53; that's a pretty slow chip. We've managed real-time performance on an iPhone 6, though, so it should run in real time on most modern smartphones. Just not on a Raspberry Pi yet. That may eventually be achievable, but it's not what we're working on at the moment.