winddori2002 / TriAAN-VC

TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion
MIT License
129 stars 12 forks

results #16

Closed Blakey-Gavin closed 10 months ago

Blakey-Gavin commented 11 months ago

Hi,

I retrained TriAAN-VC and ParallelWaveGAN on a Chinese dataset. Since there are no phoneme labels, I did not fine-tune CPC_audio. The final results are as follows:

mel-features:
CER:     | s2s_st: 0.5101 | s2s_ut: 0.4329 | u2u_st: 0.4561 | u2u_ut: 0.4826
WER:     | s2s_st: 0.8145 | s2s_ut: 0.6441 | u2u_st: 0.7425 | u2u_ut: 0.7286
ASV ACC: | s2s_st: 0.7200 | s2s_ut: 0.9067 | u2u_st: 0.7767 | u2u_ut: 0.8933
ASV COS: | s2s_st: 0.7321 | s2s_ut: 0.7770 | u2u_st: 0.7379 | u2u_ut: 0.7702

cpc-features:
CER:     | s2s_st: 0.3831 | s2s_ut: 0.3040 | u2u_st: 0.3313 | u2u_ut: 0.3476
WER:     | s2s_st: 0.6723 | s2s_ut: 0.4938 | u2u_st: 0.5674 | u2u_ut: 0.5691
ASV ACC: | s2s_st: 0.7900 | s2s_ut: 0.9567 | u2u_st: 0.8233 | u2u_ut: 0.9267
ASV COS: | s2s_st: 0.7637 | s2s_ut: 0.8076 | u2u_st: 0.7573 | u2u_ut: 0.7941

The results are still quite different from those in your paper. I'm not sure what went wrong; could you give me some advice?

Thanks.

winddori2002 commented 11 months ago

Hi,

Have you checked the WER and CER scores of the ground truths reconstructed by the vocoder? I think it is necessary to check the ASR and vocoder performance first.

Also, how did you split the data? Generally, s2s_st (seen-to-seen speakers and text) is the easiest setting for producing converted outputs.
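For reference, the four settings come from whether the speakers and the text content were seen during training. A minimal sketch of such a split (names and ratios are illustrative, not the repository's actual split logic):

```python
# Illustrative sketch: partition speakers and texts into seen/unseen groups so that
# the four evaluation settings (s2s_st, s2s_ut, u2u_st, u2u_ut) can be built
# from their combinations. Not the repo's actual split code.
import random

def split_seen_unseen(speakers, texts, seen_spk_ratio=0.8, seen_txt_ratio=0.8, seed=0):
    rng = random.Random(seed)
    spk, txt = speakers[:], texts[:]
    rng.shuffle(spk)
    rng.shuffle(txt)
    n_spk = int(len(spk) * seen_spk_ratio)
    n_txt = int(len(txt) * seen_txt_ratio)
    return {
        "seen_speakers":   set(spk[:n_spk]),   # appear in training (s2s pairs)
        "unseen_speakers": set(spk[n_spk:]),   # held out entirely (u2u pairs)
        "seen_texts":      set(txt[:n_txt]),   # transcripts seen in training (st)
        "unseen_texts":    set(txt[n_txt:]),   # held-out transcripts (ut)
    }
```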

Thanks.

Blakey-Gavin commented 11 months ago

First, thank you very much for your reply.

Here is what I did for the two points you raised:

  1. When calculating WER and CER, I replaced the ASR model with Wav2Vec2ForCTC.from_pretrained('qinyue/wav2vec2-large-xlsr-53-chinese-zn-cn-aishell1'), which was trained on a Chinese dataset (a rough sketch of the scoring setup is shown below). For the vocoder, I followed the config file you provided and modified ParallelWaveGAN a bit to fit my data, training it from scratch for about 1,500,000 steps.

  2. I used the AISHELL-3 dataset; the training, validation, and test sets are split as follows:

    AISHELL-3 dataset total files: 88035, Train: 56969, Valid: 15538, Test: 15528, Del Files: 0

The rest of the processing is the same as yours; apart from the small adjustments needed to swap in the dataset, I have hardly modified any code.
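For context, the WER/CER scoring I use is roughly the following sketch, built on transformers and jiwer (the 16 kHz input and the model choice are assumptions; for Chinese, CER is computed over characters, so spaces are stripped):

```python
# Minimal sketch of WER/CER scoring with a Chinese wav2vec2 CTC model.
# Assumes 16 kHz mono audio and jiwer for the error-rate computation.
import torch
import librosa
import jiwer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_ID = "wbbbbb/wav2vec2-large-chinese-zh-cn"  # or the aishell1 model mentioned above
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

def transcribe(wav_path):
    wav, _ = librosa.load(wav_path, sr=16000)
    inputs = processor(wav, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)[0]

def score(ref_text, hyp_text):
    # For Chinese, character error rate is the primary metric; WER depends on
    # how the reference text is segmented into "words".
    cer = jiwer.cer(ref_text.replace(" ", ""), hyp_text.replace(" ", ""))
    wer = jiwer.wer(ref_text, hyp_text)
    return cer, wer
```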

winddori2002 commented 11 months ago

In my case, when I use the ASR model on original waveforms, the WER and CER are about 5% and 2%. When I use it on reconstructed waveforms (original waves -> mel transform -> vocoder -> reconstructed waves), the WER and CER are about 7% and 3%. These results are from the VCTK dataset.

Since the evaluation metrics rely on other pre-trained models, it is necessary to check their performance on your dataset before doing any conversion.
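For reference, the reconstruction check can be outlined as below. The load_model/inference calls follow the kan-bayashi ParallelWaveGAN package, and the mel features must be extracted exactly as in the vocoder's training config (FFT size, hop length, mel bins, normalization stats), so treat this as a sketch rather than drop-in code:

```python
# Rough outline of the "original wav -> mel -> vocoder -> reconstructed wav" check.
# Mel extraction (not shown) must match the vocoder's training configuration.
import torch
import soundfile as sf
from parallel_wavegan.utils import load_model

vocoder = load_model("checkpoint-1500000steps.pkl")  # path is illustrative
vocoder.remove_weight_norm()
vocoder = vocoder.eval()

def reconstruct(mel, out_path, sr=16000):
    # mel: (frames, n_mels) array computed with the same settings as training
    with torch.no_grad():
        wav = vocoder.inference(torch.tensor(mel, dtype=torch.float)).view(-1)
    sf.write(out_path, wav.cpu().numpy(), sr)
```

The reconstructed files are then transcribed with the same ASR model and compared against the ground-truth transcripts.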

Also, which pre-trained voice encoder do you use, and on which dataset was it trained? (the voice encoder used for evaluating the SV scores)
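For reference, the SV cosine score is computed from speaker embeddings; a minimal sketch with Resemblyzer (assuming that is the voice encoder in use) looks like this:

```python
# Minimal sketch of the SV cosine score with Resemblyzer (assumed voice encoder).
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def asv_cosine(converted_path, target_path):
    emb_c = encoder.embed_utterance(preprocess_wav(converted_path))
    emb_t = encoder.embed_utterance(preprocess_wav(target_path))
    # Resemblyzer embeddings are L2-normalized, so the dot product is the cosine.
    return float(np.dot(emb_c, emb_t))
```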

Blakey-Gavin commented 11 months ago

I'm going to test it the way you said. Thank you very much for your patience and guidance.

Blakey-Gavin commented 11 months ago

I would like to ask: when testing the WER and CER of the original and reconstructed waveforms with the ASR model, do the waveforms here refer to the entire dataset? And does it matter if part of the dataset was used as training data for the vocoder that does the reconstruction?

winddori2002 commented 11 months ago

I only tested WER and CER on the validation and test sets. And if you mean that part of the dataset was used to train the vocoder, I don't think it matters.

Blakey-Gavin commented 11 months ago

I use the wav2vec2 ASR model from transformers (I tested many models fine-tuned on Chinese; "wbbbbb/wav2vec2-large-chinese-zh-cn" was the best). The calculated WER and CER are as follows:
original waveforms      -----> CER: 0.08134103931377949  WER: 0.15630324684410174
reconstructed waveforms -----> CER: 0.09729064056277562  WER: 0.18220915162698909

I think the above results show that the vocoder's performance is acceptable. Since WER and CER are computed differently for Chinese than for English, I'm not sure whether such results are normal for Chinese (I don't know much about ASR).

Now I have adjusted the relevant code in the resemblyzer library to fit my dataset, and I am currently retraining the VoiceEncoder. I am going to calculate the ASV threshold for my dataset, but I don't know what "n_samples" should be set to in this code (https://github.com/tzuhsien/Voice-conversion-evaluation/blob/master/metrics/speaker_verification/equal_error_rate/calculate_eer.py). Do you have any good suggestions?

winddori2002 commented 11 months ago

When I calculate the threshold, I use all samples since this process is only used for evaluation.
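A minimal sketch of computing the EER threshold from all same-speaker/different-speaker pairs, using sklearn's roc_curve (pair construction and variable names are illustrative):

```python
# Sketch of EER-threshold computation over all evaluated pairs.
import numpy as np
from sklearn.metrics import roc_curve

def eer_threshold(scores, labels):
    # scores: cosine similarities for every pair of utterance embeddings
    # labels: 1 for same-speaker pairs, 0 for different-speaker pairs
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # operating point where FPR == FNR
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return eer, thresholds[idx]
```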

Blakey-Gavin commented 11 months ago

Ok, Thanks a lot.

I used the Chinese dataset to retrain the VoiceEncoder; the training results are shown in the two attached screenshots.

Then I used the retrained weights to extract an embedding for each utterance in order to calculate the threshold, but I got some errors (ValueError: Input contains NaN). Have you encountered this error?

After inspection, the NaN values are in the embeddings of some utterances, but strangely, the training loss never became NaN. How can I solve this problem? I see three options:

  1. Delete vectors containing NaN values;
  2. Replace NaN values with the mean or median;
  3. Ignore NaN values.

Which of the three methods above do you think is best? Or do you have a better suggestion?
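For option 1, a minimal numpy filter over the stacked embeddings would be something like this (array layout and names are illustrative):

```python
# Option 1: drop any embedding row that contains NaN before computing the threshold.
import numpy as np

def drop_nan_embeddings(embeddings, paths):
    # embeddings: (N, D) array; paths: list of the N corresponding utterance paths
    keep = ~np.isnan(embeddings).any(axis=1)
    dropped = [p for p, k in zip(paths, keep) if not k]
    return embeddings[keep], dropped
```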

winddori2002 commented 11 months ago

If possible, you could remove the utterances containing NaN values and retrain the voice encoder. Otherwise, option 1 that you mentioned looks acceptable.

Blakey-Gavin commented 11 months ago

Yep, I did that too; nothing seems to work.

By the way, if the VC model, VoiceEncoder, ASR model, and vocoder are all trained on the same dataset, what would the impact be?

Could it lead to results that score better under evaluation (CER, WER, and ASV) but actually sound worse when converting speech from other data?

winddori2002 commented 11 months ago

CER and ASV are definitely affected by the performance of the ASR and SV models, and what we want to evaluate is the VC model, not the ASR and SV models. To remove the effect of ASR and SV performance on the evaluation results, it is better to train the ASR and SV models on the same dataset. That is, the scores are somewhat dependent on the ASR and SV models (this is inevitable when using model-based metrics), so it is necessary to consider their effects when you interpret the results.

Blakey-Gavin commented 11 months ago

Ok, got it. The results I'm currently getting are the following:
CER:     | s2s_st: 0.2708 | s2s_ut: 0.2045 | u2u_st: 0.2194 | u2u_ut: 0.2354
WER:     | s2s_st: 0.4790 | s2s_ut: 0.3291 | u2u_st: 0.4022 | u2u_ut: 0.4116
ASV ACC: | s2s_st: 0.9800 | s2s_ut: 1.0000 | u2u_st: 0.9933 | u2u_ut: 0.9900
ASV COS: | s2s_st: 0.9432 | s2s_ut: 0.9566 | u2u_st: 0.9402 | u2u_ut: 0.9470

Although the ASV results look good, actual voice conversion (using speech from another dataset) does not sound as good.

It also seems that the vocoder introduces noise, because even "original waveforms ---> mel ---> reconstructed waveforms" contain some noise.

Blakey-Gavin commented 11 months ago

Hi,

I was wondering whether you have tried other feature extractors, such as wav2vec 2.0 or WavLM, as alternatives to the CPC_audio model?

winddori2002 commented 10 months ago

I also tried other features, but CPC outperformed them. That said, the result may differ depending on the dataset or the pre-trained models.
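For anyone who wants to experiment, extracting frame-level WavLM features with transformers could look roughly like this (the model name, layer choice, and normalization are assumptions; the features would still need to be adapted to the frame rate and dimensionality the VC model expects):

```python
# Rough sketch of extracting frame-level WavLM features as an alternative
# content representation (dimensions/frame rate must be adapted to the VC model).
import torch
import librosa
from transformers import WavLMModel

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()  # illustrative choice

def extract_features(wav_path, layer=6):
    wav, _ = librosa.load(wav_path, sr=16000)
    wav = (wav - wav.mean()) / (wav.std() + 1e-7)  # zero-mean/unit-variance input
    inputs = torch.from_numpy(wav).float().unsqueeze(0)
    with torch.no_grad():
        out = wavlm(inputs, output_hidden_states=True)
    # hidden_states[0] is the transformer input; intermediate layers tend to
    # carry more phonetic content, which is why a middle layer is picked here.
    return out.hidden_states[layer].squeeze(0)  # (frames, hidden_dim)
```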