yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
MIT License

Question about mean / std, and training time #73

Closed: yt605155624 closed this issue 1 year ago

yt605155624 commented 1 year ago

Hi, I want to reproduce StarGANv2-VC with PaddlePaddle for PaddleSpeech, but I found that the preprocessing is the same for StarGANv2-VC, the ASR model, and JDCNet, which means they all use the same melspec and mean/std: https://github.com/yl4579/StarGANv2-VC/blob/c9df527bc5e7f55743d22a6155c1b8e1db7d7d4a/meldataset.py#L52
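
For reference, the normalization that line performs looks roughly like this (a paraphrase of meldataset.py as I read it at that commit; the parameters come from the repo's config and may differ in other versions):

```python
import torch
import torchaudio

# Mel parameters as they appear in the repo's meldataset.py (to the best of
# my reading of that commit; double-check against the file itself):
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4  # the fixed constants the question is about

def preprocess(wave_tensor: torch.Tensor) -> torch.Tensor:
    mel_tensor = to_mel(wave_tensor)
    # log-compress, then shift/scale with the fixed constants
    return (torch.log(1e-5 + mel_tensor) - mean) / std
```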

I want to know some details:

  1. Why didn't you use the melspec of kan-bayashi/ParallelWaveGAN? Then you would not have had to train a PWGAN yourself, and you could also use the HiFiGAN/MB_MelGAN vocoders trained by kan-bayashi/ParallelWaveGAN.
  2. How did you get self.mean, self.std = -4, 4?
  3. I want to use the melspec of PaddleSpeech because we have trained many vocoders with it, so I would have to train a new JDCNet and ASR model using our melspec. I want to know the training time of:

    • StarGANv2-VC
    • JDCNet
    • ASR Model

    If the training time is too long, I may have to use your melspec and mean / std instead, which would mean that StarGANv2-VC in PaddleSpeech cannot use our pretrained vocoders.

Here is my work-in-progress PR: https://github.com/PaddlePaddle/PaddleSpeech/pull/2842/

Looking forward to your reply~

yt605155624 commented 1 year ago

By the way, I have found a small bug in your preprocessing:

You set sr=24000 in the config, but when you call torchaudio here, https://github.com/yl4579/StarGANv2-VC/blob/c17b458ed803792e270f5d8d3c038155404d04e7/meldataset.py#L50, you don't pass a sample rate, and torchaudio's default is 16000. So the mel spectrograms in StarGANv2-VC, JDCNet, and the ASR model all assume 16 kHz. That is another reason I want to reproduce the work with my own preprocessing.
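
To make the mismatch concrete, a minimal sketch (the parameters mirror the repo's config; `to_mel_fixed` is my hypothetical corrected version, not code from the repo):

```python
import torchaudio

# As written in meldataset.py: sample_rate is not passed, so torchaudio falls
# back to its default of 16000 when building the mel filterbank:
to_mel_buggy = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)

# What the 24 kHz config presumably intended (note that sample_rate also
# determines f_max, which defaults to sample_rate // 2):
to_mel_fixed = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000,
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
```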

yl4579 commented 1 year ago
  1. Why didn't you use the melspec of kan-bayashi/ParallelWaveGAN?

     The kan-bayashi preprocessing requires the mean and std to be computed from a specific dataset, which makes it difficult to scale to other datasets. I trained my own vocoder with this preprocessing, though it is definitely not the best.
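
For contrast, a rough sketch of that style of per-dataset normalization (illustrative only, not kan-bayashi's exact code; the recipes fit a StandardScaler per mel bin on the training corpus, so any pretrained vocoder is tied to one dataset's statistics):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# stand-in for an iterator over precomputed log-mel features, (frames, 80) each
training_set_mels = [np.random.randn(200, 80) for _ in range(10)]

scaler = StandardScaler()
for mel in training_set_mels:
    scaler.partial_fit(mel)  # accumulate per-bin mean and variance

def normalize(mel: np.ndarray) -> np.ndarray:
    # per-dimension statistics, tied to one specific training set; reusing a
    # pretrained vocoder means carrying these exact statistics around
    return (mel - scaler.mean_) / scaler.scale_
```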

  2. How did you get self.mean, self.std = -4, 4?

     Those are arbitrary numbers from a rough calculation over LibriTTS, JVS, and other big datasets. They do not represent the actual mean and standard deviation for normalization, just a rough approximation.
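
If you want exact constants for your own corpus rather than the -4/4 approximation, one option is to accumulate a global scalar mean/std over all log-mel values. A minimal sketch (my own, not from the repo):

```python
import torch
import torchaudio

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_mels=80, n_fft=2048, win_length=1200, hop_length=300)

total, total_sq, count = 0.0, 0.0, 0

def accumulate(wave_tensor: torch.Tensor) -> None:
    """Add one utterance's log-mel values to the running sums."""
    global total, total_sq, count
    logmel = torch.log(1e-5 + to_mel(wave_tensor))
    total += logmel.sum().item()
    total_sq += logmel.pow(2).sum().item()
    count += logmel.numel()

# after calling accumulate() on every utterance in the corpus:
# mean = total / count
# std  = (total_sq / count - mean ** 2) ** 0.5
```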

  3. What is the training time of StarGANv2-VC, JDCNet, and the ASR model?

     They should all be very fast to train. JDCNet is the fastest, though computing the ground-truth F0 is quite slow; once the first epoch is done, you should be able to finish training it in a few hours on an NVIDIA A40. The ASR model is also very fast; it took me about one day to train on LibriTTS, JVS, and AiShell. That includes data augmentation, which is highly recommended for realistic settings.
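
On the slow ground-truth F0 step: WORLD's harvest algorithm (via pyworld) is one common choice for precomputing F0 labels, and its slowness would explain the expensive first epoch. A hedged sketch only; the actual pipeline in the training code may differ:

```python
import numpy as np
import soundfile as sf
import pyworld

def extract_f0(wav_path: str, hop_ms: float = 12.5) -> np.ndarray:
    """Ground-truth F0 per frame in Hz (0 for unvoiced frames)."""
    x, sr = sf.read(wav_path)
    x = np.ascontiguousarray(x, dtype=np.float64)  # pyworld wants float64
    # harvest is accurate but slow; this is the expensive part of epoch one
    # (12.5 ms matches hop_length=300 at 24 kHz)
    f0, t = pyworld.harvest(x, sr, frame_period=hop_ms)
    return pyworld.stonemask(x, f0, t, sr)  # refine the raw estimate
```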

  4. You haven't set sr, and the default sr of torchaudio is 16000, so the mel spectrograms in StarGANv2-VC, JDCNet, and the ASR model are all 16 kHz.

     See https://github.com/yl4579/StarGANv2-VC/issues/10. It is indeed a mistake, so the sound quality should improve with better preprocessing. I will get rid of the vocoder and this preprocessing entirely in my future work.