yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
MIT License

Can you provide the code for ASR and F0 network training? #9

Closed · 980202006 closed this issue 2 years ago

980202006 commented 2 years ago

I want to retrain the ASR and F0 models with my own dataset. Could you provide the training code, or are there any precautions to take when building the training code myself?

yl4579 commented 2 years ago

The training code is very messy, so it may take another several weeks or even longer to finish. I figure it'll be faster if you refer to other repos to train your own models, but if you are willing to wait I may try to clean up the code and make it public.

The F0 models are trained with MSE loss to reconstruct the F0 curves estimated from the Yin algorithm using PyWorld. The ASR models are phoneme-level joint CTC-attention models; the losses are a CTC loss on the CNN output and a cross-entropy loss on the RNN output.

Both the F0 and ASR models are trained with data augmentation, which can be found here: https://github.com/iver56/audiomentations. You may want to download the impulse response dataset here, and the background noise dataset can be downloaded here.
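For reference, here is a minimal sketch of how such an augmentation pipeline could be composed with audiomentations. The transforms, probabilities, and the impulse-response/background-noise paths below are placeholders rather than the settings used for the released models, and the function name build_transforms only mirrors the call in the data loader further down; transform names may also differ slightly across audiomentations versions.

    import numpy as np
    from audiomentations import (Compose, AddGaussianNoise, AddBackgroundNoise,
                                 ApplyImpulseResponse, PitchShift)

    def build_transforms():
        # Hypothetical augmentation chain; paths and parameters are placeholders.
        return Compose([
            AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.3),
            AddBackgroundNoise(sounds_path="./background_noise", p=0.3),  # e.g. noise clips
            ApplyImpulseResponse(ir_path="./impulse_responses", p=0.3),   # room impulse responses
            PitchShift(min_semitones=-2, max_semitones=2, p=0.2),
        ])

    augment = build_transforms()
    wave = np.random.randn(16000 * 2).astype(np.float32)  # 2 s of dummy audio at 16 kHz
    augmented = augment(samples=wave, sample_rate=16000)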

980202006 commented 2 years ago

Thank you! I will try to implement the code, and I am willing to wait for your training code.

Pathos0925 commented 2 years ago

Hello! I would also be interested in the training code for the ASR and F0 models. Thank you for your work on this code, it is very impressive!

wsstriving commented 2 years ago

Hi, when you say you trained the F0 network using MSE, does that mean it's not the same as the procedure adopted in the original JDC paper (multitask CE loss)? Thanks!

yl4579 commented 2 years ago

@wsstriving Yes, the F0 network is different from the original JDC paper. Here num_class = 1 and the output is a 1-dimensional scalar: the F0 in Hertz.

wsstriving commented 2 years ago

Thanks for the reply. I wonder whether you compared this MSE method with the original classification (722 classes)? I am implementing this part and ran into a few unclear points when referring to your code:

  1. The voice detection branch seems to be there but is not used in the forward pass (I understand it's not needed for the feature extraction part).
  2. In the comments you mentioned that the F0 feature extraction needs the mel prepared with a 31-frame context window, but this is not used in the inference code.

It would be great if you could help clarify the above questions, thanks a lot! Shuai

yl4579 commented 2 years ago

@wsstriving I didn't compare the MSE with classification because the model was trained on speech data, where there are no discrete labels as in the singing data. If you want to do singing conversion, using discrete labels is probably better, but you will also have to normalize the predictions to account for the difference between male and female speakers.

The voice detection branch is not used for inference, but it was used during training because we want the model to be robust to noisy inputs: during training we added background noise and other filters. The silence label at each frame can easily be inferred by thresholding the norm of the mels.

The 31-frame context window was used in the original implementation, but it is not necessary either. If you look more closely, you'll see that seq_len in model.py does not appear in any of the parameters, so it can be inferred from the input shape. I set it to 192 for convenience, but it really doesn't matter, because seq_len = x.shape[-1] is what is actually used; that is, seq_len is the number of frames in the mel-spectrograms. I believe this is either a mistake or redundant code from the original implementation.

980202006 commented 2 years ago

Hi, how is the F0 and ASR training code coming along?

yl4579 commented 2 years ago

I'm still testing the code and will try to get it done as soon as possible.

980202006 commented 2 years ago

Great!

zhangziyi-knu commented 2 years ago

Hi, thank you for your outstanding work. Has anyone completed the training code for F0 and ASR? @yl4579 @980202006 It would be much appreciated if you could provide it :)

yl4579 commented 2 years ago

@zhangziyi-knu Sorry for the delay, but I'm pretty busy with my coursework, so it may take a few more weeks to get the code cleaned and tested.

GreatDarrenSun commented 2 years ago

@yl4579 Thank you very much for your great work. Can you provide more details about the ASR loss? As you mentioned, the ASR models are phoneme-level joint CTC-attention models, with a CTC loss on the CNN output and a cross-entropy loss on the RNN output. How are these losses implemented? Very much looking forward to your reply.

GreatDarrenSun commented 2 years ago

@yl4579 How did you train the ASR network without labels?

980202006 commented 2 years ago

I tried to implement it and found a problem: the model fits F0 poorly over silent segments, as shown in the attached plot (dry vocals of 忽然之间).

980202006 commented 2 years ago

I did not find the YIN algorithm in PyWorld; can you give a code example?

yl4579 commented 2 years ago

@GreatDarrenSun It is not possible to train an ASR model without labels, but the ASR model is not that important in terms of what language it was trained on. I found that ASR models trained on an English corpus also work with Japanese datasets, so the method can be said to be unsupervised. The point of the ASR loss is to make sure the model preserves the formants after conversion. This is why we don't use the final predictions (PPGs) but the intermediate activations for the loss: we believe the intermediate layers of ASR models learn the formants from the MFCCs, which are largely independent of the language.

Here is the code for the loss part of the training. The rest of it is already in models.py. I'm really sorry it is taking so long, but I'm in the last semester of my master's program and about to start my PhD, so things are very busy. Hopefully I'll make the full training code available during the winter break.

In fact, you don't really need the S2S loss here; any CNN-based CTC model would do the job. The reason to use this model is that it also provides phoneme alignments that can be transferred to TTS tasks.

        text_input, text_input_length, mel_input, mel_input_length = batch
        mel_input_length = mel_input_length // self.model.n_down  # account for the temporal downsampling of the CNN
        mel_mask = self.model.length_to_mask(mel_input_length)
        ppgs, s2s_pred, s2s_attn = self.model(
            mel_input, src_key_padding_mask=mel_mask, text_input=text_input)

        # CTC loss on the frame-level CNN output (PPGs)
        loss_ctc = torch.nn.functional.ctc_loss(ppgs.log_softmax(dim=2).transpose(0, 1),
                                      text_input, mel_input_length, text_input_length)

        # S2S loss: cross-entropy on the RNN (attention decoder) output, averaged over the batch
        loss_s2s = 0
        for _s2s_pred, _text_input, _text_length in zip(s2s_pred, text_input, text_input_length):
            loss_s2s += torch.nn.functional.cross_entropy(_s2s_pred[:_text_length], _text_input[:_text_length])
        loss_s2s /= text_input.size(0)

        loss = loss_ctc + loss_s2s
        loss.backward()

yl4579 commented 2 years ago

@980202006 Did you use the silence prediction as well? You'll need to train the model exactly as presented in the JDC paper, where F0 and silence are estimated jointly; otherwise it may still predict erroneous F0 values during silence. (A rough sketch of such a joint training step follows the data loader below.)

The Yin algorithm I referred to is PyWorld's "harvest"; see an example here. Also, I have attached my data loader function in case it helps.

def path_to_mel_and_label(self, path):
        wave_tensor = self._load_tensor(path)

        # use pyworld to get F0
        output_file = path[0] + "_f0.npy"
        # check if the file exists
        if os.path.isfile(output_file): # if exists, load it directly
            f0 = np.load(output_file)
        else: # if not exist, create the F0 file
            x = wave_tensor.numpy().astype("double")
            frame_period = MEL_PARAMS['hop_length'] * 1000 / self.sr
            _f0, t = pw.harvest(x, self.sr, frame_period=frame_period)
            if sum(_f0 != 0) < self.bad_F0: # this happens when the algorithm fails
                _f0, t = pw.dio(x, self.sr, frame_period=frame_period) # if harvest fails, try dio
            f0 = pw.stonemask(x, _f0, t, self.sr)
            # save the f0 info for the later use
            np.save(output_file, f0)

        f0 = torch.from_numpy(f0).float()

        if self.data_augmentation:
            t = build_transforms()
            wave_tensor = t(wave_tensor)

        mel_tensor = self.to_melspec(wave_tensor)
        mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std
        mel_length = mel_tensor.size(1)

        is_silence = torch.zeros(f0.shape)
        is_silence[self._get_log_norm(mel_tensor) < self.threshold] = 1

        if mel_length > self.max_mel_length:
            random_start = np.random.randint(0, mel_length - self.max_mel_length)
            mel_tensor = mel_tensor[:, random_start:random_start + self.max_mel_length]
            f0 = f0[random_start:random_start + self.max_mel_length]
            is_silence = is_silence[random_start:random_start + self.max_mel_length]

        return mel_tensor, f0, is_silence
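As a rough sketch of how the outputs of this loader could feed the joint F0-regression plus silence-classification objective mentioned above (the f0_model interface returning a per-frame F0 estimate and a silence logit is an assumption, not the actual training code):

    import torch
    import torch.nn.functional as F

    def f0_train_step(f0_model, optimizer, mel, f0, is_silence):
        # One hypothetical step: masked MSE on F0 over voiced frames + BCE on silence.
        optimizer.zero_grad()
        f0_pred, silence_logit = f0_model(mel)  # assumed two-headed output

        voiced = (is_silence < 0.5).float()  # ignore silent frames in the regression
        loss_f0 = (F.mse_loss(f0_pred, f0, reduction='none') * voiced).sum() / voiced.sum().clamp(min=1)
        loss_sil = F.binary_cross_entropy_with_logits(silence_logit, is_silence)

        loss = loss_f0 + loss_sil
        loss.backward()
        optimizer.step()
        return loss.item()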

980202006 commented 2 years ago

Thank you!

980202006 commented 2 years ago

Can you provide us with the code you used for data augmentation? I am not very familiar with the parameter settings for the augmentations, etc.

yl4579 commented 2 years ago

@980202006 You may need to listen to the augmented audio and decide whether it is a good choice. There are already a lot of examples in the repo, so be sure to read what each augmentation does, listen to its effect, and find a reasonable composition and parameter settings. Since it depends on the dataset, I don't think what I have now would suit your needs.

980202006 commented 2 years ago

Thank you! I added the auxiliary network, but it does not work; the results are even worse. I have tried the MSE loss, and it converges at around 1500. Are there any more details about your MSE loss?

iehppp2010 commented 2 years ago

Hi @yl4579, another question about the ASR model: what is the PER on your test dataset? And could you provide your phone dict file? I want to test your ASR model with that dict on my own dataset, to check whether an ASR model with a lower PER helps improve the final audio quality.

GreatDarrenSun commented 2 years ago

@yl4579 I wish you all the best during your PhD. Regarding the silence decision, I still have some questions:

  1. Why don't you use the fundamental frequency as the criterion for silence?
  2. What is the value of self.bad_F0, and what are the switching conditions between the harvest and DIO fundamental frequency extraction algorithms?

yl4579 commented 2 years ago

I'm sorry for the very late reply because I was busy at the end of the year.

@980202006 I don't know why it does not work for you, but I'll upload the training code very soon, so stay tuned. What is your classification loss for silence/speech, and did you make the labels correctly using the norm threshold?

@iehppp2010 PER is the classification error per phoneme. For example, the word "pronunciation" has the phonemes pruh-nown-see-AY-shun; if it is misclassified as purh-noon-see-AY-shun, the PER for this word is 40%, as 2/5 phonemes are incorrectly classified (a small sketch of this computation follows the token list below). The phone dict is the following punctuation tokens plus the 70 phonemes from the g2p library. I will also upload the training code soon.

"<pad>",0
"<sos>",1
"<eos>",2
"<unk>",3
" ",4
",",5
".",6
":",7
"!",8
"?",9

@GreatDarrenSun

  1. In fact I used both the norm threshold and F0 for silence, but the YIN algorithm is not perfect and will sometimes produce low F0 predictions even when there is speech, so the norm is also necessary.
  2. bad_F0 is a threshold on the number of non-zero F0 frames: if fewer frames than this are non-zero, the extraction is considered to have failed. It is set to 5 because you should generally have more than 5 frames with F0 above 0, but sometimes the algorithm fails and produces an F0 track with only a few non-zero frames.

GreatDarrenSun commented 2 years ago

@yl4579 Thank you for your reply during your busy schedule. I still have a few questions: Did you normalize the fundamental frequency (F0)? Which optimizer and scheduler did you use, and what learning rate? How many iterations does it take to converge, and what is the error at convergence?

ZhaoZeqing commented 2 years ago

@yl4579 Thanks for your great project! I'm a beginner in voice conversion and I have one question about the F0 module. In voice conversion we want the converted speech to sound like the target speaker rather than the source speaker, so why is F0 extracted from the source audio instead of the target?

ahmeftah commented 2 years ago

Thanks for your good project and for your replies amid your other work. How can I use your work for emotional conversion from neutral to several different emotions at the same time (neutral to sad, happy, and angry)? Any advice on that, and on how to train the model?

yl4579 commented 2 years ago

@GreatDarrenSun Do you refer to the F0 estimator?

@ZhaoZeqing We do not have F0 for the target; the F0 here is the pitch of the input.

@ahmeftah This voice conversion model can only do it with fixed timing (i.e., there is no change of timing), as it is purely CNN-based. It is capable of converting into different emotions: you only need to supply training data with different emotions, and the model will automatically learn the styles under those emotions (a rough sketch of one possible data layout follows).
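As an illustration of that last point, here is a hedged sketch of how one might build the training list with emotions as domains; the path|domain-index line format is my reading of the repo's data lists, and the directory layout and emotion-to-index mapping are made up.

    import os

    # Hypothetical mapping: treat each emotion as one conversion domain.
    emotion_to_domain = {"neutral": 0, "sad": 1, "happy": 2, "angry": 3}

    def write_train_list(wav_dir, out_path="Data/train_list.txt"):
        # Emit one "wav_path|domain_index" line per file, assuming a layout
        # like wav_dir/<emotion>/<utterance>.wav.
        with open(out_path, "w") as f:
            for emotion, idx in emotion_to_domain.items():
                subdir = os.path.join(wav_dir, emotion)
                for name in sorted(os.listdir(subdir)):
                    if name.endswith(".wav"):
                        f.write(f"{os.path.join(subdir, name)}|{idx}\n")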

ZhaoZeqing commented 2 years ago

@yl4579 Thanks for your reply. I have some questions about F0. If the source speaker is female and the target speaker is male, the male F0 is lower than the female F0, so in my humble opinion F0 should be extracted from the target speaker; otherwise the F0 of the output audio may be too high. Yet in many voice conversion models F0 is extracted from the source speaker, and that's what I'm confused about.

ahmeftah commented 2 years ago

Thanks for your reply

skol101 commented 2 years ago

May I chime in? I've noticed this part in SpeechSplit (https://github.com/auspicious3000/SpeechSplit/blob/master/make_spect_f0.py) when calculating F0 on the VCTK dataset:

    if spk2gen[subdir] == 'M':
        lo, hi = 50, 250
    elif spk2gen[subdir] == 'F':
        lo, hi = 100, 600
    else:

yl4579 commented 2 years ago

@ZhaoZeqing That information is encoded in the style vector instead of the F0 curve. The F0 curve is to make sure that the converted speech has the same pitch contour as the source input.

@skol101 This is calculating F0 range for spectrogram calculation. It is not relevant to our approach.

ZhaoZeqing commented 2 years ago

@yl4579 Thanks!

snakers4 commented 2 years ago

Hi @yl4579!

The F0 models are trained with MSE loss to reconstruct the F0 curves estimated from the Yin algorithm using PyWorld.

Stumbled upon your paper and this repo. Great job!

While developing a TTS system, we used several implementations of F0 extraction, and most of them worked very similarly for TTS.

I wonder why you are using a feature map from a network instead of the output of this algorithm directly. I understand why you are using an STT network rather than just adding some form of CTC conditioning, but why not pass the F0 features as-is, or after some rudimentary pre-processing or an embedding layer?

yl4579 commented 2 years ago

@snakers4 Sorry for the late reply. I was pretty busy at the end of my semester. The reason is two-fold:

  1. The model doesn't use the F0 curve but the features, because the features are more robust than the curve: the curve is only 1-d, while the feature maps are 512-d (see the sketch after this list).
  2. Acoustic F0 estimation algorithms may fail sometimes due to non-stationary speech, silence, and other hyperparameter settings; you may refer to the original YIN paper for a detailed discussion. An F0-extraction network, on the other hand, generalizes better (the failed extraction targets are treated as noise in the dataset, which NNs are largely robust to) and can be fine-tuned along with the voice conversion model for better performance.
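To make point 1 concrete, intermediate activations can be tapped with a standard PyTorch forward hook; the layer name "conv" below is a placeholder for whichever layer of the F0 network exposes the 512-d features, not the repo's actual module name.

    import torch

    def get_f0_features(f0_model, mel, layer_name="conv"):
        # Grab an intermediate feature map from the F0 network via a forward hook.
        feats = {}

        def hook(module, inputs, output):
            feats["map"] = output

        handle = dict(f0_model.named_modules())[layer_name].register_forward_hook(hook)
        with torch.no_grad():
            f0_model(mel)      # run a forward pass; only the hooked activation is kept
        handle.remove()
        return feats["map"]    # e.g. a (batch, channels, frames) feature map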

snakers4 commented 2 years ago

Many thanks for the info!

yl4579 commented 2 years ago

@980202006 @Pathos0925 @wsstriving @zhangziyi-knu @GreatDarrenSun @iehppp2010 @ZhaoZeqing @ahmeftah @snakers4 @skol101

Sorry for the long wait, the training code is now available, thank you so much for your patience.

ahmeftah commented 2 years ago

Thank you so much for your great effort.
