The training code is very messy, so it may take another several weeks or even longer to finish. I figure it'll be faster if you just refer to other repos to train your own models, but if you are willing to wait, I may try to clean up the code and make it public.
The F0 models are trained with MSE loss to reconstruct the F0 curves estimated from the Yin algorithm using PyWorld. The ASR models are phoneme-level joint CTC-attention models. The losses are CTC loss on the input from the CNN and the cross-entropy loss on the input from the RNN.
Both F0 and ASR models are trained with data augmentation that can be found here: https://github.com/iver56/audiomentations. You may want to download the impulse response dataset here and the background noise dataset can be downloaded here.
Thank you! I will try to implement the code, and I am willing to wait for your training code.
Hello! I would also be interested in the training code for the ASR and F0 models. Thank you for your work on this code, it is very impressive!
Hi, when you say you trained the F0 Network using MSE, does it mean it's not the same as the procedure adopted in the original JDC paper (Multitask CE loss) ? Thanks!
@wsstriving Yes, the F0 network is different from the original JDC paper. Here num_class = 1, and the output is a 1-dimensional scalar of the F0 in Hertz.
Thanks for the reply, I wonder whether you compared this mse method with the original classification (722 classes)? I am implementing this part and met a few unclear parts when referring to your code:
It would be great if you could help to clarify the above questions, thanks a lot! Shuai
@wsstriving I didn't compare the MSE with classification because the model was trained on the speech data, so there are no discrete labels as in the singing data. If you want to do singing conversion, using discrete labels is probably better, but you will also have to normalize predictions to account for the difference between male and female speakers.
The voice detection branch is not used for inference, but it was used during training because we want the model to be robust against noisy inputs, as during training we added background noise and other filters to make the model more robust. The silence labels at each frame can be easily inferred by thresholding the norms of mels.
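As a rough illustration of the thresholding idea above, here is a minimal sketch. The exact norm and the threshold value are assumptions made for this example (the thread only says the labels come from thresholding the norms of the mels), and `silence_labels` is a hypothetical helper name.

```python
import numpy as np

def silence_labels(log_mel, threshold):
    """Return a 0/1 array marking frames whose log-norm falls below `threshold`.

    log_mel: array of shape (n_mels, n_frames) holding log-mel values.
    threshold: assumed cut-off in the log domain; tune it per dataset.
    """
    # Undo the log, take the L2 norm over mel bins, then go back to the log domain.
    frame_log_norm = np.log(np.linalg.norm(np.exp(log_mel), axis=0))
    return (frame_log_norm < threshold).astype(np.float32)

# Toy input: two loud frames followed by two near-silent frames.
log_mel = np.log(np.array([[1.0, 2.0, 1e-4, 1e-4],
                           [2.0, 1.0, 1e-4, 1e-4]]))
labels = silence_labels(log_mel, threshold=-3.0)
```

In practice the threshold probably has to be picked by inspecting a few utterances, since it depends on how the mels are scaled and normalized.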
The 31-frame context window was used in the original implementation, but it is not necessary either. If you look more closely, you'll see that seq_len in model.py does not appear in any of the parameters and therefore can be inferred from the input shape. I set it to 192 for convenience, but it really doesn't matter because seq_len = x.shape[-1] is the actual setting used. That is, seq_len is the number of frames in the mel-spectrograms. I believe this is either a mistake or redundant code from the original implementation.
Hi, how about the F0 and ASR code?
I'm still testing the code and will try to get it done as soon as possible.
Great!
Hi, thank you for your outstanding work. Has anyone completed the code for F0 and ASR? @yl4579 @980202006 It would be much appreciated if you could provide it :)
@zhangziyi-knu Sorry for the delay, but I'm pretty busy with my coursework, so it may take a few more weeks to get the code cleaned and tested.
@yl4579 Thank you very much for your great work. Can you provide more details about the ASR loss? As you mentioned, the ASR models are phoneme-level joint CTC-attention models, and the losses are CTC loss on the input from the CNN and cross-entropy loss on the input from the RNN. How are these losses implemented? Looking forward to your reply.
@yl4579 How did you train the ASR network without labels?
I tried to implement it and found a problem: for the silent segment f0, the model fits poorly. As shown below.
I did not find the yin algorithm in pyworld, can you give a code example?
@GreatDarrenSun It is not possible to train an ASR model without labels, but the ASR models are not that important in terms of what language it was trained on. I found that ASR models trained on English corpus also work with Japanese datasets, so the method can be said to be unsupervised. The point of ASR loss is to make sure the model preserves the formants after conversion. This is why we don't use the final predictions (PPG) but the intermediate activations for loss because we believe the intermediate layers of ASR models learn the formants from the MFCC that are largely independent of the language.
Here is the code for the loss part of the training. The rest is already in models.py. I'm really sorry it is taking so long, but I'm in the last semester of my master's program and am about to start my PhD, so I'm very busy. Hopefully I'll make the full training code available during the winter break.
In fact, you don't really need this S2S loss here, any CNN-based CTC model would do the job. The reason to use this model is that it also provides phoneme alignment that can be transferred to TTS tasks.
text_input, text_input_length, mel_input, mel_input_length = batch
mel_input_length = mel_input_length // (self.model.n_down)  # downsample the final frames
mel_mask = self.model.length_to_mask(mel_input_length)
ppgs, s2s_pred, s2s_attn = self.model(
    mel_input, src_key_padding_mask=mel_mask, text_input=text_input)
loss_ctc = torch.nn.functional.ctc_loss(ppgs.log_softmax(dim=2).transpose(0, 1),
                                        text_input, mel_input_length, text_input_length)  # CNN CTC Loss
loss_s2s = 0  # RNN S2S Loss
for _s2s_pred, _text_input, _text_length in zip(s2s_pred, text_input, text_input_length):
    loss_s2s += torch.nn.functional.cross_entropy(_s2s_pred[:_text_length], _text_input[:_text_length])
loss_s2s /= text_input.size(0)
loss = loss_ctc + loss_s2s
loss.backward()
@980202006 Did you use the silence prediction as well? You'll need to train the model exactly as presented in the JDC paper, where you have to estimate the F0 and silence jointly, otherwise it may still predict some erroneous F0 values during the silence.
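To make "estimate the F0 and silence jointly" concrete, here is a small framework-agnostic sketch of such an objective: mean squared error on the F0 values plus binary cross-entropy on a per-frame silence logit. The equal weighting of the two terms and the function name `joint_f0_silence_loss` are assumptions for illustration, not the settings actually used for the released models.

```python
import numpy as np

def joint_f0_silence_loss(f0_pred, f0_true, sil_logit, is_silence):
    """Return (total, f0_term, silence_term), each averaged over frames."""
    f0_pred = np.asarray(f0_pred, dtype=float)
    f0_true = np.asarray(f0_true, dtype=float)
    sil_logit = np.asarray(sil_logit, dtype=float)
    is_silence = np.asarray(is_silence, dtype=float)

    # Regression term: MSE between predicted and reference F0 (in Hz).
    loss_f0 = np.mean((f0_pred - f0_true) ** 2)

    # Classification term: numerically stable BCE-with-logits,
    # max(z, 0) - z * y + log(1 + exp(-|z|)).
    loss_sil = np.mean(np.maximum(sil_logit, 0) - sil_logit * is_silence
                       + np.log1p(np.exp(-np.abs(sil_logit))))

    # Equal weighting is an assumption; a real setup may scale the terms.
    return loss_f0 + loss_sil, loss_f0, loss_sil

total, loss_f0, loss_sil = joint_f0_silence_loss(
    f0_pred=[100.0, 0.0], f0_true=[110.0, 0.0],
    sil_logit=[0.0, 0.0], is_silence=[0.0, 1.0])
```

In a PyTorch training loop the same two terms could be written with `torch.nn.functional.mse_loss` and `torch.nn.functional.binary_cross_entropy_with_logits`.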
The Yin algorithm is "harvest", see an example here. Also, I have attached my data loader function if it helps.
def path_to_mel_and_label(self, path):
    wave_tensor = self._load_tensor(path)
    # use pyworld to get F0
    output_file = path[0] + "_f0.npy"
    # check if the file exists
    if os.path.isfile(output_file):  # if exists, load it directly
        f0 = np.load(output_file)
    else:  # if not exist, create the F0 file
        x = wave_tensor.numpy().astype("double")
        frame_period = MEL_PARAMS['hop_length'] * 1000 / self.sr
        _f0, t = pw.harvest(x, self.sr, frame_period=frame_period)
        if sum(_f0 != 0) < self.bad_F0:  # this happens when the algorithm fails
            _f0, t = pw.dio(x, self.sr, frame_period=frame_period)  # if harvest fails, try dio
        f0 = pw.stonemask(x, _f0, t, self.sr)
        # save the f0 info for the later use
        np.save(output_file, f0)
    f0 = torch.from_numpy(f0).float()
    if self.data_augmentation:
        t = build_transforms()
        wave_tensor = t(wave_tensor)
    mel_tensor = self.to_melspec(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std
    mel_length = mel_tensor.size(1)
    is_silence = torch.zeros(f0.shape)
    is_silence[self._get_log_norm(mel_tensor) < self.threshold] = 1
    if mel_length > self.max_mel_length:
        random_start = np.random.randint(0, mel_length - self.max_mel_length)
        mel_tensor = mel_tensor[:, random_start:random_start + self.max_mel_length]
        f0 = f0[random_start:random_start + self.max_mel_length]
        is_silence = is_silence[random_start:random_start + self.max_mel_length]
    return mel_tensor, f0, is_silence
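Two details of the loader above can be isolated in a dependency-free sketch: the frame period is simply the mel hop size expressed in milliseconds, and the second extractor only runs when the first one returns too few voiced frames. The hop length of 300 and sample rate of 24000 are assumed values for illustration, and the two extractor functions are hypothetical stand-ins for `pw.harvest` and `pw.dio`.

```python
def frame_period_ms(hop_length, sample_rate):
    """Hop size in milliseconds, so that F0 frames line up with mel frames."""
    return hop_length * 1000 / sample_rate

def extract_f0_with_fallback(wave, primary, fallback, min_voiced_frames=5):
    """Run `primary`; if it yields too few non-zero F0 frames, use `fallback`."""
    f0 = primary(wave)
    if sum(1 for value in f0 if value != 0) < min_voiced_frames:
        f0 = fallback(wave)
    return f0

# Hypothetical stand-in extractors: the first "fails" with only two voiced frames.
def failing_extractor(wave):
    return [0.0, 120.0, 0.0, 118.0, 0.0, 0.0]

def working_extractor(wave):
    return [119.0, 120.0, 121.0, 118.0, 117.0, 0.0]

period = frame_period_ms(hop_length=300, sample_rate=24000)
f0 = extract_f0_with_fallback([0.0] * 1800, failing_extractor, working_extractor)
```

With these assumed values the period comes out to 12.5 ms per frame, and the fallback result is returned because the first extractor produced fewer than 5 voiced frames.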
Thank you!
Can you provide the code you used for data augmentation? I am not very familiar with the parameter settings for data augmentation.
@980202006 You may need to listen to the augmented audios and decide if it is a good choice. There are already a lot of examples in the repo, so be sure to read what each augmentation does and listen to the effects of them and find a reasonable composition and parameter settings. Since it depends on the dataset, I don't think what I have now would suit your need.
Thank you! I added the auxiliary network, but it does not work; the results are even worse. I have tried the MSE loss, and it converges at 1500. Are there any details about your MSE loss?
hi @yl4579 , another question about the ASR model. What is the PER criterion on your test dataset? And could you provide your "phone dict" file, I want to test your ASR model with the dict file on my own dataset. I want to validate whether an ASR model with lower PER would help to improve final audio quality.
@yl4579 I wish you all the best during your PhD. Regarding the silence detection, I still have some questions. 1. Why don't you use the fundamental frequency as the criterion for silence? 2. What is the value of self.bad_F0, and what are the conditions for switching between the harvest and DIO fundamental frequency extraction algorithms?
I'm sorry for the very late reply because I was busy at the end of the year.
@980202006 I don't know why it does not work for you, but I'll upload the training code very soon, so stay tuned for that. What is the classification loss for silence/speech and did you make the labels correctly using the norm threshold?
@iehppp2010 PER is defined as the classification error per phoneme, so for example, the word "pronunciation" has phonemes pruh-nown-see-AY-shun, and it may be misclassified as purh-noon-see-AY-shun, so the PER for this word is 40%, as 2/5 phonemes are incorrectly classified. The phone dict is the following punctuation tokens plus 70 phonemes from the g2p library. I will also upload the training code soon.
"<pad>",0
"<sos>",1
"<eos>",2
"<unk>",3
" ",4
",",5
".",6
":",7
"!",8
"?",9
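The 40% figure in the "pronunciation" example above can be reproduced with a short sketch. It uses the edit distance between the reference and predicted sequences, so insertions and deletions are counted as well as substitutions; that choice is an assumption made here and not necessarily how the PER was measured for the released model.

```python
def phoneme_error_rate(reference, hypothesis):
    """Levenshtein distance between two token sequences over the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)

reference = ["pruh", "nown", "see", "AY", "shun"]
hypothesis = ["purh", "noon", "see", "AY", "shun"]
per = phoneme_error_rate(reference, hypothesis)  # two substitutions out of five units
```

Because the distance is normalized by the reference length, the value can exceed 1.0 when the prediction contains many insertions.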
@GreatDarrenSun bad_F0 is the minimum number of non-zero frames expected, so if the extracted F0 has fewer non-zero frames than this number, the extraction is considered to have failed. It is set to 5, as generally you should have more than 5 frames of F0 higher than 0, but sometimes the algorithm fails and produces F0 with only a few non-zero frames.
@yl4579 Thank you for replying despite your busy schedule. I still have a few questions. Have you normalized the fundamental frequency F0? Which optimizer and scheduler did you use, and what is the learning rate? How many iterations does it take to converge, and what is the error at convergence?
@yl4579 Thanks for your great project! I'm a beginner in voice conversion and I have one question about the F0 module. In voice conversion, we want the converted sound to be more similar to the target speaker than to the source speaker, so why is F0 extracted from the source audio instead of the target?
Thanks for your good project and for replying despite your workload. How can I use your work for emotional conversion from neutral to several different emotions at the same time (neutral to sad, happy, and angry)? Any advice on that, and on how to train the model?
@GreatDarrenSun Are you referring to the F0 estimator?
@ZhaoZeqing We do not have F0 for the target; the F0 here is the pitch of the input.
@ahmeftah This voice conversion model can only do it with fixed timing (i.e., there is no change of timing) as it is purely CNN based. It is capable of converting into different emotions. You only need to supply training data with different emotions and it can automatically learn the styles under the emotions.
@yl4579 Thanks for your reply. I have some questions about F0. If the source speaker is female and the target speaker is male, the F0 of the male speaker is lower than that of the female speaker. In my humble opinion, F0 should be extracted from the target speaker; otherwise, the F0 of the output audio may be higher. But in many voice conversion models, F0 is extracted from the source speaker. That's what I'm unsure about.
Thanks for your reply
May I chime in. I've noticed in SpeechSplit https://github.com/auspicious3000/SpeechSplit/blob/master/make_spect_f0.py this part when calculating F0 on VCTK dataset.
if spk2gen[subdir] == 'M':
    lo, hi = 50, 250
elif spk2gen[subdir] == 'F':
    lo, hi = 100, 600
else:
@ZhaoZeqing That information is encoded in the style vector instead of the F0 curve. The F0 curve is to make sure that the converted speech has the same pitch contour as the source input.
@skol101 This is calculating F0 range for spectrogram calculation. It is not relevant to our approach.
@yl4579 Thanks!
Hi @yl4579!
The F0 models are trained with MSE loss to reconstruct the F0 curves estimated from the Yin algorithm using PyWorld.
Stumbled upon your paper and this repo. Great job!
While developing a TTS system, we used several implementations of F0 extraction, and most of them worked very similarly for TTS.
I wonder why you are using a feature map from a network instead of this algorithm directly. I understand why you are using an STT network instead of just adding some form of CTC conditioning, but why not just pass the F0 features as-is, or after some rudimentary pre-processing or an embedding layer?
@snakers4 Sorry for the late reply. I was pretty busy at the end of my semester. The reason is two-fold:
Many thanks for the info!
@980202006 @Pathos0925 @wsstriving @zhangziyi-knu @GreatDarrenSun @iehppp2010 @ZhaoZeqing @ahmeftah @snakers4 @skol101
Sorry for the long wait, the training code is now available, thank you so much for your patience.
Thank you so much for your great effort.
On Wed, Jun 15, 2022 at 8:15 AM Aaron (Yinghao) Li wrote:
Closed #9 https://github.com/yl4579/StarGANv2-VC/issues/9 as completed.
Ali Hamid Meftah, College of Computer and Information Sciences http://ccis.ksu.edu.sa/en, King Saud University
I want to retrain the ASR and F0 models with my own dataset. Could you provide the code, or are there any precautions to take when building the training code?