yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
MIT License

Some doubt about any to any voice conversion #6

Open 980202006 opened 2 years ago

980202006 commented 2 years ago

Hi, thanks for this project. I have tried removing the domain information from the style encoder, which does work to some extent and can generate natural-sounding audio, but with the following problems:

  1. Low similarity with the target speaker.
  2. The sound quality drops significantly. Reconstruction works better when the original audio is fed to the style encoder.

Data used:

  1. 60 speakers: 20 English speakers, 6 Chinese singers, 1 English/Korean singer, 12 English singers, and the rest Chinese speech.
  2. Batch size: 32 (8 per GPU).

Can you provide some suggestions, regarding either the data or the model?
Kristopher-Chen commented 2 years ago

@Kristopher-Chen were you able to train MB-MelGAN from the mentioned repo? How did you manage to train it so that it works with StarGANv2-VC? I can see that the main issue is the preprocessing incompatibility.

@skol101 I'm trying HiFi-GAN. The thing to pay attention to is replacing the vocoder's original feature extraction with the one used in this VC project. As for MB-MelGAN, I tried the original settings and ran into phasiness artifacts, probably due to discontinuous phase between the different bands.
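
For reference, here is a minimal sketch of the kind of log-mel extraction the VC side uses; the parameter values (n_mels, n_fft, win_length, hop_length, mean, std) are assumptions taken from the repo's default config, so verify them against your own config.yml before training a vocoder on these features:

    import torch
    import torchaudio

    # Compute mels the same way as the VC project so the vocoder sees matching
    # features; double-check these values against your config.
    to_mel = torchaudio.transforms.MelSpectrogram(
        n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
    mean, std = -4.0, 4.0  # log-mel normalization constants

    def wav_to_mel(wave: torch.Tensor) -> torch.Tensor:
        # wave: 1-D float tensor (the repo's data is 24 kHz)
        mel = to_mel(wave)
        return (torch.log(1e-5 + mel) - mean) / std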

Kristopher-Chen commented 2 years ago

@yl4579 @Kristopher-Chen How do you extract the F0 curve? I extract F0 with pyworld DIO and derive the silence (unvoiced) information only from F0, and I also normalized F0. Is it OK to do so? Looking forward to your reply :)

@ZhaoZeqing I also extract F0 with pyworld. The F0 is normalized per sequence (per utterance).
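
For what it's worth, a rough sketch of F0 extraction with pyworld DIO plus one possible per-utterance normalization; the exact normalization used above is not specified, so treat the z-scoring below as an assumption:

    import numpy as np
    import pyworld as pw

    def extract_f0(wave: np.ndarray, sr: int = 24000) -> np.ndarray:
        # DIO gives a coarse F0 track; StoneMask refines it. Output is in Hz,
        # with 0 for unvoiced frames.
        x = wave.astype(np.float64)
        f0, t = pw.dio(x, sr)
        return pw.stonemask(x, f0, t, sr)

    def normalize_per_utterance(f0: np.ndarray) -> np.ndarray:
        # z-score over voiced frames only, leaving unvoiced frames at 0.
        voiced = f0 > 0
        if not voiced.any():
            return f0
        out = np.zeros_like(f0)
        out[voiced] = (f0[voiced] - f0[voiced].mean()) / (f0[voiced].std() + 1e-8)
        return out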

ZhaoZeqing commented 2 years ago

@Kristopher-Chen Thanks for your reply! Did you use absolute values when training the F0 model like this? I don't know why the absolute value needs to be taken here.

Kristopher-Chen commented 2 years ago

@Kristopher-Chen Thanks for your reply! Did you use absolute values when training the F0 model like this? I don't know why the absolute value needs to be taken here.

@ZhaoZeqing Actually, I did not train the F0 model. As for the absolute function in JDCNet: the last layer is a linear layer, which cannot guarantee positive values, so taking the absolute value is probably meant to make training converge faster and more stably.
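
A toy snippet to illustrate the point (made-up layer names, not the actual JDCNet code):

    import torch
    import torch.nn as nn

    head = nn.Linear(256, 1)        # a plain linear head can output negative values
    h = torch.randn(8, 256)
    f0_raw = head(h)                # unconstrained prediction
    f0_pos = torch.abs(head(h))     # abs() forces a non-negative F0 prediction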

ZhaoZeqing commented 2 years ago

@Kristopher-Chen Thanks for your reply! Did you use absolute values when training the F0 model like this? I don't know why the absolute value needs to be taken here.

@ZhaoZeqing Actually, I did not train the F0 model. As for the absolute function in JDCNet: the last layer is a linear layer, which cannot guarantee positive values, so taking the absolute value is probably meant to make training converge faster and more stably.

@Kristopher-Chen Thanks!

yl4579 commented 2 years ago

@980202006 Yes, once we put the new paper on arXiv I will let you know. The new paper uses the ASR models too, so I will make the training code available after it's made public; sorry it is taking forever.

@ZhaoZeqing The F0 model extracts the F0 curves, which is more robust than pyworld in my experience, even though it was trained with ground-truth labels from pyworld. You can also use pyworld if you prefer, but then you cannot fine-tune the F0 model along with the conversion model. As for training the F0 model, the absolute value actually makes training slower because the solution space is smaller, which hampers convergence. That is why, for regression problems, people usually use linear projection heads without any activation function even when the solution is strictly positive.

ZhaoZeqing commented 2 years ago

@yl4579 Thanks for your reply! I used the mean and variance to normalize the raw F0 values, so I didn't use the absolute value here; the F0 model converged, but the final VC model is bad. What normalization method did you use to process the F0 data?

[image: 10_f0]

yl4579 commented 2 years ago

@ZhaoZeqing Did you use normalized F0 as the ground-truth labels when training the F0 model? You should train the F0 model with absolute F0 in Hz (i.e., not normalized). When you train the voice conversion model, use the normalized F0 curve instead. If you train the F0 model with normalized F0, it will have scale problems because of the long-range dependencies imposed by the LSTM layer in the JDC network.

ZhaoZeqing commented 2 years ago

@yl4579 I used the normalized F0 as the ground truth labels to train the F0 model 😢 I'll try the absolute F0 in Hz. But how do I use the normalized F0 curve when training the VC model? Do I need to add an instance normalization layer after the F0 model?

TimothyFDavison commented 2 years ago

@980202006

In addition, I noticed that there are similarity problems between cross-domain speakers. For example, I train on Chinese singing data, but if the target speaker is an English speaker from LJSpeech, the similarity is very low.

@980202006 How are you doing any-to-any mappings? Did you replace the original style encoder with a one-shot encoder like an x-vector?

I just use the mel as input and remove the 'y'. An x-vector can also be used as the style encoder.

Thanks all for the great code and discussion! I'm new to working with these models, so please forgive what might be an introductory question. In order to run on unseen speakers by removing the y input, are you using the modified StyleEncoder class with just one linear layer in self.unshared (as seen in 980202006's comment on Nov 14th 2021)? Or are there other changes to the class' forward function that remove the dependency on the y parameter?

As a final question - is a pretrained model available for the modified (any to any) StyleEncoder class, or should I modify the class and train my own?

980202006 commented 2 years ago

I just deleted the unshared linear layer; what remains happens to be a standard speaker-recognition network.
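
For anyone following along, a minimal sketch of what such a domain-free style encoder could look like; this is only an illustration of the idea (simplified conv trunk, single projection, no y argument), not the repo's exact StyleEncoder class:

    import torch
    import torch.nn as nn

    class AnyToAnyStyleEncoder(nn.Module):
        # Shared conv trunk kept, per-domain 'unshared' linears replaced by a
        # single projection, and forward() no longer takes a domain label y.
        def __init__(self, dim_in=64, style_dim=64):
            super().__init__()
            self.shared = nn.Sequential(   # stands in for the repo's conv blocks
                nn.Conv2d(1, dim_in, 3, 2, 1), nn.LeakyReLU(0.2),
                nn.Conv2d(dim_in, dim_in, 3, 2, 1), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool2d(1))
            self.proj = nn.Linear(dim_in, style_dim)

        def forward(self, mel):            # mel: (B, 1, n_mels, frames)
            h = self.shared(mel).flatten(1)
            return self.proj(h)            # one style code for any speaker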

980202006 commented 2 years ago

@yl4579 I want to further disentangle the speaker information from the F0 and content. I tried a VQ-VAE, but it fails: the model is difficult to converge, and the output audio is meaningless human voice. I added the VQ-VAE codebook loss to the cycle loss; the code is as follows. Can you help me see where the problem is? The VQ-VAE is implemented using https://github.com/bshall/ZeroSpeech

        # --- added inside __init__ ---
        self.codebook = VQEmbeddingEMA(512, 512)       # VQ codebook for the content features
        self.codebook_f0 = VQEmbeddingEMA(256, 256)    # VQ codebook for the F0 features
        self.jitter = Jitter()
        self.jitter_2 = Jitter()
        self.mid_layer_1 = nn.Conv2d(512, 512, (7, 3), 1, 1)
        self.mid_layer_2 = nn.Conv2d(256, 256, (12, 3), 1, 1)
        self.linear = nn.Linear(1, 5)                  # note: appears unused below
        self.conv1x1_x = nn.Conv2d(1, 5, 1, 1, 0)
        self.conv1x1_f0 = nn.Conv2d(1, 5, 1, 1, 0)

    def codebook_encode(self, x, f0):
        # Collapse the frequency axis of the content features, quantize them with
        # the codebook, apply jitter regularization, then restore a 4-D layout.
        x = self.mid_layer_1(x)
        x = x.squeeze(2)
        x, x_loss, perplexity = self.codebook(x.transpose(1, 2))
        x = self.jitter(x)
        x = self.conv1x1_x(x.unsqueeze(2).transpose(2, 1))
        # Same procedure for the F0 features.
        f0 = self.mid_layer_2(f0)
        f0 = f0.squeeze(2)
        f0, f0_loss, perplexity = self.codebook_f0(f0.transpose(1, 2))
        f0 = self.jitter_2(f0)
        f0 = self.conv1x1_f0(f0.unsqueeze(2).transpose(2, 1))
        return x.transpose(3, 1).transpose(-1, -2), f0.transpose(3, 1).transpose(-1, -2), x_loss, f0_loss

    def forward(self, x, s, masks=None, F0=None, ret_vq_loss=False):
        x = self.stem(x)
        cache = {}
        for block in self.encode:
            if (masks is not None) and (x.size(2) in [32, 64, 128]):
                cache[x.size(2)] = x
            x = block(x)  # e.g., (1, 512, 5, 48)
        # Quantize both the content features and the F0 features.
        x, F0, x_loss, f0_loss = self.codebook_encode(x, F0)
        if F0 is not None:
            F0 = self.F0_conv(F0)
            F0 = F.adaptive_avg_pool2d(F0, [x.shape[-2], x.shape[-1]])
            x = torch.cat([x, F0], axis=1)
        for block in self.decode:
            x = block(x, s)
            if (masks is not None) and (x.size(2) in [32, 64, 128]):
                mask = masks[0] if x.size(2) in [32] else masks[1]
                mask = F.interpolate(mask, size=x.size(2), mode='bilinear')
                x = x + self.hpf(mask * cache[x.size(2)])
        if ret_vq_loss:
            return self.to_out(x), x_loss, f0_loss
        else:
            return self.to_out(x)
Kristopher-Chen commented 2 years ago

Actually, do we have to retrain the x-vector to adapt to the feature preprocessing, instead of directly using the pre-trained models?

yl4579 commented 2 years ago

@980202006 My guess is probably the cycle loss. In the original paper, they used reconstruction loss instead of cycle loss. Cycle loss is harder to converge than reconstruction loss because of vanishing gradient (you will need to backpropagate through the model twice) and discriminator intervention (the gradient also contains adversarial loss, which is inherently unstable). If you use only reconstruction loss, you do not need to use adversarial loss and you only need to backpropagate through the model once. Try to use reconstruction loss here and see if the performance is better.

@Kristopher-Chen If you train with the style encoder after removing the unshared layers, it becomes a speaker recognition model as @980202006 said.
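
If it helps, here is a minimal sketch of the reconstruction-loss idea; the names are hypothetical (generator stands for the modified VQ generator posted above, style_encoder for whatever style representation is used), so treat it as a sketch rather than the recommended implementation:

    import torch.nn.functional as F

    def reconstruction_loss(generator, style_encoder, mel, f0):
        # Encode the utterance's own style and ask the generator to reproduce
        # the same mel: one forward pass, no discriminator involved.
        style = style_encoder(mel)
        recon, vq_loss_x, vq_loss_f0 = generator(
            mel, style, F0=f0, ret_vq_loss=True)
        return F.l1_loss(recon, mel) + vq_loss_x + vq_loss_f0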

980202006 commented 2 years ago

@980202006 My guess is probably the cycle loss. In the original paper, they used reconstruction loss instead of cycle loss. Cycle loss is harder to converge than reconstruction loss because of vanishing gradient (you will need to backpropagate through the model twice) and discriminator intervention (the gradient also contains adversarial loss, which is inherently unstable). If you use only reconstruction loss, you do not need to use adversarial loss and you only need to backpropagate through the model once. Try to use reconstruction loss here and see if the performance is better.

@Kristopher-Chen If you train with the style encoder after removing the unshared layers, it becomes a speaker recognition model as @980202006 said.

@yl4579 Thank you! I will try it. This does sound like the reason. I found that for some speakers I haven't seen before, the voice conversion result is OK, but for other speakers it is poor. Is this because those speakers' vocal characteristics are not present in the dataset? Is the formant ratio sufficient to uniquely identify a person's timbre, or is there some absolute representation of a person's timbre?

980202006 commented 2 years ago

Actually, do we have to retrain the x-vector to adapt to the feature preprocessing, instead of directly using the pre-trained models?

This is not necessary; an x-vector may improve generalization.

ZhaoZeqing commented 2 years ago

@yl4579 I used the normalized F0 as the ground truth labels to train the F0 model 😢 I'll try the absolute F0 in Hz. But how do I use the normalized F0 curve when training the VC model? Do I need to add an instance normalization layer after the F0 model?

@yl4579 I trained the F0 and ASR models on my own dataset; I didn't normalize F0 when training the F0 model. Then I trained the VC model based on my F0 and ASR models, but the converted audio is not as good as when using your pretrained F0 and ASR models. I tried adding an instance normalization layer before feeding F0 into the generator, but it didn't work. Any suggestions about this?

Kristopher-Chen commented 2 years ago

Using multiple discriminators is effective: when the model converges, the sound quality on unseen speakers is better, and the similarity to the target speaker is better than with the original setup. If I use an x-vector, the model can capture voice characteristics that do not appear in the training set, but the sound quality is worse, and capturing those characteristics alone does not improve overall speaker similarity very much. With the original style encoder, many unseen voice characteristics are lost. Can you give some optimization suggestions? In addition, I would like to ask how to fine-tune on an unseen speaker based on the trained model, especially since the discriminator does not reserve an index for unseen speakers.

Hi, how do you apply multiple discriminators? It seems quite complicated since it is tied to the speaker identities.

Kristopher-Chen commented 2 years ago

When training with more speakers, e.g., 100 speakers, have you ever run into the problem that after training for about 200 epochs some of the losses become NaN? [image]

yl4579 commented 2 years ago

@ZhaoZeqing Were you able to get the correct F0 curve using your own model? I will make the ASR and F0 training codes available soon. Something happened so I still didn't get a chance to clean it up.

@Kristopher-Chen This is likely because of the mixed-precision training. Note that if you enable FP16, your floats will be 16 bits instead of 32 bits. When the value of an intermediate feature exceeds the float16 maximum (about 65504), it overflows and becomes NaN. You can either disable mixed precision or apply weight_norm to the weights so that the feature values stay within the float16 range.
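
A small sketch of the weight_norm option (applying it to every conv layer here is an assumption, not the repo's exact recipe):

    import torch.nn as nn
    from torch.nn.utils import weight_norm

    def apply_weight_norm(model: nn.Module) -> nn.Module:
        # Reparameterize conv weights as magnitude * direction, which tends to
        # keep intermediate activations in a range that is safer for FP16.
        for m in model.modules():
            if isinstance(m, (nn.Conv1d, nn.Conv2d)):
                weight_norm(m)
        return model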

yl4579 commented 2 years ago

I found that for some speakers I haven't seen before, the voice conversion result is OK, but for other speakers it is poor. Is this because those speakers' vocal characteristics are not present in the dataset? Is the formant ratio sufficient to uniquely identify a person's timbre, or is there some absolute representation of a person's timbre?

@980202006 A person's timbre is determined by many things; it's not just the formants but also the energy and high-frequency harmonics (which by definition are also formants, but we usually don't consider formants of higher orders). Note that AdaIN normalizes to those features of a person's voice, and the main reason it doesn't work very well for some speakers is covariate shift (i.e., the speaker is too different from the speakers seen in the training set). I don't believe handcrafting any specific features helps here; most if not all deep learning problems can be solved by enlarging the model capacity and adding more data, unfortunately.

ZhaoZeqing commented 2 years ago

@yl4579 My own F0 model seems OK, like this: [image]

But I didn't add noise for augmentation when training the ASR and F0 models. Is data augmentation necessary?

One more question: I want to train an any-to-one VC model. Do I need to use an auto-encoder instead of StarGAN?

980202006 commented 2 years ago

I found that for some speakers I haven't seen before, the voice conversion result is OK, but for other speakers it is poor. Is this because those speakers' vocal characteristics are not present in the dataset? Is the formant ratio sufficient to uniquely identify a person's timbre, or is there some absolute representation of a person's timbre?

@980202006 A person's timbre is determined by many things; it's not just the formants but also the energy and high-frequency harmonics (which by definition are also formants, but we usually don't consider formants of higher orders). Note that AdaIN normalizes to those features of a person's voice, and the main reason it doesn't work very well for some speakers is covariate shift (i.e., the speaker is too different from the speakers seen in the training set). I don't believe handcrafting any specific features helps here; most if not all deep learning problems can be solved by enlarging the model capacity and adding more data, unfortunately.

@yl4579 Thanks. I also found that audio recorded from the mobile-phone H5 page converts poorly, similar to this example; in contrast, conversion of dry (clean) recordings is OK. Is there any solution for mobile-phone channel compensation or data augmentation? This is a dry-voice conversion example: https://drive.google.com/drive/folders/1kcl8WH8r7MLP4iGrmNyHEe682XViR2_K?usp=sharing This is the result of a mobile-phone recording: https://drive.google.com/drive/folders/115KJUzg7wvKHHZkJI2loBZJ90Fp4pV-L

980202006 commented 2 years ago

This is more likely to be a problem with your data or model, or a back-propagation problem caused by a particular torch operation. Since the model cannot fit the data well, it keeps trying to increase or decrease the scale of the features.

980202006 commented 2 years ago

Using multiple discriminators is effective: when the model converges, the sound quality on unseen speakers is better, and the similarity to the target speaker is better than with the original setup. If I use an x-vector, the model can capture voice characteristics that do not appear in the training set, but the sound quality is worse, and capturing those characteristics alone does not improve overall speaker similarity very much. With the original style encoder, many unseen voice characteristics are lost. Can you give some optimization suggestions? In addition, I would like to ask how to fine-tune on an unseen speaker based on the trained model, especially since the discriminator does not reserve an index for unseen speakers.

Hi, how do you apply multiple discriminators? It seems quite complicated since it is tied to the speaker identities.

I am still trying to sort out the ideas here. The basic idea is to use multiple discriminators, each of which only discriminates a subset of the speakers (randomly selected).
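
A rough sketch of that idea, purely for illustration (make_disc is a hypothetical factory that builds a discriminator for a given number of domains; the random split is the point):

    import random
    import torch.nn as nn

    class MultiDiscriminator(nn.Module):
        # Split the speaker set across several discriminators; each one only
        # handles real/fake decisions for its own randomly chosen subset.
        def __init__(self, make_disc, num_speakers, num_discs=4, seed=0):
            super().__init__()
            ids = list(range(num_speakers))
            random.Random(seed).shuffle(ids)
            self.groups = [ids[i::num_discs] for i in range(num_discs)]
            self.discs = nn.ModuleList(make_disc(len(g)) for g in self.groups)
            self.spk_to_disc = {s: (d, i)
                                for d, g in enumerate(self.groups)
                                for i, s in enumerate(g)}

        def forward(self, x, speaker_id):
            d, local_id = self.spk_to_disc[int(speaker_id)]
            return self.discs[d](x, local_id)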

yl4579 commented 2 years ago

@980202006 The clean voice sounds very good, though fine-tuning the vocoder would improve the sound quality. You may want to use vocoders specifically designed for singing synthesis.

However, I cannot listen to the mobile phone recorded results. I don't have permission for that, can you share the folder please?

Although I can't listen to the samples, my guess is that voices recorded with mobile phones are worse in sound quality so the speakers' characteristics cannot be well captured by the model. You can either use data augmentation to corrupt the input to the style encoder for a more robust style representation or you can just do speech enhancement to make the sound quality better. This for example sounds exceptionally good: https://daps.cs.princeton.edu/projects/Su2021HiFi2/index.php
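
For the corruption idea, a simple starting point is additive noise at a target SNR (an assumption, not the repo's augmentation; reverberation would additionally need convolution with a room impulse response):

    import numpy as np

    def corrupt_for_style_encoder(wave: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
        # Add white noise at the given SNR so the style encoder has to rely on
        # speaker characteristics that survive recording-condition changes.
        rng = np.random.default_rng()
        signal_power = np.mean(wave ** 2) + 1e-12
        noise_power = signal_power / (10 ** (snr_db / 10))
        noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
        return (wave + noise).astype(wave.dtype)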

980202006 commented 2 years ago

@yl4579 Thank you. Is there a better way to do data augmentation? I tried the common augmentations (adding reverberation and noise), but they did not produce good results. I modified the folder permissions and it should be viewable now.

980202006 commented 2 years ago

@yl4579 I think I'm missing some background on the kinds of problems deep learning models can have. Are there any review papers that cover issues such as covariate shift?

yl4579 commented 2 years ago

@980202006 Did you add the reverb to the input of the style encoder? How did you do the data augmentation? As for the deep learning problems, I believe this is less a problem of deep learning and more of machine learning in general. I'd suggest taking systematic machine learning classes that focus on the theory (rather than the practice).

980202006 commented 2 years ago

@yl4579 Yes, I added reverb to the input data of the style encoder. Thank you.

Kristopher-Chen commented 2 years ago

@yl4579 @980202006 I found that speech intelligibility gets worse compared to the source, especially when I test Chinese speech on a model trained with English data. How can this be mitigated?

And @980202006, are you using a multi-language ASR model for the multi-speaker training, since your dataset includes Chinese, English, and singing?

Kristopher-Chen commented 2 years ago

@980202006 I listened to your demos, and I think they are pretty good; the speech intelligibility in particular is very good. How did you manage that? Could you leave an e-mail address so we can discuss the details of the Chinese demos further?

yl4579 commented 2 years ago

@980202006 How does the result differ when your input to the style encoder is reverberated and not reverberated? Do they sound similar or quite different?

yl4579 commented 2 years ago

@Kristopher-Chen the original model was not proposed to tackle cross-lingual voice conversion, so you may need to train an ASR model that works for both English and Chinese (e.g., using IPAs) and train a model with both English and Chinese data. The ASR training code will be made available soon, at the latest in late May.
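
For the IPA route, one possibility is the phonemizer package with an espeak-ng backend so English and Chinese transcripts share one label space; this is only an illustration (language codes depend on your espeak-ng installation), not the repo's ASR pipeline:

    from phonemizer import phonemize

    # Convert transcripts of both languages into a shared IPA label space.
    en = phonemize("voice conversion", language="en-us", backend="espeak", strip=True)
    zh = phonemize("语音转换", language="cmn", backend="espeak", strip=True)
    print(en, zh)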

Kristopher-Chen commented 2 years ago

@Kristopher-Chen the original model was not proposed to tackle cross-lingual voice conversion, so you may need to train an ASR model that works for both English and Chinese (e.g., using IPAs) and train a model with both English and Chinese data. The ASR training code will be made available soon, at the latest in late May.

@yl4579 Recently, I trained a model with 100 speakers from VCTK. When evaluating it, I ran into some problems.

https://drive.google.com/drive/folders/1lraGNF3tGzExGnmhvXo3QDrc72uE23zg?usp=sharing

1) Speech intelligibility degradation, as mentioned, even on seen speakers. You can find examples at the link above. Also, I found that the ASR loss decreased from about 0.3 to 0.1 when the number of speakers increased from 20 to 100.

2) Reference audio for the style encoder. When testing with different references for the same target speaker, the results vary significantly, and some become unacceptable (examples at the link above as well). Should I use the average over more sentences, and otherwise how should I choose a proper reference? (See the sketch below.)
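
On point 2, one common workaround is to average the style code over several reference utterances of the same target speaker; a sketch with hypothetical names (style_encoder is the trained StyleEncoder, y the target-speaker index):

    import torch

    @torch.no_grad()
    def average_style(style_encoder, ref_mels, y):
        # ref_mels: list of (1, 1, n_mels, frames) tensors from the target speaker.
        styles = [style_encoder(m, torch.LongTensor([y])) for m in ref_mels]
        return torch.stack(styles, dim=0).mean(dim=0)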

980202006 commented 2 years ago

@980202006 How does the result differ when your input to the style encoder is reverberated and not reverberated? Do they sound similar or quite different?

If reverberated data is added to the style-encoder input during training, it alleviates the problem at inference on unseen data, but it does not completely solve it. If the reverberation is strong, a model trained without reverberated data gives poor results, with every word blurred. I can't find the old model; I'm training a new one and expect to share an example in early May. Thank you!

980202006 commented 2 years ago

@Kristopher-Chen fujindemi@gmail.com

Maybe you want to check whether your speaker-classification discriminator has collapsed. Training this discriminator requires care; I think it is better not to let its loss drop toward 0, but to keep it at a balanced value.

MMMMichaelzhang commented 2 years ago

@Kristopher-Chen the original model was not proposed to tackle cross-lingual voice conversion, so you may need to train an ASR model that works for both English and Chinese (e.g., using IPAs) and train a model with both English and Chinese data. The ASR training code will be made available soon, at the latest in late May.

How is the ASR training code progressing now? I am really looking forward to it.

yl4579 commented 2 years ago

@MMMMichaelzhang It is available here: https://github.com/yl4579/AuxiliaryASR

Charlottecuc commented 2 years ago

@MMMMichaelzhang It is available here: https://github.com/yl4579/AuxiliaryASR

Hi. How is the JDC code progressing? Thank you very much~

yl4579 commented 2 years ago

@Charlottecuc I'm still working on it, I'll create another repo probably by this week.

yl4579 commented 2 years ago

@Charlottecuc The training code for F0 model is available now: https://github.com/yl4579/PitchExtractor

CrackerHax commented 1 year ago

I trained on a set from an audiobook: over 4000 samples from a single voice, trained for 110 epochs. When I generate, it sounds like the source audio, not the trained voice or the style. Any idea what the problem could be? Do I just need a lot more training, or what? [image]

yl4579 commented 1 year ago

@CrackerHax Your loss becomes NaN, so the model is broken. This is likely caused by bad normalization, because some value exceeds the float16 maximum (about 65504). See https://github.com/yl4579/StarGANv2-VC/issues/6#issuecomment-1098692527

CrackerHax commented 1 year ago

@CrackerHax Your loss becomes NaN, so the model is broken. This is likely caused by bad normalization, because some value exceeds the float16 maximum (about 65504). See #6 (comment)

I trained again with fp16=false and still got NaN (at the same epoch as with fp16=true). The only change I made in the config file was that it's a single voice (num_domains: 1). The dataset is about 4000 samples at 24000 Hz, and I was training from scratch (no transfer learning).

CrackerHax commented 1 year ago

@CrackerHax Your loss becomes NaN, so the model is broken. This is likely caused by bad normalization, because some value exceeds the float16 maximum (about 65504). See #6 (comment)

I did some transfer learning with 20 voices on the default model and it worked fine.

Liujingxiu23 commented 1 year ago

@980202006 @yl4579 Your discussion is very enlightening, but as a beginner I can't fully understand all of it. My task is cross-domain singing voice conversion with only four speakers: speakers 1 and 2 are singers with only singing data, and speakers 3 and 4 have only speech data. What I want to do is only 1/2 --> 3/4, i.e., to give speakers 3 and 4 singing output. All speakers are Chinese. What should I do to improve the results?

  1. Can I remove the style encoder and the map_encoder and just use a one-hot speaker embedding? Will it help?
  2. Should I remove loss_f0_sty?
  3. How do the current ASR and F0 models perform on singing data? Is it necessary to retrain these two models?

Do you have any other suggestions? Thank you again.

MMMMichaelzhang commented 1 year ago

I set num_domains=1 and ran into the same problem. Have you solved it? @CrackerHax