yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
MIT License

Some doubt about any to any voice conversion #6

Open 980202006 opened 2 years ago

980202006 commented 2 years ago

Hi, thanks for this project. I have tried removing the domain information from the style encoder, which does have some effect and can generate natural-sounding speech, but there are the following problems:

  1. Low similarity with target speaker
  2. The sound quality decreased significantly. Reconstruction (feeding the original audio itself to the style encoder) works noticeably better.

Data used:

  1. 60 speakers: 20 English speakers, 6 Chinese singers, 1 singer of English and Korean songs, 12 English singers, and the rest Chinese speech.

Batch size: 32 (8 per GPU)

Can you provide some suggestions, whether on the data or the model?
980202006 commented 2 years ago

In addition, the mapping network trains even worse.

980202006 commented 2 years ago

Should I remove the mapping network if I only use reference audio?

yl4579 commented 2 years ago

What you are asking is an open research question that nobody has an answer to at this point, but I will give you my two cents on this issue. It is only a discussion, not meant to provide any viable solutions.

In general, there are mainly two ways to do voice conversion: the first disentangles the speaker information from the speech content and then re-synthesizes the content with the target speaker's characteristics, while the second converts directly with an encoder-decoder (the approach taken here) without explicit disentanglement.

The first method usually suffers from poor sound quality, because it is difficult to completely disentangle speakers from speech while keeping enough information to reconstruct the speech with high quality (unless you use text labels, which makes it a TTS system and thus impossible to run in real time). The latter suffers from dissimilarity, as the input speaker information often leaks into the decoder. This paper introduces an adversarial classifier loss to mitigate the second problem, so we can guarantee the converted results sound similar to the target speaker for seen input speakers, and sometimes for unseen input speakers, while maintaining a reasonable degree of naturalness in the synthesized speech.

However, when it comes to zero-shot conversion, the trick of the adversarial classifier loss is no longer applicable, because such a classifier is not even able to find patterns for fewer than a hundred speakers, let alone the thousands of speakers usually required to train zero-shot conversion models. In addition, if you read the original StarGAN v2 paper, you will see that the style encoder is trained only to reconstruct the image. Hence it works well for reconstruction but poorly for conversion when the disentanglement in the encoder is insufficient and when there are so many speakers that the style space becomes extremely complicated and the discriminator loses track of bad samples from the generator.

That is to say, if you want to do zero-shot conversion, you will need to work heavily on improving the current discriminator settings. For example, build a set of discriminators, each of which only works on a subset of speakers, or use speaker embeddings to help the model set the right goals for discrimination.

You can instead disentangle the input speakers as much as possible and try to reconstruct the speech with the given style. There are several ways of disentangling the input speaker information, for example, Huang et al. 2020. Another way is to use speaker-agnostic features such as PPG and F0 to reconstruct the speech, but spoiler alert: these features are usually not good enough to synthesize natural-sounding speech.

Of course, if you can find a way to make the adversarial classifier work in the zero-shot setting while keeping the same sound quality, I believe it would deserve a publication at a top machine learning conference such as NIPS or ICML.

980202006 commented 2 years ago

Thanks! I will try adding a multi-band loss like HiFi-GAN's, and the perceptual entropy loss from "Sequence-to-Sequence Singing Voice Synthesis with Perceptual Entropy Loss". If there is progress, I will share it with you as soon as possible.

980202006 commented 2 years ago

Hello, I recently tried some solutions to achieve any-to-any voice conversion. Simply increasing the number of speakers gives the best result so far. I am now trying to use an x-vector as the style encoder. Is there anything I need to pay attention to? In addition, I want to try cross-domain conversion, such as singing, but I encountered a problem: when the F0 of the source is low and the F0 of the target is high, the F0 jitters. Also, it is difficult to further improve the similarity with the target speaker or singer. Are there any suggestions for improvement?

980202006 commented 2 years ago

In addition, you mentioned that when there are too many speakers, the speaker discriminator has trouble converging. Could its loss be replaced with some other loss?

yl4579 commented 2 years ago

Sorry for the late reply. I hope you've got some good results using x-vectors, though I believe it would not work better than the style encoder alone, because an x-vector carries much less information about the target speaker than the trained style encoder does.

The jittering F0 is probably caused by how the F0 features are processed by the encoder. They are only processed by a single ResBlock, which is unlikely to remove all the input F0 information. The subsequent AdaIN blocks then have to transform these low-pitch features into high-pitch features, which is difficult and inevitably loses detailed information, hence the jitter. My suggestion is to add a few more instance normalization layers to process the F0 features, so that the features fed into the decoder hopefully contain only the pitch curve instead of the exact F0 value in Hz, which is what the model was trained for.
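
A minimal sketch of what such an extra normalization stage could look like (the channel count of 256 and this exact layer layout are illustrative assumptions, not the repo's code):

```python
import torch.nn as nn

# Extra processing of the F0 features before they reach the decoder:
# instance normalization removes the per-utterance mean/scale, so mostly the
# relative pitch curve (rather than the absolute F0 level) is passed on.
f0_refine = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
    nn.InstanceNorm2d(256),
    nn.LeakyReLU(0.2),
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
    nn.InstanceNorm2d(256),
    nn.LeakyReLU(0.2),
)
```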

The problems of low similarity with a large number of speakers are probably caused by the limited capacity of the discriminators. I do not have any good suggestions for you, but you may try something like large hypernetworks that generate the weights of a discriminator for each individual speaker after some shared layers, to further process speaker-specific characteristics. This can also be applied to the mapping network. The basic idea is to make the discriminators powerful enough to memorize the characteristics of each speaker. Another very simple way is to have multiple discriminators, each of which only acts on a specific set of speakers: for example, discriminator 1 is trained on speakers 1 to 10, discriminator 2 on speakers 11 to 20, and so on. A rough sketch of that grouping is below.
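
A rough sketch of the grouping idea (assuming `make_discriminator` builds a per-group discriminator that takes a mel and a within-group speaker index; all names are placeholders):

```python
import torch.nn as nn

class GroupedDiscriminator(nn.Module):
    """Route each speaker to one of several discriminators, e.g. speakers 0-9
    to discriminator 0, speakers 10-19 to discriminator 1, and so on."""
    def __init__(self, make_discriminator, num_speakers, speakers_per_group=10):
        super().__init__()
        num_groups = (num_speakers + speakers_per_group - 1) // speakers_per_group
        self.speakers_per_group = speakers_per_group
        self.discriminators = nn.ModuleList(
            [make_discriminator() for _ in range(num_groups)]
        )

    def forward(self, x, speaker_id: int):
        # speaker_id is assumed to be a plain int here; with batched tensors
        # you would split the batch by group first.
        group = speaker_id // self.speakers_per_group
        local_id = speaker_id % self.speakers_per_group
        return self.discriminators[group](x, local_id)
```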

980202006 commented 2 years ago

Thank you! I will try it. If there is any progress, I will share it with you as soon as possible.

980202006 commented 2 years ago

Using multiple discriminators is effective: when the model converges, the sound quality on unseen speakers is better, and the similarity to the target speaker is better than with the original setup. If I use an x-vector, the model can capture voice characteristics that do not appear in the training set, but the sound quality is worse, and capturing those characteristics alone does not improve the overall similarity very much. If I use the original style encoder, many unseen voice characteristics are lost. Can you give some optimization suggestions? In addition, I would like to ask how to fine-tune on an unseen speaker based on the trained model, especially since the discriminator does not reserve an index for unseen speakers.

yl4579 commented 2 years ago

I think it depends on the number of speakers you have in the training set and what the latent space of your speaker embedding looks like. Usually a multivariate Gaussian assumption is what people use, so you may want to add an additional loss term on the latent variables from the style encoder or x-vector to enforce the underlying Gaussian distribution (an L2 norm would do the job). When you say many unseen sound characteristics are lost, what exactly do you mean by "sound characteristics"? Can you give some examples of the "lost characteristics" versus what the "characteristics" should actually be like?

Another way to test whether the latent space actually encodes unseen speakers in a form readily usable by the generator is to use gradient descent to find the style that reconstructs the unseen speaker's speech. That is, after training your model, you fix everything, make the style vector a trainable parameter, and use gradient descent to minimize the reconstruction loss between the input mel and the output mel of unseen speakers. If the loss does not converge to a reasonable value, it means there is no style in the learned space with which the generator can faithfully reconstruct the unseen speakers' speech.
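
A minimal sketch of that check, assuming a trained generator called as `generator(mel, style)` with everything frozen except a trainable style vector (the call signature and hyperparameters are illustrative; the real generator in this repo also takes F0 features):

```python
import torch
import torch.nn.functional as F

# generator and target_mel (an unseen speaker's mel) are assumed to exist already.
style_dim = 64
style = torch.zeros(1, style_dim, requires_grad=True)   # the only trainable parameter

for p in generator.parameters():                         # freeze the trained generator
    p.requires_grad_(False)

optimizer = torch.optim.SGD([style], lr=0.1)

for step in range(1000):
    recon = generator(target_mel, style)                 # reconstruct the unseen speaker's mel
    loss = F.l1_loss(recon, target_mel)                  # reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())

# If the loss converges to a reasonable value, some style in the learned space
# can faithfully reconstruct this unseen speaker's speech.
```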

One easy way to finetune for unseen speakers is to simply remove the last projection layer that converts the 512 channels to the number of speakers. Another, more complicated way is to use a hypernetwork or weight AdaIN (see Chen et al.) so that the discriminator is speaker-independent and only style-dependent. You will need to train a style encoder for the discriminator too, though, or use a pretrained x-vector for that purpose.

980202006 commented 2 years ago

https://drive.google.com/drive/folders/1lQO7ZtWN6MvyZeMFwoB2L0AjDPL_9V1p?usp=sharing Ref_wav is the target wav. Y_out is the output of the model. 1300y_out is obtained by replacing the style encoder output with trainable parameters and training them with stochastic gradient descent for 1300 steps. I found that the gradient is mainly concentrated in the InstanceNorm2d layers of the decoder. As you can hear, compared to ref_wav, some of the speaker's voice characteristics are lost in y_out.

yl4579 commented 2 years ago

I think 1300y_out is very similar to Ref_wav, so the good news is that the generator is capable of reconstructing unseen speakers without any further training. Have you tried using the style obtained with gradient descent to convert other input audio? Does it work? If so, at least the model can do one-shot learning with a few iterations of gradient descent.

You're right that Y_out does not sound very similar to Ref_wav, though. Is this the result from x-vectors or from the style encoder without speaker-specific branches? If the style obtained from gradient descent works with other inputs, it means the problem is not in the generator or the discriminator but in the style encoder, which is unable to place unseen speakers in the style embedding space. If the style does not work with other inputs, it means the encoder of the generator may have overfitted to reconstruct the input, so disentangling the input speaker information may be necessary.

980202006 commented 2 years ago

1300y_out is the result with the style encoder. 1300x_vector_out is the result with the x-vector. I tested the style obtained from gradient descent on another song: 0y_out_huangmeixi_with_f0 is the result from the model, and 1300y_out_huangmeixi_with_f0 is the result with the style obtained from gradient descent. If you give the wrong F0, the output is out of tune on individual notes rather than all of them; 0y_out_huangmeixi_error_f0 is the model output using the F0 of the previous song on the current song. In other words, the encoder will also encode F0. In addition, the style encoder also affects the level of environmental noise in the output. The reconstruction loss (L1 loss) using SGD is as follows, printed once every 100 steps: [image]

yl4579 commented 2 years ago

This looks promising, so the problem is probably in the style encoder then. Can I ask how many speakers you used to train the style encoder, how many discriminators there were, and how you assigned those discriminators to the speakers?

By the way, I didn't see "0y_out_huangmeixi_error_f0"; maybe you didn't upload it, so I'm not sure what you meant by "In other words, the encoder will also encode F0."

It is expected that the style encoder encodes the background noise; it is actually the most obvious thing it will encode given how the loss is set up. However, if you don't want it to encode the recording environment, you can use a contrastive loss to make it noise-robust. That is, generate a noise-degraded copy of your audio and make the style encoder encode both of them into the same style vector. This is also usually how speaker embeddings like x-vectors are trained.
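
A sketch of a simple positive-pair version of that idea (the `style_encoder` and the noise-degraded copy are placeholders; the degradation could be additive noise or reverberation applied before computing the mel):

```python
import torch.nn.functional as F

def noise_robust_style_loss(style_encoder, mel_clean, mel_noisy):
    """Pull the styles of a clean utterance and its noise-degraded copy together,
    so the style encoder becomes robust to the recording environment."""
    s_clean = style_encoder(mel_clean)
    s_noisy = style_encoder(mel_noisy)
    return F.l1_loss(s_noisy, s_clean.detach())
```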

980202006 commented 2 years ago

Sorry for the late reply. A total of 117 speakers are used as the dataset. There may be some noise in the data, including mouse clicks, pink noise, etc., but it is not loud. Twenty are English speech, and the rest are singing. I re-uploaded "0y_out_huangmeixi_error_f0". Thank you!

980202006 commented 2 years ago

One discriminator for every 10 speakers, so there are 12 discriminators. I haven't had time to try other speaker-to-discriminator assignments. I also did not try sharing parameters between the discriminators.

yl4579 commented 2 years ago

I have listened to the "0y_out_huangmeixi_error_f0" you uploaded, and if I understand correctly, you think the style is somehow "overfitted" in the sense that it also encodes the F0 of the reconstruction target? I think this is not true, because a vector of size 64 can't encode a whole F0 curve. However, one training objective is that the average pitch of the reference equals the average pitch of the converted output, so the style definitely learns the average F0. It also encodes how the pitch deviates from the input F0, because the style diversification loss also tries to maximize the F0 difference between two different styles. Hence, the style also encodes some information about the speaking/singing style of the target, which is desirable in our case.

The discriminator settings seem fair, but how did you train the style encoder? Are you still using the unshared linear projections, or is the style encoder now independent of the input speakers? What about the mapping network? Did you remove the mapping network entirely?

980202006 commented 2 years ago

Sorry for the late reply. I removed the mapping network. I use the original network with the unshared linear projections. Have you tried the improvements from StyleGAN2? According to my observation, if the sample input is fixed and optimized continuously with SGD, the gradient is mainly concentrated in the instance normalization layers. In addition, can the bCR-GAN loss be replaced by StyleGAN2-style adaptive discriminator augmentation (ADA)?

980202006 commented 2 years ago

There is a problem with modeling breathing sounds; is there a way to deal with it?

yl4579 commented 2 years ago

I don't think StyleGAN2 is that relevant to StarGANv2, because the main difference in StyleGAN2 is that they changed the instance normalization to drop the affine component (i.e., only normalize and learn the standard deviation, not the mean). The same setting hurts performance in StarGANv2, since our model decodes from a latent space encoded by the encoder instead of from noise, so it's not really that relevant. I believe StyleGAN3 is more relevant if you are willing to implement an aliasing-free generator.

As for ADA, I was not able to find a set of augmentations and probabilities such that no leaks occur, which is the main reason I was using bCR-GAN. The augmentation doesn't matter that much if you have enough data, so it doesn't really help on the VCTK-20 dataset. I put it there only for cases where some speakers have much less data than others (like only 5 minutes instead of 30 minutes as in VCTK). It does help with emotional conversion and noisy datasets, though.

I didn't encounter any problems with breath sounds. You can listen to the demo here, and the breath can be heard clearly. I guess your dataset is probably noisy, so the breath sounds were filtered out as noise by the encoder. In that case, you may want to intentionally corrupt your input with audio augmentation.

yl4579 commented 2 years ago

Back to the style encoder problem, how do you encode unseen speakers if you have unshared components?

980202006 commented 2 years ago

Sorry, there is a misunderstanding in the description here. I converted the non-shared mapping to a shared mapping, as shown in the code below.

```python
import torch
import torch.nn as nn
# ResBlk is the residual block from the repo's models.py

class StyleEncoder(nn.Module):
    def __init__(self, dim_in=48, style_dim=48, num_domains=2, max_conv_dim=384):
        super().__init__()
        blocks = []
        blocks += [nn.Conv2d(1, dim_in, 3, 1, 1)]

        repeat_num = 4
        for _ in range(repeat_num):
            dim_out = min(dim_in*2, max_conv_dim)
            blocks += [ResBlk(dim_in, dim_out, downsample="half")]
            dim_in = dim_out

        blocks += [nn.LeakyReLU(0.2)]
        blocks += [nn.Conv2d(dim_out, dim_out, 5, 1, 0)]
        blocks += [nn.AdaptiveAvgPool2d(1)]
        blocks += [nn.LeakyReLU(0.2)]
        self.shared = nn.Sequential(*blocks)

        # original unshared per-speaker projections:
        # self.unshared = nn.ModuleList()
        # for _ in range(num_domains):
        #     self.unshared += [nn.Linear(dim_out, style_dim)]
        self.unshared = nn.Linear(dim_out, style_dim)

    def forward(self, x, y):
        h = self.shared(x)
        h = h.view(h.size(0), -1)
        # original n-speaker selection by domain index y:
        # out = []
        # for layer in self.unshared:
        #     out += [layer(h)]
        # out = torch.stack(out, dim=1)  # (batch, num_domains, style_dim)
        # idx = torch.LongTensor(range(y.size(0))).to(y.device)
        # s = out[idx, y]  # (batch, style_dim)
        s = self.unshared(h)
        return s
```

980202006 commented 2 years ago

Is it possible to add a wavelet transform to the model, for example by referring to the design of SWAGAN's generator?

yl4579 commented 2 years ago

@980202006 It's definitely possible to add a wavelet transform to the model, and it could theoretically make a big difference, because the high-frequency content is what makes speech clear even when the mel-spectrograms look visually the same. However, I can't say exactly how much high-frequency content there is in a mel-spectrogram, because the resolution of mel specs is usually very low, and what vocoders do is exactly to recover the lost high-frequency information. I think fine-tuning with HiFi-GAN would probably do the same thing, but you can definitely try it and see if it helps.

yl4579 commented 2 years ago

Back to the style encoder problem: I think you removed the unshared linear layers (N of them, where N is the number of speakers) and replaced them with a single linear projection shared by every speaker. I have tried this approach too, but it seems like the style encoder has a hard time encoding the speaker characteristics and usually returns a style vector that sounds like a combination of speakers seen during training. However, if you use simple gradient descent to find the style that reconstructs an unseen speaker, it is usually possible to find such a style, and it preserves most of the characteristics during reconstruction, exactly as you have shown here. In fact, in my case the style encoder sometimes even fails to find a style that reconstructs seen speakers. My hypothesis is that the shared projection lacks the power to separate different speakers, while unshared projections force the model to learn more about the speaker characteristics.

One way to verify this is to fix both the self.shared part of the style encoder and the generator and train a linear projection for each speaker to reconstruct the given input, and separately retrain everything from scratch while only training the style encoder with the original recipe (i.e., with the unshared linear projections). If my hypothesis is correct, the style encoder trained with one shared linear projection will be worse than the one trained with N linear projections in terms of encoding speaker characteristics, and we can proceed from there if that is the case.
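
A compact sketch of the first part (freeze the shared trunk and the generator, train only per-speaker linear projections; `feature_dim`, `style_dim`, `num_speakers`, `loader`, and the `generator(mel, style)` call are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# style_encoder.shared and generator are assumed to be trained and frozen already.
projections = nn.ModuleList(
    [nn.Linear(feature_dim, style_dim) for _ in range(num_speakers)]
)
optimizer = torch.optim.Adam(projections.parameters(), lr=1e-4)

for mel, speaker_id in loader:                         # speaker_id: (batch,) of ints
    with torch.no_grad():
        h = style_encoder.shared(mel).flatten(1)       # frozen shared trunk
    styles = torch.stack(
        [projections[i](h[b]) for b, i in enumerate(speaker_id.tolist())]
    )
    recon = generator(mel, styles)                     # generator weights stay frozen
    loss = F.l1_loss(recon, mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```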

980202006 commented 2 years ago

In my model, I regard the style encoder as a speaker-information extraction model, that is, it extracts a high-dimensional representation of the speaker from the mel instead of fitting a specific speaker vector space. I prefer to use a point, rather than a space, to represent a speaker, which may result in the loss of some information. I found that the original style encoder has an average pooling operation, which makes it very similar to an x-vector or d-vector. The problem may be caused by your insufficient number of speakers; I used speech and singing data with at least 70 speakers. I will try the effect of a single linear layer.

980202006 commented 2 years ago

Thank you!

Kristopher-Chen commented 2 years ago

Hi, guys! I appreciate your efforts towards the any-to-any voice conversion task. In my understanding, when arbitrary input speakers are expected, enlarging the number of training speakers may help. In that case, more discriminators (specifically the classifiers) are designed for each group of speakers (like one per every 10 speakers).

Then my question is: do the other parts, such as the style encoder and the generator (both encoder and decoder), need to be enlarged to cover more diversity? So far, you seem to be discussing the style encoder part, but I'm not very clear about whether it needs a modification like the one made to the discriminators. Are similar changes, with more style encoders for each group of speakers, needed?

Two more fundamental problems: 1) It seems the style encoder works better than the mapping network. The input to the mapping network is Gaussian white noise, and the network tries to model the speakers' style distribution. Maybe more input dimensions would improve its performance? Of course, for now the style encoder may be the primary concern.

2) I found some sound quality degradation in the synthesis results. Apart from the intelligibility, a harmonic artifact on unvoiced sounds is quite apparent. But the copy-synthesis results of the vocoder also suffer from it, so I believe this is a problem of Parallel WaveGAN. Have you paid attention to this phenomenon?

Recently, I'm looking at the possibility of any-to-any VC applications. Hope to keep in touch! Thanks a lot!

Kristopher-Chen commented 2 years ago

Should I remove the mapping network if I only use reference audio?

Hi, have you tried removing the mapping network? Is it effective for improving the quality?

Thanks a lot!

yl4579 commented 2 years ago

@980202006 70 speakers are definitely not enough, so increasing the number of speakers would probably help. Separate projections may or may not make a difference, and that's what I think we want to figure out. In my experience, it did make a big difference, but that might also be because I didn't train for enough iterations for the cycle loss to converge (removing the per-speaker projections probably slows down convergence, and the power of the style encoder may not actually be compromised).

yl4579 commented 2 years ago

@Kristopher-Chen Thanks for your comments, I will try to address these problems to the best of my knowledge.

The encoder part of the generator is style-free, which means it does not depend on the speaker. The point of the encoder is to extract all aspects of the input speech, such as phonemes, prosody, and energy, and encode them in a speaker-agnostic state via instance normalization. Each channel ideally encodes some aspect of the input speech, and for a single aspect there could be multiple redundant channels representing the same information, which will be used later by the decoder depending on the speaker.

The point of the decoder is to map these speaker-agnostic features back into the mel-spectrogram of some speaker via AdaIN. AdaIN is the same as instance normalization, except that each normalized channel is then re-scaled and re-shifted with values computed from the input style vector. Here we add back the speaker information missing from the encoder output and gradually come back to the mel-spectrogram of the target speaker.
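
For reference, a generic AdaIN block looks roughly like this (a sketch; the repo's implementation may differ in details):

```python
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize each channel per sample,
    then scale and shift it with parameters predicted from the style vector."""
    def __init__(self, style_dim, num_features):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.fc = nn.Linear(style_dim, num_features * 2)

    def forward(self, x, s):
        h = self.fc(s)                             # (batch, 2 * num_features)
        gamma, beta = h.chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # broadcast over frequency and time
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta
```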

Whether to increase the number of channels in the generator depends on whether you think all aspects of human speech can be represented by 512 feature maps of size 20 x N (where N is the input length / 4); this is the current shape of the encoder output. I believe this is fair enough, because I have seen a lot of redundant feature maps in the encoder output (which potentially leak the speaker information), but you may find it helpful to increase the number of layers or channels depending on your needs.

Back to the style encoder: a lot of papers in speech synthesis have shown that there are only four degrees of freedom in human speech, which are phonemes (with duration), pitch, speaker identity, and energy. The encoder provides the phonemes, pitch, and energy; the decoder only needs to figure out how these change with the speaker. This is simple for seen speakers, because we have already seen them during training, and even a one-hot encoding can work as the style vector. For unseen speakers, however, it requires the style vectors (speaker embeddings) to form a smooth manifold with respect to speaker identity for the generator. There is no guarantee the style encoder will learn such a smooth manifold, and increasing the capacity of the model can only make it overfit more on seen styles. I think the direction should be regularizing the style encoder rather than giving it more parameters.

The mapping network just samples from the learned manifold, so it basically picks a random point in the learned speaker space for a particular speaker. It cannot be used for zero-shot conversion, and it usually has limited expressiveness. It can be removed completely without affecting the performance.

As for the "harmonic phenomenon of the unvoiced sound", I'm not sure what you are referring to; can you provide some examples for analysis?

Kristopher-Chen commented 2 years ago

Hi, could you please refer to this file, where the last unvoiced /s/ sounds harmonic? https://drive.google.com/file/d/10ORxlHx9QlZcxBkmeSAtUoRzdL9uiS76/view?usp=sharing

980202006 commented 2 years ago

I checked my code and found that I did not use the adversarial classifier loss. The key point is new_y_trg in the code:

```python
if use_adv_cls:
    out_de = nets.discriminator.classifier(x_fake)

    new_y_trg = y_trg % domain_per_dis
    out_de = nets.discriminator.multi_classifier(x_fake, new_y_trg)
    loss_real_adv_cls = F.cross_entropy(out_de[y_org != y_trg], new_y_trg[y_org != y_trg])
    out_de[y_org != y_trg]
```

This may be the reason why the style encoder was successfully trained. I am trying to change new_y_trg to new_y_org.

Kristopher-Chen commented 2 years ago

@yl4579 I also tried training with 2 speakers. The similarity is obviously improved, with some degradation in sound quality (probably due to lack of training data). However, the pitch is still off from the target speakers; that is, the female sounds a little dull, and the male sounds brighter than the target. And, as expected, performance on unseen speakers degraded significantly.

I compared seen and unseen speakers' similarity with the target speakers (a dataset with 20 speakers), and I think the similarity is not as good as expected, with the source speaker's features leaking; in particular the pitch is not consistent, as it seems to be a smoothed average over all the speakers.

So, to summarize: fewer training speakers improve similarity with some degradation in sound quality, and lead to poor unseen-speaker performance. More speakers work the opposite way: they improve the sound quality but degrade the similarity, and give more consistent performance on unseen speakers, though still far from ideal. And one more problem is the pitch similarity.

I think improving the similarity is the first step, and then enlarging the training data. I tried the ideas in 'PitchNet' to disentangle the speaker and pitch information from the encoder output, but it does not seem to work well on a 2-speaker dataset. And I'm wondering how to direct the decoder to synthesize pitch, since the references are only the source's F0 information and the target style encoder, which means there is no direct pitch information from the targets.

How can the source speaker's information be better disentangled, and how can the pitch similarity be improved? Hope to hear from you!

yl4579 commented 2 years ago

@Kristopher-Chen One way to do that is to feed in only the normalized F0 curve (not the 512-channel F0 features). The normalized F0 curve theoretically contains no speaker information because it doesn't contain the absolute pitch in Hz, only the curve. On the other hand, the 512-channel F0 features used in the original implementation do leak a lot of input speaker information: you can train a simple AlexNet to classify the input speaker using only these features and get more than 90% classification accuracy. I've tried this modification and it seems to work very well for both singing and speech. It was not used in the original paper because I only tried it after the paper was published.
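
For instance, one simple way to get such a speaker-agnostic curve is to divide the F0 contour by its mean over voiced frames (this particular normalization is an assumption; z-normalizing log-F0 would work similarly):

```python
import numpy as np

def normalize_f0(f0: np.ndarray) -> np.ndarray:
    """Turn a per-frame F0 contour in Hz (0 = unvoiced) into a unitless
    relative pitch curve that carries no absolute pitch information."""
    voiced = f0 > 0
    out = np.zeros_like(f0, dtype=np.float64)
    if voiced.any():
        out[voiced] = f0[voiced] / f0[voiced].mean()
    return out
```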

980202006 commented 2 years ago

The encoder also leaks speaker information. I trained a ResNet classifier, which can correctly classify 80% of the encoder outputs. I tried adversarial speaker normalization, but I did not feel the sound quality improved.

Kristopher-Chen commented 2 years ago

@yl4579 @980202006 How do you train the classifier? By taking the encoder's or F0 network's output (after training) and then training the classifier on it? Or by using the AlexNet used for the objective evaluation? The input of the AlexNet in the paper is the spectrum or the waveform, but the encoder outputs are not the same kind of features. Or is it something like the discriminator, which is updated during the main training process?

yl4579 commented 2 years ago

@980202006 The speaker leakage is expected, but it should be fixed by the adversarial classifier applied to the decoder output. The idea is that it doesn't matter what the encoder encodes; the conversion is successful as long as you can't tell who the source is from the decoder output. I tried to remove the speaker information from the encoder, but the intelligibility decreased significantly. The input speaker information can be kept to some extent as long as the output is perceptually unrelated to the input.

@Kristopher-Chen The input to the AlexNet in the paper is the mel-spectrogram, for the objective evaluation. What I meant here was that the F0 features fed into the decoder contain too much information (much more than just F0), so they leak the speaker information. I verified this by training an AlexNet on these F0 features to classify the source speakers and getting an accuracy above 90%.

980202006 commented 2 years ago

@Kristopher-Chen http://www.interspeech2020.org/uploadfile/pdf/Mon-2-7-2.pdf I refer to this paper, which is the simplest implementation, compared to other papers like https://arxiv.org/pdf/2111.12277.pdf

980202006 commented 2 years ago

Thank you! I think the model results are not good enough because the discriminator is not strong enough. I applied Projected GAN to the ASR and F0 models but didn't get good enough results. @yl4579

980202006 commented 2 years ago

In addition, I noticed that there are similarity problems between cross-domain speakers. For example, I train on Chinese singing data, but if the target speaker is an English speaker from LJSpeech, the similarity is very low.

980202006 commented 2 years ago

In an any-to-any scenario, adversarially detecting whether there is source speaker information at the output of the decoder may make it difficult for the model to converge. Can other strategies be used?

Kristopher-Chen commented 2 years ago

@yl4579 One question about the vocoder. I evaluated the copy-synthesis results of the pre-trained models from https://github.com/kan-bayashi/ParallelWaveGAN and found that the MOS is similar for PWG and MB-MelGAN, but MB-MelGAN is much faster. An interesting thing is that later papers like HiFi-GAN seldom compare with MB-MelGAN; any reasons behind this?

Kristopher-Chen commented 2 years ago

@980202006

In addition, I noticed that there are similarity problems between cross-domain speakers. For example, I train on Chinese singing data, but if the target speaker is an English speaker from LJSpeech, the similarity is very low.

@980202006 How are you doing any-to-any mappings? Did you replace the original style encoder with a one-shot encoder like x-vector?

Kristopher-Chen commented 2 years ago

@Kristopher-Chen One way to do that is to feed in only the normalized F0 curve (not the 512-channel F0 features). The normalized F0 curve theoretically contains no speaker information because it doesn't contain the absolute pitch in Hz, only the curve. On the other hand, the 512-channel F0 features used in the original implementation do leak a lot of input speaker information: you can train a simple AlexNet to classify the input speaker using only these features and get more than 90% classification accuracy. I've tried this modification and it seems to work very well for both singing and speech. It was not used in the original paper because I only tried it after the paper was published.

@yl4579 If we replace the F0 network with the normalized F0 curve, should we also expand its dimension so it can be concatenated with the content encoder output? Something like using ResBlocks to map an input of (batchsize, 1, meldim, frames) to an output of (batchsize, 256, 10, frames)?

980202006 commented 2 years ago

@980202006

In addition, I noticed that there are similarity problems between cross-domain speakers. For example, I train on Chinese singing data, but if the target speaker is an English speaker from LJSpeech, the similarity is very low.

@980202006 How are you doing any-to-any mappings? Did you replace the original style encoder with a one-shot encoder like x-vector?

I just use the mel as input and remove the 'y'. An x-vector can also be used as the style encoder.

yl4579 commented 2 years ago

@Kristopher-Chen No, it should just be 1-d. I have a new paper submitted with this idea, so if it's accepted I will disclose more information about it, but the basic idea is that you only use the normalized F0 curve as the input.

980202006 commented 2 years ago

@yl4579 Could you please tell me the title of the paper when it is released? I want to follow it.

skol101 commented 2 years ago

@Kristopher-Chen Were you able to train MB-MelGAN from the mentioned repo? How did you manage to train it so that it works with StarGANv2-VC? I can see that the main issue is preprocessing incompatibility.

ZhaoZeqing commented 2 years ago

@yl4579 @Kristopher-Chen How do you extract the F0 curve? I extract F0 with pyworld DIO and determine the silent (unvoiced) frames from F0 alone; I also normalized F0. Is it OK to do it this way? Looking forward to your reply :)
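
For reference, a typical DIO-based extraction looks roughly like this (pyworld expects float64 audio; the stonemask refinement step and the file name are just an example):

```python
import numpy as np
import soundfile as sf
import pyworld as pw

wav, sr = sf.read("sample.wav")            # hypothetical input file
wav = wav.astype(np.float64)
f0, t = pw.dio(wav, sr, frame_period=5.0)  # coarse F0 in Hz, 0 marks unvoiced frames
f0 = pw.stonemask(wav, f0, t, sr)          # refine the coarse DIO estimate
```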