yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

Any-to-many VC: how to improve speech intelligibility for arbitrary inputs? #51

Open Kristopher-Chen opened 2 years ago

Kristopher-Chen commented 2 years ago

When testing arbitrary inputs in any-to-many VC cases, the speech intelligibility sometimes drops: some phonemes are not pronounced well or sound blurred. It seems there are no other explicit constraints on this except for the ASR (or more specifically, PPG) loss. Any ideas on how to improve this?
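(For context, the ASR/PPG loss referred to here encourages the converted speech to keep the source's linguistic content by matching intermediate ASR features before and after conversion. A minimal sketch in PyTorch, assuming an `asr_model` that exposes a `get_feature` method; the interface is an assumption, not necessarily the repo's exact API:)

```python
import torch
import torch.nn.functional as F

def asr_consistency_loss(asr_model, x_real, x_fake):
    # Match intermediate ASR (PPG-like) features of the source and the
    # converted speech so that linguistic content is preserved.
    with torch.no_grad():
        feat_real = asr_model.get_feature(x_real)  # source features, no grad
    feat_fake = asr_model.get_feature(x_fake)      # converted-speech features
    return F.smooth_l1_loss(feat_fake, feat_real)
```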

yl4579 commented 2 years ago

I believe it could simply be because there's not enough training data. Any-to-many conversion requires a lot of input data for the model to generalize well.

skol101 commented 2 years ago

And to generalise well, do we need multiple discriminators (e.g. 1 per 10 speakers), as discussed in another topic?

Kristopher-Chen commented 2 years ago

> I believe it could simply be because there's not enough training data. Any-to-many conversion requires a lot of input data for the model to generalize well.

I have already used 200 speakers, each with around 15–20 minutes of audio...

yl4579 commented 2 years ago

@skol101 I don't believe so; if it's not for any-to-any, you only need a lot of input speakers. You do not need the cycle loss in this case, because you don't really need that many output speakers. One thing you can do is modify the cycle loss to match the encoder output instead of the decoder output (i.e., the same speech should have the same encoded representation before and after conversion).
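(A minimal sketch of that encoder-level cycle loss, assuming the generator exposes an `encode` method that returns the pre-decoder representation; the names are hypothetical, not the repo's actual API:)

```python
import torch
import torch.nn.functional as F

def encoder_cycle_loss(generator, x_real, x_fake):
    # Compare the encoded (ideally speaker-independent) representations of
    # the source and the converted speech, instead of decoding back.
    with torch.no_grad():
        h_real = generator.encode(x_real)  # encoding of the source speech
    h_fake = generator.encode(x_fake)      # encoding of the converted speech
    return F.l1_loss(h_fake, h_real)
```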

@Kristopher-Chen You say it sometimes drops, so what are the cases where it drops and what are the cases that are good?

skol101 commented 2 years ago

@yl4579 cheers! How about shared projection as per https://github.com/yl4579/StarGANv2-VC/issues/6#issuecomment-968460007 ? Is it applicable to any-to-many conversion?

yl4579 commented 2 years ago

@skol101 I don't think you need this either; it is to make the style encoder speaker-independent so you can convert to any output speaker. If you are only interested in any-to-many, it is not necessary.

Kristopher-Chen commented 2 years ago

> @skol101 I don't believe so; if it's not for any-to-any, you only need a lot of input speakers. You do not need the cycle loss in this case, because you don't really need that many output speakers. One thing you can do is modify the cycle loss to match the encoder output instead of the decoder output (i.e., the same speech should have the same encoded representation before and after conversion).
>
> @Kristopher-Chen You say it sometimes drops, so what are the cases where it drops and what are the cases that are good?

@yl4579 https://drive.google.com/drive/folders/1SGBJllEvWg9a70qJf5DZhTVT_E5bl-w0 I trained with 200 Chinese speakers; there is an example here. The input was recorded on a PC, and I tried to convert it to one male and one female speaker at 50 and 120 epochs. The points are: 1) the male sounds noisier than the female; 2) by intelligibility getting worse, I mean something like what happens in the 120-epoch outputs. The interesting thing is that the 50-epoch results seem better than the 120-epoch ones. I just could not figure it out.

thsno02 commented 1 year ago

I have tried an any-to-many mapping based solely on this amazing project, and it works well for some speakers but not all. I used 10 speakers with 20 minutes of audio per speaker, and the hyperparameters are the same as the original.

At epoch 248, two speakers work excellently: both can convert Sichuan-dialect Chinese even though they were trained on Mandarin, and both can handle the any-to-many conversion task.

At epoch 466, I get 5 speakers that work perfectly, and the conversion quality for all speakers has improved a lot.

From my experience, you can keep training and wait. The training data is vitally important for this task: higher data quality tends to produce better speaker performance. However, quality alone can't guarantee better performance, since I use both Lijian Zhao and Chunying Hua as speakers, and Chunying Hua works well at epoch 248 while Lijian Zhao does not.

skol101 commented 1 year ago

@thsno02 what vocoder have you used?

thsno02 commented 1 year ago

@skol101 the original one, and I use the mapping network rather than the style encoder.

skol101 commented 1 year ago

Interesting, it was reported elsewhere that the style encoder is better at VC than the mapping network.

Also, you haven't fine-tuned the vocoder on your dataset?

thsno02 commented 1 year ago

I haven't tried any fine-tuning due to the time frame. I did a lot of experiments on model performance; my conclusion is that the mapping network tends to perform better on the any-to-many task than the style encoder, while the style encoder sometimes converts audio with more linguistic information and more fluency. Meanwhile, in my scenario, neither the mapping network nor the style encoder converts the audio with consistently high quality. This phenomenon kills me, and I have not figured it out.

There are many potential reasons for this:

Tip: I have trained for 742 epochs, but the model's generalization does not change and I still only get 2 usable speakers.
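(For readers comparing the two style sources discussed above, the difference is only in where the style code comes from. A toy sketch with shape-compatible stand-ins for the real networks, which live in models.py; all names and dimensions here are illustrative:)

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two style sources; the real networks are defined
# in models.py, these are only shape-compatible placeholders.
latent_dim, style_dim, num_domains, n_mels = 16, 64, 10, 80
mapping_network = nn.ModuleList(
    [nn.Linear(latent_dim, style_dim) for _ in range(num_domains)])
style_encoder = nn.ModuleList(
    [nn.Linear(n_mels, style_dim) for _ in range(num_domains)])

y_trg = 3  # target speaker (domain) index

# Mapping network: style code sampled from random noise for the target
# domain -- diverse, but not tied to any reference recording.
z = torch.randn(1, latent_dim)
s_map = mapping_network[y_trg](z)

# Style encoder: style code extracted from a reference utterance of the
# target speaker (here averaged over mel frames), so it tracks that recording.
x_ref_mel = torch.randn(1, n_mels, 192)  # placeholder reference mel-spectrogram
s_ref = style_encoder[y_trg](x_ref_mel.mean(dim=-1))
```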

skol101 commented 1 year ago

Have you trained them (mapping and style) both together, or separately?

thsno02 commented 1 year ago

both

1nlplearner commented 1 year ago

@Kristopher-Chen how many domains are in your discriminator, and how many discriminators did you use?

1nlplearner commented 1 year ago

> @skol101 I don't believe so; if it's not for any-to-any, you only need a lot of input speakers. You do not need the cycle loss in this case, because you don't really need that many output speakers. One thing you can do is modify the cycle loss to match the encoder output instead of the decoder output (i.e., the same speech should have the same encoded representation before and after conversion).
>
> @Kristopher-Chen You say it sometimes drops, so what are the cases where it drops and what are the cases that are good?

So, do I need to compute the loss on the encoder output before the F0 features are added? And what is the function of F0?