Some questions about implementation details

TOGA101 commented 1 year ago

1. Does SLMGAN use R1 regularization like StarGANv2-VC? If so, how is it applied? If not, should I use another technique to stabilize GAN training?
2. After epoch 20, the SLM-based discriminator joins training. Does the mel-based discriminator need to drop out of training at that point?
3. When using WavLM for the training objectives, what should I pass to the attention_mask parameter of its forward function? Should I leave it unset, or generate masks based on the length of the source audio?
4. Does training on more speakers than the 89 in VCTK lead to better performance?
5. When will the official implementation be released?

Thank you.

yl4579 commented 1 year ago

1. It uses R1 just as before. The discriminators are exactly the same as those in StarGANv2-VC; only the inputs are replaced with WavLM features.
2. No, the mel discriminator is still needed because WavLM was trained on 16 kHz audio while the dataset is at 22.5 kHz. If you remove the mel discriminator, the generated samples will have large distortions in frequencies above 16 kHz, since nothing discriminates them.
3. I left it alone because I didn’t pad the clips in the dataloader.
4. Yes, but you also have to increase the model size and hidden size (for example, style vector size from 64 to 256 and number of layers from 3 to 6); otherwise the performance is no better, or even worse.
5. It will be released after I finish the StyleTTS 2 code, which I’m currently having trouble fixing.
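
For readers who want to see what the R1 answer amounts to in code, here is a minimal PyTorch sketch of the zero-centered gradient penalty on real inputs, in the form commonly used by StarGAN v2 style training loops. It is not the SLMGAN code: `toy_discriminator`, the feature shapes, and `lambda_reg` are hypothetical placeholders, and the random tensor stands in for WavLM features (or mel spectrograms) of real audio.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def r1_penalty(d_out: torch.Tensor, x_real: torch.Tensor) -> torch.Tensor:
    """R1 penalty: 0.5 * mean over the batch of ||dD(x)/dx||^2 on real inputs."""
    batch_size = x_real.size(0)
    (grad,) = torch.autograd.grad(
        outputs=d_out.sum(),
        inputs=x_real,
        create_graph=True,   # keep the graph so the penalty itself can be backpropagated
        retain_graph=True,
        only_inputs=True,
    )
    return 0.5 * grad.pow(2).reshape(batch_size, -1).sum(dim=1).mean()


# Hypothetical stand-in discriminator over (batch, feature_dim, frames) inputs.
toy_discriminator = nn.Sequential(
    nn.Conv1d(8, 16, kernel_size=3, padding=1),
    nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 1),
)

torch.manual_seed(0)
real_feats = torch.randn(4, 8, 50)  # placeholder for features of real audio
real_feats.requires_grad_()         # required so autograd can differentiate w.r.t. the input

logits_real = toy_discriminator(real_feats).squeeze(1)
adv_loss = F.binary_cross_entropy_with_logits(
    logits_real, torch.ones_like(logits_real)
)

lambda_reg = 1.0  # hypothetical weight; take the real value from your config
d_loss_real = adv_loss + lambda_reg * r1_penalty(logits_real, real_feats)
d_loss_real.backward()  # gradients now include the penalty term
```

The penalty is computed only on the real branch of the discriminator loss. If the features come from a frozen WavLM, detaching them before calling `requires_grad_()` keeps the penalty from backpropagating into the feature extractor.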
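
On the attention_mask question: if your dataloader does pad clips to a common length (unlike the setup described above), one option is to build a sample-level mask from the clip lengths. The sketch below assumes the Hugging Face `transformers` implementation of WavLM, whose forward accepts an `attention_mask` of shape (batch, samples); `lengths_to_attention_mask` is a hypothetical helper name, and the tiny clips are placeholders for real waveforms.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def lengths_to_attention_mask(lengths: torch.Tensor, max_len=None) -> torch.Tensor:
    """Return a (batch, max_len) mask: 1 for real samples, 0 for padding."""
    if max_len is None:
        max_len = int(lengths.max())
    positions = torch.arange(max_len, device=lengths.device)
    return (positions.unsqueeze(0) < lengths.unsqueeze(1)).long()


# Placeholder clips of different lengths (real clips would be 16 kHz waveforms).
clips = [torch.randn(n) for n in (5, 3, 1)]
lengths = torch.tensor([clip.numel() for clip in clips])

padded = pad_sequence(clips, batch_first=True)  # zero-pads to shape (3, 5)
mask = lengths_to_attention_mask(lengths, max_len=padded.size(1))

# Assumed usage with a loaded model (not executed here):
#   features = wavlm(input_values=padded, attention_mask=mask).last_hidden_state
```

When every clip in a batch has the same length, there is no padding to mask and the argument can be left unset, which matches the answer above.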

TOGA101 commented 1 year ago

OK, thanks.
