yxlllc / DDSP-SVC

Real-time end-to-end singing voice conversion system based on DDSP (Differentiable Digital Signal Processing)
MIT License

Which Units_Encoder is preferable? #36

Closed MuruganR96 closed 1 year ago

MuruganR96 commented 1 year ago

Hi @yxlllc, great work. Thank you!

'hubertsoft', 'hubertbase', 'hubertbase768', 'contentvec', 'contentvec768' or 'contentvec768l12'

To balance content information loss against timbre leakage, which Units_Encoder is preferable?

yxlllc commented 1 year ago

The experiment is still in progress, but the preliminary conclusion is that contentvec768l12 may give the best timbre restoration. If the goal is to change the voice in real time, however, hubertsoft works better and its articulation is noticeably clearer.
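
If you want to try both, the encoder is selected when constructing the units encoder. Here is a minimal sketch; the constructor arguments and the checkpoint path are assumptions based on the readme's pretrained-model layout, so check ddsp/vocoder.py for the exact signature:

```python
import torch
from ddsp.vocoder import Units_Encoder

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# 'contentvec768l12' for the best timbre restoration (offline use),
# 'hubertsoft' for clearer articulation in real-time conversion.
units_encoder = Units_Encoder(
    'hubertsoft',
    'pretrain/hubert/hubert-soft-0d54a1f4.pt',  # checkpoint path (assumed layout)
    16000,                                      # encoder sample rate (assumed)
    320,                                        # encoder hop size (assumed)
    device=device)
```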

MuruganR96 commented 1 year ago

Thank you, @yxlllc. I have some general questions that I need your help to clarify. Please share your suggestions.

A few questions:

  1. How can the treble problem (highly expressive shouted dialogue) be handled in conversion?
  2. How can the staccato problem be solved?
  3. How many speakers are preferable for building a multispeaker model?
yxlllc commented 1 year ago

I think the DDSP method is good at solving the treble and staccato problems, because the DSP synthesis module introduces a strong dependence on f0. It ensures that the converted vocal has exactly the same pitch as the original vocal, which is why it is called “singing” voice conversion (SVC). I think the difficulty of synthesizing “highly expressive shouted dialogue” can be considered close to that of singing voice synthesis.
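
To make that f0 dependence concrete, here is a minimal sketch (my own illustration, not this project's code) of the additive harmonic synthesizer at the core of DDSP. Every partial sits at an integer multiple of the input f0, so the output pitch tracks the source pitch exactly:

```python
import numpy as np

def harmonic_synth(f0, amplitudes, sr=44100):
    """f0: (N,) sample-level pitch in Hz;
    amplitudes: (N, K) per-harmonic amplitudes predicted by the network."""
    n_harmonics = amplitudes.shape[1]
    # instantaneous phase of the fundamental: cumulative angular frequency
    phase = 2 * np.pi * np.cumsum(f0 / sr)
    k = np.arange(1, n_harmonics + 1)                 # harmonic numbers 1..K
    # silence any harmonic that would land above Nyquist (anti-aliasing)
    alias_mask = (f0[:, None] * k[None, :]) < (sr / 2)
    sines = np.sin(phase[:, None] * k[None, :])       # (N, K) harmonic bank
    return np.sum(alias_mask * amplitudes * sines, axis=1)
```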

In fact, the more advanced Diff-SVC and So-VITS-SVC projects mentioned in the readme both use NSF-HifiGAN as the vocoder. NSF (Neural Source Filter) can itself be considered a type of DDSP: it also introduces a strong dependence on f0. NSF-HifiGAN can be said to solve the staccato problem of the original HifiGAN almost perfectly, so it is very suitable for singing.
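
The source module idea can be sketched in a few lines (a simplification of the NSF paper's design, not NSF-HifiGAN's actual code): a sine excitation generated directly from f0 in voiced regions and noise in unvoiced regions, which the downstream neural filter then shapes into the waveform:

```python
import numpy as np

def nsf_source(f0, sr=44100, sine_amp=0.1, noise_std=0.003):
    """f0: (N,) sample-level pitch in Hz, 0 where unvoiced."""
    voiced = f0 > 0
    # sine at exactly the input f0, plus a little noise (voiced excitation)
    phase = 2 * np.pi * np.cumsum(f0 / sr)
    sine = sine_amp * np.sin(phase) + noise_std * np.random.randn(len(f0))
    # pure noise excitation for unvoiced regions
    noise = (sine_amp / 3) * np.random.randn(len(f0))
    return np.where(voiced, sine, noise)
```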

This project also uses the pre-trained NSF-HifiGAN vocoder as an enhancer of the raw DDSP output (the original DDSP only uses a simple STFT loss without a GAN loss, so the sound quality is not ideal).
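
For reference, the multi-scale STFT loss used by the original DDSP looks roughly like this (a generic PyTorch sketch of the recipe, not this repo's exact loss): L1 distance on linear and log magnitudes across several FFT resolutions:

```python
import torch

def multiscale_stft_loss(pred, target, fft_sizes=(2048, 1024, 512, 256)):
    """pred, target: (B, N) waveforms. L1 on linear + log magnitudes."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        S_pred = torch.stft(pred, n_fft, n_fft // 4, window=window,
                            return_complex=True).abs()
        S_tgt = torch.stft(target, n_fft, n_fft // 4, window=window,
                           return_complex=True).abs()
        loss = loss + (S_pred - S_tgt).abs().mean()              # linear magnitude
        loss = loss + (torch.log(S_pred + 1e-5)
                       - torch.log(S_tgt + 1e-5)).abs().mean()   # log magnitude
    return loss
```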

Of course, if you want to improve expressiveness further, the acoustic model itself also needs to be a generative model, so the upper limit of pure DDSP will not be very high. For example, the diffusion-based Diff-SVC project achieves much higher sound quality than this project after sufficient training.

For the last question, it is better not to have too many speakers (5 or so?), otherwise strong timbre leakage will be observed.

MuruganR96 commented 1 year ago

Thank you @yxlllc