question about model size and result

winddori2002 / TriAAN-VC

TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion

MIT License


question about model size and result #17

Closed JeffC0628 closed 11 months ago

JeffC0628 commented 11 months ago

Thanks for your excellent work.

I get noise in the VC result. What causes the noise, and how can I remove it? Note that I train the model on full-length audio samples instead of 128-frame (n_frames) slices. VC result: https://github.com/winddori2002/TriAAN-VC/assets/24654967/795df989-9a2a-4812-8d84-764c19698665
I also noticed that the output channel size of the speaker and content encoders is 4. Why did you choose such a small number? Have you tried other values such as 128 or 256?

winddori2002 commented 11 months ago

I also found that a larger n_frame degrades performance, so I think n_frame needs to be adjusted. However, the sample contains so much noise that other possible causes should be checked as well.
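For readers following the thread, the difference between training on fixed-length slices and on full utterances can be sketched as below. This is only an illustration, not the repository's data loader: the function name `random_crop_frames`, the list-of-frames layout, and the zero-padding of short utterances are all assumptions made for the example.

```python
import random


def random_crop_frames(features, n_frames=128, rng=None):
    """Return a contiguous window of exactly n_frames frames.

    features: time-major sequence of per-frame feature vectors
              (hypothetical layout, e.g. one mel vector per frame).
    Utterances shorter than n_frames are zero-padded at the end
    (an assumption; the actual project may handle this differently).
    """
    rng = rng if rng is not None else random
    total = len(features)
    if total == 0:
        raise ValueError("features must contain at least one frame")

    if total <= n_frames:
        # Pad short utterances with all-zero frames up to n_frames.
        dim = len(features[0])
        pad = [[0.0] * dim for _ in range(n_frames - total)]
        return [list(frame) for frame in features] + pad

    # Pick a random start so the window fits inside the utterance.
    start = rng.randrange(total - n_frames + 1)
    return [list(frame) for frame in features[start:start + n_frames]]
```

Calling `random_crop_frames(features, n_frames=128)` once per training step yields a different 128-frame window each time, whereas passing the full `features` sequence to the model corresponds to the full-length setup described in the question.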

The bottleneck conversion process is important for representing the target speaker because it produces the initial conversion representations. With a small channel size (4), the model tends to capture more target speaker information. I also tried larger channel sizes, but they tended to carry less speaker information. Based on these empirical results, I used a channel size of 4.

JeffC0628 commented 11 months ago

Thanks for your reply. I found that a small n_frame generates various kinds of speech noise in the silent regions, and those noises also get mixed into the speech, so I used the whole audio sample for training. I'm also curious about how these noises are generated: which module produces them? Looking forward to your reply.

winddori2002 commented 11 months ago

If some noise is mixed in, the conversion modules can produce it. In general, however, this kind of noise comes from the settings and the pre- or post-processing steps, especially since the sample you provided is very noisy.