Closed JeffC0628 closed 11 months ago
I also found that a larger n_frame degrades performance, so I think n_frame needs to be tuned. However, since the sample contains a lot of noise, other possible causes should be checked as well.
The bottleneck conversion process is important for representing the target speaker because it produces the initial converted representations. When I used a small channel size (4), the output tended to carry more target speaker information. I also tried larger channel sizes, but they tended to carry less speaker information. Based on these empirical results, I used a channel size of 4.
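To make the "channel size" idea concrete, here is a minimal sketch of a bottleneck as a per-frame linear projection (equivalent to a 1x1 convolution over time). This is not the repository's actual module: the function names, the 80-channel input, and the random weights are all assumptions for illustration only.

```python
import random

def make_projection(in_channels, out_channels, rng):
    """Hypothetical helper: random weight matrix of shape (out_channels, in_channels)."""
    scale = 1.0 / in_channels ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_channels)]
            for _ in range(out_channels)]

def project(frames, weight):
    """Apply the same linear map to every frame (a 1x1 convolution over time)."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in frames]

rng = random.Random(0)
in_channels = 80          # assumed feature size, e.g. mel bins
bottleneck_channels = 4   # the channel size discussed above

down = make_projection(in_channels, bottleneck_channels, rng)
up = make_projection(bottleneck_channels, in_channels, rng)

frames = [[rng.random() for _ in range(in_channels)] for _ in range(10)]
code = project(frames, down)    # 10 frames x 4 channels
restored = project(code, up)    # 10 frames x 80 channels
```

Everything that reaches `restored` has to pass through the 4 values per frame in `code`, which is why the bottleneck width controls how much information the later layers receive.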
Thanks for your reply. I found that a small n_frame produces various kinds of speech-like noise in the silent regions, and this noise also gets mixed into the speech, so I used the whole audio sample for training. In addition, I'm curious about how this noise is generated. Which module produces it? Looking forward to your reply.
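For readers following along, the difference between training on n_frame-sized crops and on whole samples can be sketched as below. This is only an illustration under assumptions: `crop_frames` is a hypothetical helper, not a function from the repository, and `features` stands for any per-frame feature sequence.

```python
import random

def crop_frames(features, n_frame, rng=random):
    """Return a random contiguous window of n_frame frames.

    If n_frame is None, or the utterance has at most n_frame frames,
    the whole utterance is returned (the "whole audio sample" case).
    """
    total = len(features)
    if n_frame is None or total <= n_frame:
        return list(features)
    start = rng.randrange(total - n_frame + 1)
    return features[start:start + n_frame]

# Dummy utterance: 100 frames with 3 feature values each.
utterance = [[float(i)] * 3 for i in range(100)]

short_crop = crop_frames(utterance, 32, random.Random(0))  # 32 frames
whole = crop_frames(utterance, None)                       # all 100 frames
```

With a small n_frame, a crop can fall mostly or entirely inside a silent region, so it may be worth checking how such crops are handled before attributing the noise to a specific module.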
If noise is mixed into the input, the conversion modules can reproduce it. In general, however, this kind of noise comes from the settings or from the pre- or post-processing steps. (I say this because the noise in the sample you provided is quite strong.)
Thanks for your excellent work.