Souvic closed this issue 1 year ago
Hi,
The trainable model parameters are affected by both X and Y. For example, the objective of the vocoder is to reconstruct waveforms (Y) from X (the mel-spectrogram). The vocoder's parameters are trained to interpret the representation of X, i.e. a mel-spectrogram computed with one specific setting. That is why a trained vocoder only works with those specific settings (n_fft, window_length, etc.). The same applies to the VC model, so the mel-spectrogram resolution must match between the VC model and the vocoder.
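To see concretely how those settings change the input the vocoder receives, here is a minimal numpy sketch (not the repo's actual feature-extraction code) that computes a framed magnitude spectrogram under two different (n_fft, hop_length, win_length) configurations. The same one-second signal yields arrays of completely different shapes, so a vocoder trained on one grid cannot directly consume the other:

```python
import numpy as np

def spectrogram(y, n_fft, hop_length, win_length):
    """Magnitude spectrogram via a plain framed FFT (illustrative only)."""
    window = np.hanning(win_length)
    frames = []
    for start in range(0, len(y) - win_length + 1, hop_length):
        frame = y[start:start + win_length] * window
        # zero-pad the windowed frame up to n_fft before the FFT
        frame = np.pad(frame, (0, n_fft - win_length))
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames, axis=1)  # shape: (n_fft // 2 + 1, n_frames)

sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone

A = spectrogram(y, n_fft=1024, hop_length=256, win_length=1024)
B = spectrogram(y, n_fft=2048, hop_length=512, win_length=2048)
print(A.shape)  # (513, 83)  -- 513 frequency bins, 83 frames
print(B.shape)  # (1025, 40) -- different time and frequency resolution
```

Doubling n_fft doubles the number of frequency bins, and doubling hop_length halves the number of frames, so the two spectrograms describe the same audio on incompatible grids; the vocoder's learned filters are tied to one of them.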
Okay, thanks for clarifying.
I tried using other vocoders, such as BigVGAN trained on combined data, HiFi-GAN, and even PWG from https://github.com/kan-bayashi/ParallelWaveGAN . All outputs are noisy. Why are window_size, n_fft, and hop_length so important? The output should not vary that much, right? This is just computing the FFT over different short time windows, which should preserve the frequency range more or less, with only a resolution difference in the time/frequency space.