winddori2002 / TriAAN-VC

TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion
MIT License

changing window size and n_fft #7

Closed Souvic closed 1 year ago

Souvic commented 1 year ago

I tried other vocoders, such as BigVGAN trained on combined data, HiFiGAN, and even PWG from https://github.com/kan-bayashi/ParallelWaveGAN. All outputs are noisy. Why are window size, n_fft, and hop length so important? Shouldn't the output stay roughly the same, since this only computes the FFT over different short-time windows, which should more or less preserve the frequency range and change only the time/frequency resolution?

winddori2002 commented 1 year ago

Hi,

The trainable model parameters depend on both the input X and the target Y. For example, the vocoder's objective is to reconstruct waveforms (Y) from mel-spectrograms (X), so its parameters are trained to extract representations from mel-spectrograms computed with one specific setting. That is why a trained vocoder works with the specific settings (n_fft, window_length) it was trained on. The same holds for the VC model, so the mel-spectrogram resolution must match between the VC model and the vocoder.
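As a rough illustration of the mismatch described here, the sketch below compares two mel-spectrogram configurations and shows how the frame count and frequency-bin width change with the settings. The `MelConfig` class, the helper names, and all numeric values are hypothetical and are not taken from TriAAN-VC or from any vocoder's actual config. The frame-count formula assumes a centered STFT.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class MelConfig:
    """Hypothetical container for the settings that define a mel-spectrogram."""
    sample_rate: int
    n_fft: int
    win_length: int
    hop_length: int
    n_mels: int


def mismatched_fields(vc: MelConfig, vocoder: MelConfig) -> List[str]:
    """Return the names of the settings on which the two configs disagree."""
    return [f.name for f in fields(MelConfig)
            if getattr(vc, f.name) != getattr(vocoder, f.name)]


def num_frames(n_samples: int, hop_length: int) -> int:
    """Frame count of a centered STFT (signal padded by n_fft // 2 per side)."""
    return 1 + n_samples // hop_length


def bin_width_hz(cfg: MelConfig) -> float:
    """Spacing in Hz between adjacent linear-frequency STFT bins."""
    return cfg.sample_rate / cfg.n_fft


# Illustrative values only; check the real configs of both models.
vc_cfg = MelConfig(sample_rate=16000, n_fft=400, win_length=400,
                   hop_length=160, n_mels=80)
vocoder_cfg = MelConfig(sample_rate=22050, n_fft=1024, win_length=1024,
                        hop_length=256, n_mels=80)

diff = mismatched_fields(vc_cfg, vocoder_cfg)
if diff:
    print("Mel settings differ on:", ", ".join(diff))

# One second of 16 kHz audio yields a different number of frames per setting,
# so the same mel frame covers a different stretch of waveform in each model.
print(num_frames(16000, vc_cfg.hop_length), "frames vs",
      num_frames(16000, vocoder_cfg.hop_length), "frames")
print(bin_width_hz(vc_cfg), "Hz per bin vs",
      round(bin_width_hz(vocoder_cfg), 2), "Hz per bin")
```

Even when `n_mels` is identical, as in this example, the two spectrograms have the same number of channels but describe different time spans and frequency resolutions per value, which is the kind of input a vocoder trained on one setting has not seen.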

Souvic commented 1 year ago

Okay, thanks for clarifying.