yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
MIT License

Doubts about sampling rate #10

Closed 980202006 closed 2 years ago

980202006 commented 2 years ago

I noticed that the torchaudio.transforms.MelSpectrogram you use is configured with a 16000 Hz sampling rate, but the wav files are read at 24000 Hz. In other words, a mel transform defined for 16,000 Hz audio is applied to 24,000 Hz audio, and the result is used as the target. Does this affect the results?

yl4579 commented 2 years ago

It doesn't matter, because in the end the result is simply that the audio is downsampled to 16 kHz: setting sr = 16,000 means the audio is mapped into a range of 0 to 8 kHz. However, since ParallelWaveGAN is able to upsample, i.e. it can synthesize waveforms at a higher sampling rate even if the input mel-spectrograms are at a lower sampling rate, this has barely any effect on the generated audio.
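As a rough illustration of the "mapped into a range of 0 to 8 kHz" part, here is a minimal sketch in plain Python. It assumes the usual STFT relation that bin k stands for k * sr / n_fft Hz. The FFT size of 2048 and the helper names are assumptions for illustration, not taken from the repository. The sketch shows that a tone at 6 kHz in 24 kHz audio lands in a bin that a transform configured for 16 kHz labels as 4 kHz.

```python
# Hypothetical constants: the two rates come from the thread, N_FFT is assumed.
ACTUAL_SR = 24_000   # sampling rate of the wav files
NOMINAL_SR = 16_000  # sampling rate the mel transform is configured with
N_FFT = 2048         # assumed FFT size, for illustration only


def hz_to_bin(freq_hz, sr, n_fft=N_FFT):
    """Index of the FFT bin closest to freq_hz for audio sampled at sr."""
    return round(freq_hz * n_fft / sr)


def bin_to_hz(k, sr, n_fft=N_FFT):
    """Frequency that bin k stands for under a given sampling rate."""
    return k * sr / n_fft


# A 6 kHz tone in the real 24 kHz audio falls into this bin ...
k = hz_to_bin(6000.0, ACTUAL_SR)
# ... which is read as this frequency when the rate is declared as 16 kHz.
nominal_hz = bin_to_hz(k, NOMINAL_SR)

# The last bin (Nyquist) under both interpretations.
nyquist_actual = bin_to_hz(N_FFT // 2, ACTUAL_SR)
nyquist_nominal = bin_to_hz(N_FFT // 2, NOMINAL_SR)

print(k, nominal_hz, nyquist_actual, nyquist_nominal)
```

Under these assumptions every frequency is relabeled by a factor of 16/24, so the full 0 to 12 kHz content of the STFT is placed on a nominal 0 to 8 kHz axis.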

980202006 commented 2 years ago

Thank you!

980202006 commented 2 years ago

Putting aside the vocoder, what does the sampling rate affect in the mel calculation? If a mel transform configured for 16000 Hz is applied to a 24000 Hz wav file, is the result still a usable mel spectrogram?

yl4579 commented 2 years ago

I think this will be helpful: https://stackoverflow.com/questions/57053654/why-my-8khz-wav-files-mel-feature-extracted-differently-in-sr-16khz-and-44-1k

980202006 commented 2 years ago

That's great. Thank you!

iehppp2010 commented 2 years ago

> It doesn't matter, because in the end the result is simply that the audio is downsampled to 16 kHz: setting sr = 16,000 means the audio is mapped into a range of 0 to 8 kHz. However, since ParallelWaveGAN is able to upsample, i.e. it can synthesize waveforms at a higher sampling rate even if the input mel-spectrograms are at a lower sampling rate, this has barely any effect on the generated audio.

I want to know how you trained the ParallelWaveGAN vocoder model linked in the README. Was it trained on mel-spectrograms of 16 kHz audio and then used to synthesize 24 kHz audio?

yl4579 commented 2 years ago

@iehppp2010 No, the STFT was done with 24 kHz audio and mel-scale was done with a 16 kHz scale. See #8 for preprocessing.
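One way to read "STFT at 24 kHz, mel scale at 16 kHz" is sketched below in plain Python. It assumes an HTK-style mel formula with filters spanning 0 Hz to half the configured rate, and 80 mel bands. These values and the helper names are assumptions for illustration, so check #8 and the repository for the actual preprocessing settings. The sketch computes the filter center frequencies as the 16 kHz configuration labels them, then the frequencies they actually cover when the bins come from 24 kHz audio.

```python
import math

# Hypothetical constants: the two rates come from the thread, N_MELS is assumed.
ACTUAL_SR = 24_000   # rate of the audio fed to the STFT
NOMINAL_SR = 16_000  # rate used to lay out the mel filters
N_MELS = 80          # assumed number of mel bands


def hz_to_mel(freq_hz):
    """HTK-style Hz to mel conversion."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_centers_hz(sr, n_mels=N_MELS):
    """Center frequencies of n_mels filters spaced evenly in mel from 0 to sr/2."""
    step = hz_to_mel(sr / 2) / (n_mels + 1)
    return [mel_to_hz(step * i) for i in range(1, n_mels + 1)]


# Centers as labeled by the 16 kHz configuration.
nominal = mel_centers_hz(NOMINAL_SR)

# The filters are applied to bins that really span 0 to 12 kHz, so each
# center sits at ACTUAL_SR / NOMINAL_SR times its nominal frequency.
scale = ACTUAL_SR / NOMINAL_SR
actual = [f * scale for f in nominal]

print(f"top filter: nominal {nominal[-1]:.0f} Hz, actual {actual[-1]:.0f} Hz")
```

Under these assumptions the top filters reach above 8 kHz in real frequency, so the features are a mel-like warping of the full 24 kHz spectrum rather than a true 16 kHz mel spectrogram. Any vocoder would then need to be trained on features extracted the same way.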