Closed 980202006 closed 2 years ago
It doesn't matter. In the end, you are effectively downsampling the audio to 16 kHz, since setting sr = 16,000 maps your audio into the 0 to 8 kHz range. And because ParallelWaveGAN can upsample, i.e. it can synthesize waveforms at a higher sampling rate even when the input mel spectrograms were computed at a lower one, this barely affects the generated audio.
Thank you!
Putting the vocoder aside, what does the sampling rate affect in the mel calculation? If a mel transform configured for a 16000 Hz sampling rate is applied to a 24000 Hz wav file, is the result still a usable mel spectrogram?
That's great. Thank you!
I want to know how you trained the ParallelWaveGAN vocoder model that you link to in the README. Was it trained on mel spectrograms of 16 kHz audio and then used to synthesize 24 kHz audio?
@iehppp2010 No, the STFT was computed on 24 kHz audio and the mel scaling was applied with a 16 kHz setting. See #8 for preprocessing.
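To make the "24 kHz STFT, 16 kHz mel scale" point concrete, here is a rough stdlib-only sketch, not the repository's preprocessing code. It assumes the mel filterbank is laid out the way torchaudio's default (HTK) filterbank is, as far as I understand it: filter centers are spaced evenly on the mel axis between 0 and `sample_rate // 2`, and the STFT bins are assumed to span that same range. `n_mels = 80` and the helper names are illustrative assumptions. Under that reading, the STFT itself does not depend on the configured rate, so the only effect is that each mel filter sits at 1.5x its nominal frequency when the audio is really 24 kHz.

```python
import math


def hz_to_mel(f_hz):
    # HTK mel formula
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of the HTK mel formula
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_freqs(n_mels, assumed_sr, f_min=0.0):
    # Nominal filter centers: evenly spaced in mel between f_min and assumed_sr // 2
    f_max = assumed_sr // 2
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]


def effective_center_freqs(n_mels, assumed_sr, actual_sr):
    # The filterbank treats the STFT bins as covering 0..assumed_sr // 2,
    # but for audio sampled at actual_sr they really cover 0..actual_sr / 2.
    scale = (actual_sr / 2) / (assumed_sr // 2)
    return [f * scale for f in mel_center_freqs(n_mels, assumed_sr)]


n_mels = 80  # assumed value, not taken from the thread
nominal = mel_center_freqs(n_mels, 16000)
effective = effective_center_freqs(n_mels, 16000, 24000)
print("highest nominal center (Hz):", round(nominal[-1]))
print("highest effective center (Hz):", round(effective[-1]))
```

If this reading is right, the features are a consistently warped mel spectrogram of the 24 kHz audio, which a vocoder trained on the same features can still use as its conditioning input.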
I noticed that the torchaudio.transforms.MelSpectrogram you use is configured with a 16000 sampling rate, but the wav that is read has a 24000 sampling rate. In other words, you compute a mel spectrogram with a 16,000 sampling rate setting for audio sampled at 24,000 and use it as the target. Will this affect the results?