Open Ryu1845 opened 9 months ago
Thanks for letting me know about this work. I wasn't aware of it beforehand, so I didn't compare against it. I think it's still quite different from Vocos, because in this work we optimize quality over speed, while Vocos optimizes speed over quality.
In the paper, the authors show that Vocos is four times faster than iSTFTNet with performance comparable to BigVGAN-base (I believe "BigVGAN" in the paper refers to BigVGAN-base, because BigVGAN has 114M parameters while the paper lists it with only 14M), whereas our work is nearly twice as slow as iSTFTNet but significantly outperforms BigVGAN-base, with performance comparable to BigVGAN.
I tried Vocos myself and perceptually it sounds slightly worse than HiFTNet, but it is indeed much, much faster. I think one big advantage of HiFTNet is that it works well for singing synthesis, while Vocos lags behind because it does not have the hn-NSF. But overall, if you care more about speed, Vocos is definitely a much better choice.
Vocos: https://drive.google.com/file/d/1GTZaNlukv0jkNStPJ644oD1s2RJ2GEZW/view?usp=sharing
HiFTNet: https://drive.google.com/file/d/1Phu9Z3Q55L08uWd3RKw9q3rVT3DrczWe/view?usp=sharing
BigVGAN (not base): https://drive.google.com/file/d/1r-qYcRqk7Qt90Ik55msVlwKhyjcsL787/view?usp=sharing
I will leave this issue open in case someone is interested in comparing it to Vocos. For singing synthesis, someone could probably combine the two to make a fast, high-quality singing vocoder. I developed this vocoder primarily for singing voice conversion with SLMGAN, but there was no singing data to compare on, so I compared on LJSpeech and LibriTTS instead.
Thank you for your quick reply! This is a great comparison. I can definitely see your work being better for singing synthesis, considering it uses NSF. I'm looking forward to an eventual fast HQ singing vocoder!
I think it's a good idea. I'll try combining the two and test the result against Vocos, to see whether the quality is better while still offering a significant speed improvement. If it works well I'll add it to the paper later.
I tried incorporating hn-NSF into Vocos, but the quality is worse than without it. I think this could be related to how the source is fed into the model (for example, applying an STFT to it first). It doesn't seem to be a trivial task, so more experiments are needed. If anyone else has time, please take a look.
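For anyone who wants to experiment with this, here is a minimal sketch of what "STFT before feeding it" could look like: build a sine-plus-noise excitation from frame-level F0 and convert it to magnitude and phase frames that line up with the mel frames. This is a simplified stand-in and not the hn-NSF module from HiFTNet or anything from the Vocos codebase. The function names, the amplitude constants (0.1, 0.003), and the settings `sr=22050`, `hop=256`, `n_fft=1024` are all assumptions chosen for illustration.

```python
import numpy as np

def harmonic_source(f0_frames, sr=22050, hop=256, n_harmonics=8,
                    noise_std=0.003, seed=0):
    """Simplified sine-plus-noise excitation from frame-level F0 (0 = unvoiced).

    Illustrative only: amplitudes and harmonic count are assumed values.
    """
    rng = np.random.default_rng(seed)
    # Hold each frame's F0 for `hop` samples to reach the audio sample rate.
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop)
    voiced = f0 > 0
    # Integrate instantaneous frequency to get the fundamental's phase.
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    sines = np.zeros_like(f0)
    for k in range(1, n_harmonics + 1):
        # Keep a harmonic only where it is voiced and below Nyquist.
        keep = voiced & (k * f0 < sr / 2)
        sines += np.where(keep, np.sin(k * phase), 0.0)
    sines = 0.1 * sines / n_harmonics
    noise = rng.normal(0.0, 1.0, size=f0.shape)
    # Voiced: harmonics plus a little noise. Unvoiced: noise only.
    return np.where(voiced, sines + noise_std * noise, (0.1 / 3.0) * noise)

def stft_features(x, n_fft=1024, hop=256):
    """Centered STFT with a Hann window; returns (magnitude, phase)."""
    window = np.hanning(n_fft)
    x = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)
    return np.abs(spec), np.angle(spec)

# Toy input: 10 unvoiced frames followed by 40 voiced frames at 220 Hz.
f0 = [0.0] * 10 + [220.0] * 40
src = harmonic_source(f0)
mag, phase = stft_features(src)
```

With the same hop size as the mel-spectrogram, the resulting frames can be concatenated with, or added to, the acoustic features along the channel axis. Whether that actually helps quality is exactly the open question here.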
> I think one big advantage of HiFTNet is it works well for singing synthesis while Vocos lags behind because it does not have the hn-NSF.
The 22 kHz sample rate of the pretrained HiFTNet models is a bit low for my purpose. I think 32 kHz is what I would need. Vocos also only supports 24 kHz. Would you recommend retraining with different parameters, or using some kind of upsampling model at the end? Speed is not so important in my case.
@TechInterMezzo If speed is not a concern, I would recommend you just train an NSF-BigVGAN with the current setup (i.e., a pre-trained F0 network to extract F0). Basically you add NSF to BigVGAN with F0 extracted using a pre-trained F0 network on mel-spectrograms.
Hi, which vocoder do you think has the best quality for 44.1 kHz waveform output? Thank you!
Hi! I wanted to know if you know about Vocos and if you compared to it since it uses similar principles and has similar results.