yl4579 / HiFTNet

HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform
MIT License
Comparison to Vocos? #1

Open Ryu1845 opened 9 months ago

Ryu1845 commented 9 months ago

Hi! I wanted to know if you know about Vocos and if you compared to it since it uses similar principles and has similar results.

yl4579 commented 9 months ago

Thanks for pointing me to this work. I wasn't aware of it beforehand, so I didn't compare against it. I think it's still quite different from Vocos, because in this work we prioritize quality over speed, while Vocos prioritizes speed over quality.

In the Vocos paper, the author shows that Vocos is four times faster than iSTFTNet, with performance comparable to BigVGAN-base (I believe the "BigVGAN" in the paper refers to BigVGAN-base, because the full BigVGAN has 114M parameters while the paper lists only 14M). Our work, by contrast, is nearly twice as slow as iSTFTNet but significantly outperforms BigVGAN-base, with performance comparable to the full BigVGAN.

I tried Vocos myself, and perceptually it sounds slightly worse than HiFTNet, but it is indeed much faster. I think one big advantage of HiFTNet is that it works well for singing synthesis, while Vocos lags behind because it does not have the hn-NSF (harmonic-plus-noise neural source filter). Overall, though, if you care more about speed, Vocos is definitely a much better choice.

Vocos: https://drive.google.com/file/d/1GTZaNlukv0jkNStPJ644oD1s2RJ2GEZW/view?usp=sharing
HiFTNet: https://drive.google.com/file/d/1Phu9Z3Q55L08uWd3RKw9q3rVT3DrczWe/view?usp=sharing
BigVGAN (not base): https://drive.google.com/file/d/1r-qYcRqk7Qt90Ik55msVlwKhyjcsL787/view?usp=sharing

yl4579 commented 9 months ago

I will leave this issue open in case someone is interested in comparing it to Vocos. For singing synthesis, someone could probably combine the two to make a fast, high-quality singing vocoder. I developed this vocoder primarily for singing voice conversion with SLMGAN, but there is no singing data to actually compare on, so I just compared on LJSpeech and LibriTTS instead.

Ryu1845 commented 9 months ago

Thank you for your quick reply! This is a great comparison. I can definitely see your work being better for singing synthesis, considering it uses NSF. I'm looking forward to an eventual fast HQ singing vocoder!

yl4579 commented 9 months ago

I think it's a good idea. I'll try to combine the two, test the result against Vocos, and see whether it is better while still being significantly faster. If it works well, I'll add it to the paper later.

yl4579 commented 9 months ago

I have tried to incorporate hn-NSF into Vocos, but the quality is worse than without it. I think this could be related to how the source is fed into the model (for example, applying an STFT to it first). It doesn't seem to be a trivial task, so more experiments are needed. If anyone else has time, please take a look.
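Editor's sketch of the "STFT before feeding it" idea mentioned above. Vocos works on frame-rate features and only returns to the waveform through an inverse STFT, so a sample-rate excitation signal has to be brought down to frame rate somehow. One option is to take its STFT and pass magnitude and phase as extra input channels. The function names `stft` and `source_features`, the tiny `n_fft=16`, and the hop of 4 are all assumptions chosen for illustration; this is not the code yl4579 tried, and a real experiment would use the vocoder's own FFT size and hop length (for example via `torch.stft`).

```python
import cmath
import math


def stft(signal, n_fft=16, hop_length=4):
    """Naive Hann-windowed STFT.

    Returns one list of (n_fft // 2 + 1) complex bins per frame.
    Written with an O(n_fft^2) DFT for readability, not speed.
    """
    # Periodic Hann window
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        chunk = [signal[start + n] * window[n] for n in range(n_fft)]
        bins = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ]
        frames.append(bins)
    return frames


def source_features(signal, n_fft=16, hop_length=4):
    """Split each STFT frame of the source into magnitude and phase channels.

    Hypothetical helper: the two outputs are frame-rate features that could be
    concatenated with the mel-spectrogram input of a frame-rate vocoder.
    """
    spec = stft(signal, n_fft, hop_length)
    mags = [[abs(c) for c in frame] for frame in spec]
    phases = [[cmath.phase(c) for c in frame] for frame in spec]
    return mags, phases
```

For the frame counts to line up with the mel-spectrogram, the hop length used here would have to match the hop length the vocoder was trained with.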

TechInterMezzo commented 8 months ago

Quoting yl4579: "I think one big advantage of HiFTNet is it works well for singing synthesis while Vocos lags behind because it does not have the hn-NSF."

The 22 kHz sampling rate of the pretrained HiFTNet models is a bit low for my purpose. I think 32 kHz is what I would need. Vocos also only supports 24 kHz. Would you recommend retraining with different parameters, or using some kind of upsampling model at the end? Speed is not so important in my case.
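Editor's note on the "retraining with different parameters" option: when the sampling rate changes, the frame parameters are usually rescaled so that each frame still covers about the same duration. The sketch below does that arithmetic. The starting values (hop 256, window 1024, FFT size 1024 at 22050 Hz) are assumed here as typical for 22 kHz vocoder configs and should be checked against the actual HiFTNet config file; `scaled_frame_params` is a hypothetical helper name.

```python
def scaled_frame_params(src_rate, dst_rate, hop_length, win_length, n_fft):
    """Rescale frame parameters so frame durations stay roughly constant.

    All inputs are in samples (rates in Hz). The returned values are only a
    starting point: in practice they are rounded to sizes the model supports.
    """
    scale = dst_rate / src_rate
    return {
        "hop_length": round(hop_length * scale),
        "win_length": round(win_length * scale),
        "n_fft": round(n_fft * scale),
        # Highest frequency representable at the new rate (Nyquist limit)
        "nyquist_hz": dst_rate / 2,
        # Frame shift in milliseconds before the change, for comparison
        "src_hop_ms": 1000 * hop_length / src_rate,
    }


# Assumed 22.05 kHz settings, rescaled to 32 kHz
params = scaled_frame_params(22050, 32000, hop_length=256, win_length=1024, n_fft=1024)
print(params)
```

The raw results (hop 372, window 1486) are awkward sizes. In generators of this family the hop length generally has to equal the product of the upsampling factors (times the iSTFT hop, where one is used), so a nearby value that factors cleanly is normally chosen, and the mel-spectrogram frequency range is widened to make use of the higher Nyquist limit.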

yl4579 commented 8 months ago

@TechInterMezzo If speed is not a concern, I would recommend that you just train an NSF-BigVGAN with the current setup (i.e., a pre-trained F0 network to extract F0). Basically, you add NSF to BigVGAN, with F0 extracted from the mel-spectrograms by a pre-trained F0 network.
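Editor's sketch of the source signal that "adding NSF" refers to: a harmonic-plus-noise excitation built from frame-level F0, which the vocoder then filters. This is a minimal pure-Python illustration, not the HiFTNet or BigVGAN code. The function name `hn_source` and every default (hop length, number of harmonics, amplitudes, noise levels) are assumptions chosen for readability; real implementations are vectorized and typically learn how to mix the harmonics.

```python
import math
import random


def hn_source(f0_frames, sample_rate=22050, hop_length=256,
              num_harmonics=4, sine_amp=0.1, noise_std=0.003, seed=0):
    """Build a sample-level harmonic-plus-noise excitation from frame-level F0.

    f0_frames: F0 in Hz, one value per frame; 0 marks an unvoiced frame.
    Voiced frames get a sum of harmonics plus a little noise;
    unvoiced frames get noise only.
    """
    rng = random.Random(seed)
    phase = 0.0  # running phase of the fundamental, in cycles
    out = []
    for f0 in f0_frames:
        # Nearest-neighbour upsampling: hold each frame's F0 for hop_length samples
        for _ in range(hop_length):
            if f0 > 0:
                # Accumulate phase so the sine stays continuous across frames
                phase = (phase + f0 / sample_rate) % 1.0
                harmonics = 0.0
                for k in range(1, num_harmonics + 1):
                    if k * f0 < sample_rate / 2:  # skip harmonics above Nyquist
                        harmonics += math.sin(2 * math.pi * k * phase)
                sample = sine_amp * harmonics + rng.gauss(0.0, noise_std)
            else:
                sample = rng.gauss(0.0, sine_amp / 3)
            out.append(sample)
    return out


# Two voiced frames at 220 Hz followed by one unvoiced frame
source = hn_source([220.0, 220.0, 0.0], hop_length=100)
print(len(source))
```

In the setup described above, `f0_frames` would come from the pre-trained F0 network applied to the mel-spectrogram rather than from a hand-written list.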

bzp83 commented 2 weeks ago

Hi, which vocoder do you think has the best quality for 44100 Hz waveform output? Thank you!