yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
MIT License

Attempting live inference. #15

Closed Pathos0925 closed 2 years ago

Pathos0925 commented 2 years ago

Would you have any optimization tips for using this with live audio? I am processing 200 ms chunks at a time. The F0 model performed poorly on chunks that short, so I now append each new chunk to the last ~2 seconds of audio, convert that, and keep only the end, which gives better results, but there are still some artifacts. How can I get the end of one converted frame to match up more closely with the start of the next? Right now it sounds like this: https://drive.google.com/file/d/1TTvrysYW4khnQ7eJVDpJrS0OCobV4O0o/view?usp=sharing The spectrogram: https://i.imgur.com/LC1YbLI.png
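
For reference, the rolling-context scheme described above could be sketched roughly like this (a sketch only; `convert`, the sample rate, and the buffer sizes are my assumptions, not code from this repo):

```python
import numpy as np

SR = 24000                  # assumed model sample rate
CHUNK = int(0.2 * SR)       # 200 ms of new audio per step
CONTEXT = int(2.0 * SR)     # ~2 s rolling context for the F0 model

def stream_convert(chunks, convert):
    """Convert 200 ms chunks using a ~2 s rolling context; yield only the
    newly converted tail of each conversion (`convert` is hypothetical)."""
    context = np.zeros(0, dtype=np.float32)
    for chunk in chunks:
        # Append the new chunk and trim the buffer to the last ~2 s.
        context = np.concatenate([context, chunk])[-CONTEXT:]
        converted = convert(context)   # run conversion on context + chunk
        yield converted[-CHUNK:]       # keep only the newest 200 ms
```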

Thank you!

yl4579 commented 2 years ago

I think this is the common boundary effect in real-time audio processing caused by incorrect phase alignment. The easiest way to deal with it is a weighted average over the region where the chunks misalign, maybe just a few hundred samples. It can be realized with the following code, assuming you use an overlap window of 300 samples:

import numpy as np

# Linearly cross-fade the first 300 samples of the new chunk
# with the last 300 samples of the previous chunk.
buffer_weight = np.linspace(1, 0, 300)
new_wave[0:300] = buffer_weight * prev_wave[-300:] + (1 - buffer_weight) * new_wave[0:300]

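Wrapped as a reusable helper (a sketch; the function name, the `.copy()`, and the default overlap are my additions), the same cross-fade looks like:

```python
import numpy as np

def crossfade(prev_wave, new_wave, overlap=300):
    """Linearly cross-fade the start of new_wave into the tail of prev_wave.

    The fade weight runs from 1 (all previous chunk) down to 0 (all new
    chunk) across `overlap` samples, hiding the chunk boundary.
    """
    w = np.linspace(1, 0, overlap)
    out = new_wave.copy()
    out[:overlap] = w * prev_wave[-overlap:] + (1 - w) * out[:overlap]
    return out
```

At the boundary the output starts exactly at the previous chunk's last value and decays smoothly into the new chunk, so there is no audible discontinuity.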
Pathos0925 commented 2 years ago

This has provided a significant improvement in the quality, thank you!

dragen1860 commented 2 years ago

@Pathos0925 Hi, I am also interested in a live inference demo. Could you kindly share your live inference code? Thank you.