yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
MIT License

Latent representation #47

Closed meyerdav closed 2 years ago

meyerdav commented 2 years ago

Is there a reason you preserve part of the temporal resolution (by sometimes downsampling only along the frequency axis)? What do you think would happen if the latent representation had dimensions (5,5) instead of (5,48)?

yl4579 commented 2 years ago

Sorry for the late reply; I was pretty busy at the end of my semester. The reason is to keep the receptive field more local, which makes real-time conversion possible. If you downsample along the time axis by a factor of 16 instead of 4, the results will be less local and you will need a longer buffer to maintain temporal consistency during real-time conversion. It may produce better results if you are not interested in real-time applications.
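
To make the trade-off concrete, here is a rough sketch (not code from the repository) of how the time-axis downsampling factor changes the latent shape and the amount of audio each latent frame advances by. The input size of 80 mel bins by 192 frames and the 12.5 ms hop are assumptions chosen so that the 4x case reproduces the (5,48) shape mentioned above; the helper names are hypothetical. The true receptive field is larger than this stride because of the convolution kernels, so treat the numbers as a lower bound on the buffer needed.

```python
# Illustrative only: all constants below are assumptions, not values
# confirmed by this thread.
N_MELS = 80      # assumed number of mel bins in the input spectrogram
N_FRAMES = 192   # assumed number of input frames
HOP_MS = 12.5    # assumed hop between spectrogram frames, in milliseconds
FREQ_FACTOR = 16 # frequency axis reduced 80 -> 5 in both scenarios


def latent_shape(n_mels, n_frames, freq_factor, time_factor):
    """Latent (freq, time) shape after downsampling by integer factors."""
    return n_mels // freq_factor, n_frames // time_factor


def ms_per_latent_frame(time_factor, hop_ms):
    """Audio duration that one latent frame advances by (the stride)."""
    return time_factor * hop_ms


for time_factor in (4, 16):
    shape = latent_shape(N_MELS, N_FRAMES, FREQ_FACTOR, time_factor)
    stride_ms = ms_per_latent_frame(time_factor, HOP_MS)
    print(f"time x{time_factor}: latent {shape}, {stride_ms} ms per latent frame")
```

Under these assumptions, going from 4x to 16x time downsampling quadruples the audio covered by each latent step, which is why a real-time setup would need a correspondingly longer buffer.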

MuruganR96 commented 1 year ago

@yl4579 Would increasing latent_dim lead to better results when using the mapping network? Current setting: latent_dim: 64