Closed meyerdav closed 2 years ago
Sorry for the late reply. I was pretty busy at the end of my semester. The reason was to keep the receptive field more local to make real-time conversion possible. If you downsample in the time axis for 16 times instead 4 times, the results will be less local and you will need a longer buffer to keep the time consistency when doing real-time conversion. It may produce better results if you are not interested in real-time applications.
@yl4579 increasing the latent_dim dimension will lead to better results when using mapping network?latent_dim: 64
Is there a reason why you maintain parts of the temporal resolution (by sometimes only downsampling the frequency direction)? What would you assume would happen if instead of (5,48), the latent representation had a dimension of (5,5)?