Closed: auspicious3000 closed this issue 1 year ago.
I haven't updated the paper with the camera-ready version, so it is indeed quite unclear. These parameters are only for the LJSpeech dataset, not for LibriTTS; as you pointed out, it is indeed too time-consuming to train for 200 epochs (and no improvement is seen after around 80 epochs). You should refer to the parameters of the released checkpoints instead (though they aren't the exact checkpoints I used for the paper).
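For anyone landing here later: the practical takeaway is to trust the hyperparameters bundled with the released checkpoints rather than the paper. A minimal sketch of how one might inspect them, assuming PyYAML is installed; the path follows the "Models" directory mentioned below, and the key names are my guesses, not verified against the repo:

```python
# Sketch: read the hyperparameters shipped alongside a released checkpoint.
# The path follows the pretrained "Models" directory mentioned in this
# thread; the key names are assumptions, so print the full dict if they
# differ in your checkout.
import yaml

with open("Models/LibriTTS/config.yml") as f:
    cfg = yaml.safe_load(f)

for key in ("epochs_1st", "epochs_2nd", "batch_size"):
    print(key, "=", cfg.get(key))
```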
Hi,
Thanks for releasing the code for such great work.
Could you kindly clarify the number of epochs and the batch size you used when training the LibriTTS model?
The paper reports a batch size of 64, with 200 epochs for first-stage training and 100 epochs for second-stage training. The config.yml matches the paper except for a batch size of 32, while the config.yml under the pretrained "Models" directory specifies 80 first-stage epochs and 50 second-stage epochs for LibriTTS. These disparities make it unclear which training parameters produced the LibriTTS results presented in the paper. Training for 200 epochs with a batch size of 64 on LibriTTS is quite time-consuming without a powerful GPU, so it would be very helpful to know whether fewer epochs and a smaller batch size can achieve comparable results.
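For reference, a quick way to see exactly where the two shipped configs disagree is to diff them programmatically. A minimal sketch, assuming both files sit at the paths described in this thread (the repo-root config.yml path in particular is an assumption):

```python
# Sketch: diff the top-level hyperparameters of the repo's default config
# against the one shipped with the pretrained LibriTTS checkpoint.
# Paths follow the directories mentioned in this issue; adjust as needed.
import yaml

with open("config.yml") as f:                  # repo default (assumed path)
    default_cfg = yaml.safe_load(f)
with open("Models/LibriTTS/config.yml") as f:  # pretrained-model config
    pretrained_cfg = yaml.safe_load(f)

# Report top-level keys whose values differ (e.g. epochs, batch size).
for key in sorted(set(default_cfg) | set(pretrained_cfg)):
    a, b = default_cfg.get(key), pretrained_cfg.get(key)
    if a != b:
        print(f"{key}: default={a!r}  pretrained={b!r}")
```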
Looking forward to your reply! @yl4579
Warm regards