I noticed that the original implementation of FastSpeech (integrated in ESPNet) computes the duration loss in the log domain, meaning the logarithm of the target duration is taken first. In your version, the duration loss is computed directly in the linear domain. Do you have any thoughts on the two approaches?
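To make the comparison concrete, here is a minimal sketch of the two variants as I understand them. It is plain Python rather than code from either repository; the function names, the toy durations, and the `offset` of 1.0 (added before the logarithm to avoid `log(0)`) are my own assumptions for illustration.

```python
import math


def linear_duration_loss(pred, target):
    """MSE computed directly on frame counts (linear domain)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def log_duration_loss(pred_log, target, offset=1.0):
    """MSE between predicted log-durations and log(target + offset).

    `offset` is an assumed constant that keeps the log finite for
    zero-length durations.
    """
    return sum(
        (p - math.log(t + offset)) ** 2 for p, t in zip(pred_log, target)
    ) / len(target)


# Hypothetical durations in frames: one short token and one long token.
target = [2, 40]
pred = [4, 80]  # every prediction is twice its target

linear_loss = linear_duration_loss(pred, target)
log_loss = log_duration_loss([math.log(p + 1.0) for p in pred], target)

print(linear_loss, log_loss)
```

In this toy case the linear-domain loss is dominated by the long token, while the log-domain loss weights the two tokens much more evenly, since it penalizes something closer to the relative error. That difference is what I am curious about.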