yangdongchao / UniAudio

The Open Source Code of UniAudio
http://dongchaoyang.top/UniAudio_demo/

Issue when training the model for longer epochs #30

Open LamOne1 opened 3 months ago

LamOne1 commented 3 months ago

I'm fine-tuning the released checkpoint (~190M) on a new language for the plain_tts task. The initial results look promising, as the voice sounds natural, but they are still not satisfactory. I would greatly appreciate suggestions for fine-tuning hyperparameters, or guidance on aspects such as dataset size. Additionally, any insights you noticed during your own fine-tuning runs would be invaluable.

To improve the results, I trained the model for more epochs (more than 15). However, the generated audio then shows anomalies, such as repeated words or missing phonemes, even though the input script is correct. I suspect this is due to overfitting, so I raised the dropout to 0.7, but the issue persists. I have also noticed a recurring warning message during inference in later epochs (beyond 15), and its frequency increases with the number of epochs:

INFO [infer.py:375] (0/0) warning: invalid logp summation 32 0 tensor([0.7804], device='cuda:0')
INFO [infer.py:376] (0/0) original topk: tensor([[ -0.5859,  -1.5158,  -2.0178,  -3.0230,  -4.1113,  -4.2123,  -5.7169,
          -6.3797,  -6.4960,  -6.9256,  -7.0570,  -7.4171,  -7.7595,  -7.9360,
          -8.6797,  -8.7106,  -8.7192,  -8.7457,  -8.7556,  -9.0070,  -9.0190,
         -10.5356, -10.7244, -10.9377, -11.0772, -11.1239, -11.2510, -11.6562,
         -11.7628, -11.8229]], device='cuda:0') tensor([[ 241,   44,  271,  866,  290,  910,  183, 1084,  345,  250,  540, 1047,
          516,  594,  673,  487,  860, 1025, 1076,  716,  920,  672, 1000,  420,
          840,  206,  622,  169,  325,  521]], device='cuda:0')
INFO [infer.py:375] (0/0) warning: invalid logp summation 540 1 tensor([0.8828], device='cuda:0')
INFO [infer.py:376] (0/0) original topk: tensor([[-1.6281, -2.1522, -2.5497, -2.5561, -2.6812, -2.8340, -3.0771, -3.1544,
         -3.2050, -3.2146, -3.4744, -3.5684, -3.9655, -4.0229, -4.0986, -4.1056,
         -4.3907, -4.4246, -4.5331, -4.7468, -4.9400, -5.2673, -5.3790, -5.4710,
         -5.4988, -5.5380, -5.5408, -5.8946, -6.0102, -6.2124]],
       device='cuda:0') tensor([[1826, 3266, 1779, 1829, 1563, 1302, 1493, 2168, 1982, 1160, 1603, 2019,
         1827, 1279, 1856, 2097, 1902, 1931, 1905, 1335, 1459, 1940, 1764, 1950,
         1824, 1157, 1818, 1384, 1971, 1969]], device='cuda:0')
INFO [infer.py:375] (0/0) warning: invalid logp summation 561 2 tensor([0.8928], device='cuda:0')
INFO [infer.py:376] (0/0) original topk: tensor([[-1.5502, -1.8896, -2.0526, -2.2610, -2.5874, -2.9596, -3.6151, -3.7322,
         -3.9192, -4.3256, -4.5848, -4.6144, -4.6218, -4.8805, -4.8969, -5.0978,
         -5.6660, -5.7390, -5.7599, -5.8258, -5.8577, -5.8610, -5.9659, -5.9966,
         -6.0120, -6.0549, -6.1183, -6.2062, -6.2819, -6.3193]],
       device='cuda:0') tensor([[2859, 3146, 2536, 2915, 2755, 2497, 3195, 3119, 3169, 2620, 2556, 3050,
          757, 2558, 2641, 3364, 3327,  640, 1207, 2358, 3124,  775,  834, 3140,
         3041,  656, 3161,  413, 2387,  552]], device='cuda:0')
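The printed "original topk" values are the 30 largest log-probabilities and their token indices, so the warning appears to be a sanity check on the probability mass retained by top-k sampling (the scalar in the warning, e.g. 0.7804, would then be that retained mass). A minimal numpy sketch of this kind of check, where the function name `topk_sample`, the threshold, and the exact mass computation are my own assumptions, not UniAudio's actual `infer.py` logic:

```python
import numpy as np

def topk_sample(logits, k=30, mass_threshold=0.9, rng=None):
    """Top-k sampling with a sanity check on the retained probability mass.

    A flat (high-entropy) distribution leaves little mass in the top-k
    tokens; in TTS that uncertainty often shows up as repeated words or
    dropped phonemes, which may be why the check exists.
    """
    rng = rng or np.random.default_rng()
    # Numerically stable log-softmax over the full vocabulary.
    m = logits.max()
    logp = logits - m - np.log(np.exp(logits - m).sum())
    # Indices of the k largest log-probabilities, best first.
    topk_idx = np.argsort(logp)[-k:][::-1]
    topk_logp = logp[topk_idx]
    # Probability mass captured by the top-k tokens (at most 1.0).
    mass = float(np.exp(topk_logp).sum())
    if mass < mass_threshold:
        print(f"warning: low top-{k} probability mass {mass:.4f}")
    # Renormalize over the retained tokens and sample one of them.
    probs = np.exp(topk_logp) / np.exp(topk_logp).sum()
    return int(rng.choice(topk_idx, p=probs)), mass
```

Under this reading, the warnings firing more often in later epochs would mean the model's per-step distributions are getting flatter (less confident), which is consistent with the repetition and missing-phoneme artifacts rather than a cause of them.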