yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Fine-tuning worsens the quality of speech-synthesis. #177

Closed rumbleFTW closed 10 months ago

rumbleFTW commented 11 months ago

I tried fine-tuning the model using some of my accented data. The audio synthesis is quite good as long as the reference voice is part of the fine-tuning dataset, but if any unseen or different-gender voice is used as the reference audio, the synthesized audio still sounds more similar to the fine-tuning data than to the reference audio. Is this because the model is overfitting/getting biased towards seen data?

yl4579 commented 10 months ago

Yes, if you want to generalize to unseen speakers, the model has to see a lot of speakers. The base model (LibriTTS) has around 1,000 speakers. After fine-tuning, the model will effectively fit to however many speakers are in your fine-tuning data, so if your fine-tuning data has fewer than 1,000 speakers, performance on unseen speakers will be worse than the base model's.
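For a quick sanity check on how much speaker diversity a fine-tuning set actually has, something like the sketch below counts the distinct speaker IDs in the data list. It assumes the `wav_path|phonemized_text|speaker_id` line format used by the repo's data lists and the usual `Data/train_list.txt` path; adjust both if your setup differs.

```python
# Quick sanity check: how many distinct speakers does the fine-tuning set cover?
# Assumes one utterance per line in the form "wav_path|phonemized_text|speaker_id".
from pathlib import Path

def count_speakers(train_list: str = "Data/train_list.txt") -> int:
    speakers = set()
    for line in Path(train_list).read_text(encoding="utf-8").splitlines():
        parts = line.strip().split("|")
        if len(parts) >= 3:          # skip empty or malformed lines
            speakers.add(parts[-1])  # speaker ID is the last field
    return len(speakers)

if __name__ == "__main__":
    print(f"fine-tuning data covers {count_speakers()} speaker(s)")
```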

platform-kit commented 10 months ago

@yl4579 Is there a way to fine-tune for a single speaker?

yl4579 commented 10 months ago

@platform-kit Yes, the Colab demo fine-tunes on LJSpeech. The default config is also for LJSpeech, which is a single-speaker dataset.

platform-kit commented 10 months ago

@yl4579 Ah, great news. Say I want to fine-tune on a single speaker. Can you give an estimate of how much input audio I need to train on, and how many epochs?

yl4579 commented 10 months ago

This is the result of one hour of fine-tuning: https://github.com/yl4579/StyleTTS2/discussions/65#discussioncomment-7668393. The demo uses only 15 minutes of a single speaker's speech, and the results are reasonably good (you can run the demo and judge for yourself).

rumbleFTW commented 10 months ago

@yl4579 Got it. So how do I fine-tune with multiple speakers? Do I just create the train_list.txt with multiple speaker IDs?
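For illustration, a multi-speaker train_list.txt might look like the sketch below, assuming the `wav_path|phonemized_text|speaker_id` layout from the repo's data lists; the paths, phoneme strings, and IDs here are made up.

```text
wavs/spk001_utt01.wav|ðɪs ɪz ɐ tɛst sɛntəns.|0
wavs/spk001_utt02.wav|ɐnˈʌðɚ ˈʌtɚɹəns fɹʌm ðə sˈeɪm spˈiːkɚ.|0
wavs/spk002_utt01.wav|ɐ dˈɪfɹənt spˈiːkɚ ɡɛts ɐ dˈɪfɹənt ˈaɪdˈiː.|1
wavs/spk003_utt01.wav|wˈʌn lˈaɪn pɚɹ ˈʌtɚɹəns.|2
```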

Also, I see that you are training a multilingual PL-BERT as well. Will that also add multilingual support to the zero-shot voice adaptation model?