yangdongchao / UniAudio

The Open Source Code of UniAudio
http://dongchaoyang.top/UniAudio_demo/

How long did the model take to train? #27

Open signofthefour opened 1 month ago

signofthefour commented 1 month ago

Dear team,

Thank you for introducing this amazing work to the world.

Could you please tell me how long it took to train the model? I am reproducing the results with a different setting, so I would like to know the expected training time to make sure everything is correct. By the way, did you track how many training steps UniAudio needs before it generates plausible speech on speech-synthesis tasks? Are you still planning to release the pretrained model?

yangdongchao commented 2 weeks ago

> Dear team,
>
> Thank you for introducing this amazing work to the world.
>
> Could you please tell me how long it took to train the model? I am reproducing the results with a different setting, so I would like to know the expected training time to make sure everything is correct. By the way, did you track how many training steps UniAudio needs before it generates plausible speech on speech-synthesis tasks? Are you still planning to release the pretrained model?

Yes, I am working on it. I have trained a better codec model, which raises the upper limit of the generation quality. I am now focusing on collecting more data and looking for GPUs. As for training time, I train the model for one epoch on more than 10k hours of data.
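For a rough sense of scale, here is a back-of-envelope estimate of how many optimizer steps one epoch over 10k hours implies. The codec frame rate, number of codebooks, and effective batch size below are illustrative assumptions, not UniAudio's actual configuration:

```python
# Back-of-envelope estimate of training steps for one epoch.
# All numbers below are assumptions for illustration only.

hours = 10_000                  # reported dataset size (>10k hours)
frame_rate_hz = 50              # assumed codec frame rate
codebooks = 3                   # assumed number of residual codebooks
tokens_per_second = frame_rate_hz * codebooks

total_tokens = hours * 3600 * tokens_per_second
tokens_per_step = 2_000_000     # assumed effective batch size in tokens

steps_per_epoch = total_tokens / tokens_per_step
print(f"~{total_tokens / 1e9:.1f}B tokens, ~{steps_per_epoch:,.0f} steps/epoch")
```

With these assumed numbers, one epoch works out to roughly 5.4B audio tokens and a few thousand steps; plugging in your own frame rate and batch size gives a quick sanity check on wall-clock time.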

patriotyk commented 2 weeks ago

@christophschuhmann Could you help with GPUs?

signofthefour commented 2 weeks ago

@yangdongchao Thank you for the confirmation. We can reproduce TTS-only models with just 4 GPUs, and the quality is acceptable. I hope we can discuss this more later. I also believe that codec quality sets the upper limit for generation quality: even though your GPT-like model performs well, the output quality still depends heavily on the codec. Good luck!

I am not sure if this is helpful, but how about using in-the-wild speech conversation datasets, e.g. https://mmai.io/datasets/, or another conversation dataset, https://github.com/keonlee9420/DailyTalk? I think this kind of dataset would push the codec to capture conversation-like speech. You know, GPT-like models can generate conversations, as in the demos of SoundStream or Spectron. That would be a good contribution, I guess.
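If it helps, here is a minimal sketch of slicing conversational recordings (e.g. DailyTalk) into fixed-length clips for codec training. The directory names, sample rate, and clip length are hypothetical placeholders, not the dataset's actual layout:

```python
# Minimal sketch: cut conversational wavs into fixed-length clips
# for codec training. Paths and parameters are assumptions.

import pathlib
import librosa
import soundfile as sf

SRC = pathlib.Path("dailytalk/wavs")   # hypothetical input directory
DST = pathlib.Path("codec_clips")      # hypothetical output directory
DST.mkdir(exist_ok=True)

SR = 16_000          # assumed codec sample rate
CLIP_SEC = 10        # assumed training clip length in seconds

for wav in SRC.glob("*.wav"):
    audio, _ = librosa.load(wav, sr=SR)        # load and resample to SR
    clip_len = SR * CLIP_SEC
    # Non-overlapping clips; the trailing remainder is dropped.
    for i in range(0, len(audio) - clip_len + 1, clip_len):
        out = DST / f"{wav.stem}_{i // clip_len:04d}.wav"
        sf.write(out, audio[i:i + clip_len], SR)
```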

yangdongchao commented 1 week ago

> @yangdongchao Thank you for the confirmation. We can reproduce TTS-only models with just 4 GPUs, and the quality is acceptable. I hope we can discuss this more later. I also believe that codec quality sets the upper limit for generation quality: even though your GPT-like model performs well, the output quality still depends heavily on the codec. Good luck!
>
> I am not sure if this is helpful, but how about using in-the-wild speech conversation datasets, e.g. https://mmai.io/datasets/, or another conversation dataset, https://github.com/keonlee9420/DailyTalk? I think this kind of dataset would push the codec to capture conversation-like speech. You know, GPT-like models can generate conversations, as in the demos of SoundStream or Spectron. That would be a good contribution, I guess.

Thank you very much! I think using conversation-like speech is a good idea.