Open signofthefour opened 1 month ago
Dear team,
Thank you for introducing this amazing work to the world.
Could you please tell me how long it took to train the model? I am reproducing the results with a different setting, so I want to know the training time to make sure everything is correct. By the way, did you track how many training steps UniAudio needs before it generates plausible speech on speech synthesis tasks? Are you still working on releasing a pretrained model?
Yes, I am working on it. I have been training a better codec model, which can raise the upper limit of generation quality. I am focusing on collecting more data and seeking GPUs. For the model training, I train the model for one epoch on more than 10k hours of data.
@christophschuhmann Could you help with GPU?
@yangdongchao Thank you for the confirmation. We can reproduce TTS-only models with just 4 GPUs, and the quality is acceptable. I hope we can discuss more later. I also believe that codec quality sets the upper limit on generation quality: even if your GPT-like model performs well, the output quality still depends heavily on the codec. Good luck!
I am not sure if this is helpful, but how about using an in-the-wild speech conversation dataset (https://mmai.io/datasets/) or another conversation dataset (https://github.com/keonlee9420/DailyTalk)? I think this kind of dataset would push the codec to capture conversation-like speech. A GPT-like model can then generate conversations, as in the demos of SoundStream or Spectron. That would be a good contribution, I guess.
Thank you very much! I think using conversation-like speech is a good idea.