yangdongchao / UniAudio

The Open Source Code of UniAudio
http://dongchaoyang.top/UniAudio_demo/

Smaller model does NOT work for TTS task #12

Open anonymoussky opened 9 months ago

anonymoussky commented 9 months ago

I trained a smaller model for the TTS task only, as the TTS recipe instructs. However, it cannot generate correct TTS audio output (the generated audio content does not match the text). Have you ever tried smaller models? That way, we can make sure the whole codebase is correct before running a big LLM. Thanks

Smaller model configuration: n_layer = 8, n_head = 8, n_embd = 768

Task: TTS only on the whole LibriTTS dataset

Single machine with 8 GPU cards, 500 epochs

Recipe: same as https://github.com/yangdongchao/UniAudio/tree/main/UniAudio/egs/TTS
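
For scale, a back-of-the-envelope estimate of the non-embedding parameter count of this configuration, using the standard GPT-style approximation of about 12 * n_embd^2 parameters per transformer layer (a rough sketch, not a figure from the UniAudio code):

```python
# Rough non-embedding parameter count for a GPT-style decoder:
# per layer ~4*n_embd^2 (attention projections) + ~8*n_embd^2 (MLP) = 12*n_embd^2.
n_layer, n_head, n_embd = 8, 8, 768
approx_params = 12 * n_layer * n_embd ** 2
print(f"~{approx_params / 1e6:.1f}M non-embedding parameters")  # ~56.6M
```

So the 8-layer model is roughly 57M non-embedding parameters, versus roughly 85M for the 12-layer variant the maintainer mentions below.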

yangdongchao commented 9 months ago

Hi, we have tried a 12-layer transformer model, and it works. I suggest you run inference on the training data to check whether training went correctly. Furthermore, you can try using a seen speaker as the prompt, because LibriTTS does not seem to include enough speakers, so the zero-shot voice cloning ability may be affected. Lastly, if you only want to train the TTS task, you can disable the other tokenizers, such as sing_phone, but I believe this should not make a difference.
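
Regarding the "seen speaker" suggestion, here is a minimal sketch of how one could check whether the inference prompts come from speakers that appear in the training split. LibriTTS utterance IDs have the form speaker_chapter_utterance_segment, so the speaker ID is the first underscore-separated field; the two list files below are hypothetical placeholders for your own train/inference utterance lists:

```python
# Check whether prompt speakers were seen during training (a sketch, not part of the repo).
# LibriTTS utterance IDs look like 5678_43301_000026_000000: the first field is the speaker.
def speaker_of(utt_id: str) -> str:
    return utt_id.split("_")[0]

def speakers_in(path: str) -> set:
    with open(path) as f:
        return {speaker_of(line.strip()) for line in f if line.strip()}

train_speakers = speakers_in("train_utt_ids.txt")    # hypothetical file: one training utterance ID per line
prompt_speakers = speakers_in("prompt_utt_ids.txt")  # hypothetical file: prompt IDs used at inference

unseen = prompt_speakers - train_speakers
print(f"{len(unseen)} of {len(prompt_speakers)} prompt speakers were never seen in training")
```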

anonymoussky commented 9 months ago

Thanks for your suggestion. My goal is not to train the TTS task only; rather, I want a quick verification of the whole codebase using the TTS recipe. I did try inference on the training set, and it does NOT work either. It sounds like random speech, hallucination? The inference process produces three output files:

1) tts_5678_43301_000026_000000_sampling_sample0.wav (this should be the TTS output; it is not correct)
2) tts_6000_55211_000057_000003_input1.wav (the audio prompt)
3) tts_6000_55211_000057_000003_input2.wav (the ground-truth audio?)
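
As a quick way to quantify "the content does not match the text", one could transcribe the generated sample with an off-the-shelf ASR model and compare the transcript with the target sentence. The sketch below uses the openai-whisper package purely as an illustration (it is not part of the UniAudio recipe), with the file name taken from the output listed above:

```python
# Transcribe the generated TTS sample with Whisper (assumes `pip install openai-whisper`
# and ffmpeg are available) and check whether the content matches the target text.
import whisper

model = whisper.load_model("base")
result = model.transcribe("tts_5678_43301_000026_000000_sampling_sample0.wav")
print(result["text"])
# If the transcript shares almost no words with the target sentence, the failure is
# content-level (the model is babbling) rather than an audio-quality issue.
```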

Could you please also try the same smaller model configuration on the TTS task only, training from scratch? I think it should be very fast because you already have the LibriTTS dataset. n_layer = 8, n_head = 8, n_embd = 768