anonymoussky opened 9 months ago
I trained a smaller model for the TTS task only, as the TTS recipe instructs. However, it cannot generate correct TTS audio output (the generated audio content does not match the text). Have you ever tried smaller models? That way, we can make sure the whole codebase is correct before running a big LLM. Thanks.
Smaller model configuration: n_layer = 8 n_head = 8 n_embd = 768
Task: TTS only on the whole LibriTTS dataset
Single machine with 8 GPU cards 500 epochs
Recipe: same as https://github.com/yangdongchao/UniAudio/tree/main/UniAudio/egs/TTS
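For context on how small this configuration is, here is a rough parameter-count estimate, assuming a standard GPT-style decoder-only transformer (the vocabulary size below is a placeholder assumption; UniAudio's actual token vocabulary depends on which tokenizers are enabled):

```python
# Rough parameter-count estimate for a GPT-style decoder-only transformer.
# vocab_size is a placeholder assumption, not UniAudio's real vocabulary.
def gpt_param_estimate(n_layer: int, n_embd: int, vocab_size: int = 10_000) -> int:
    # Each transformer block: attention projections (4 * n_embd^2)
    # plus a 4x-wide MLP (8 * n_embd^2), ignoring biases and layer norms.
    per_block = 12 * n_embd ** 2
    # Token embedding table (often weight-tied with the output head).
    embedding = vocab_size * n_embd
    return n_layer * per_block + embedding

params = gpt_param_estimate(n_layer=8, n_embd=768)
print(f"~{params / 1e6:.0f}M parameters")  # roughly 64M under these assumptions
```

So the debug model is on the order of tens of millions of parameters, far smaller than a typical LLM, which is why it makes a convenient correctness check.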
Hi, we have tried a 12-layer transformer model, and it works. I suggest you run inference on the training data to check whether training itself is correct. Furthermore, you can try using a seen speaker as the prompt, because LibriTTS does not seem to include enough speakers, so zero-shot cloning ability may be affected. Lastly, if you only want to train the TTS task, you can disable the other tokenizers, such as sing_phone, though I believe this should not make a difference.
Thanks for your suggestion. My goal is not to train the TTS task only; rather, I want a quick verification of the whole codebase using the TTS recipe. I did try inference on the training set, and it does NOT work either. The output sounds like random speech (hallucination?). The inference process produces three files:
1) tts_5678_43301_000026_000000_sampling_sample0.wav — this should be the TTS output, but it is not correct
2) tts_6000_55211_000057_000003_input1.wav — the audio prompt
3) tts_6000_55211_000057_000003_input2.wav — the ground-truth audio?
Could you please also try training the same smaller model configuration on the TTS task only, from scratch? It should be very fast since you already have the LibriTTS dataset: n_layer = 8, n_head = 8, n_embd = 768.