yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License
4.98k stars · 421 forks

Inference latency #288

Open Ananya21162 opened 1 month ago

Ananya21162 commented 1 month ago

I was trying out the model with a 439-character input and saw 5–6 s of average latency on the LibriTTS dataset. Is there a way to reduce the latency? (The decoder takes the most time.) Also, I fine-tuned the model on a few samples from a new speaker and saw latency increase by a further 600–700 ms; is this expected? Should latency increase with a larger (English-only) dataset? Similarly, if we add more languages, will inference latency increase?

Respaired commented 1 month ago

The HiFi-GAN decoder is simply larger and heavier.

You need to either find another checkpoint pretrained with the iSTFTNet decoder or train a new model yourself from scratch. You could also fine-tune on top of the LJSpeech checkpoint, which is not recommended, but one of my friends managed to get reasonable results that way.

As for your other questions: no, the dataset has no impact on latency. Only your model's parameters matter, and the size of the decoder matters most.
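To verify where the time actually goes, it helps to time each component of the pipeline separately. A minimal, framework-agnostic sketch (the `components` dict of callables is a hypothetical stand-in for the model's submodule forward passes, not part of the repo):

```python
import time

def time_component(fn, *args, n_runs=10, warmup=2):
    """Average wall-clock time of fn(*args) over n_runs, after warmup calls."""
    for _ in range(warmup):
        fn(*args)  # warm up caches / lazy initialization before measuring
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(*args)
    return (time.perf_counter() - start) / n_runs

# Usage (hypothetical): components = {"decoder": decoder_fn, "predictor": ...}
# timings = {name: time_component(fn, x) for name, fn in components.items()}
```

Sorting the resulting timings should show the decoder dominating; if some other component does, the bottleneck is elsewhere.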

Ananya21162 commented 1 month ago

Thanks for your reply. We have two models: one trained on LibriTTS-R (360 + 100 hrs) and the other fine-tuned from it with 20-minute audio samples for multiple speakers. We kept max_len at 100 for the first and 400 for the second. The two models show an average latency difference of nearly 1.5 s. Could this parameter be the cause? What would be the ideal value?

Respaired commented 1 month ago

You're welcome. As I said, your choice of max_len and the dataset shouldn't matter; only the decoder has a large impact.

Ananya21162 commented 1 month ago

Understood. But in our experiment we checked the decoder size for both models mentioned above. It was the same for both: 217 MB. Still, the two models have a latency difference of 1.5 seconds. Do you know of any other possible cause? In fact, we compared all the model components and they are identical across both:

bert size: 201,359,360 bits | 25.17 MB
bert_encoder size: 12,599,296 bits | 1.57 MB
predictor size: 518,227,584 bits | 64.78 MB
decoder size: 1,737,263,744 bits | 217.16 MB
text_encoder size: 179,404,800 bits | 22.43 MB
predictor_encoder size: 444,186,016 bits | 55.52 MB
style_encoder size: 444,186,016 bits | 55.52 MB
diffusion size: 1,620,926,464 bits | 202.62 MB
text_aligner size: 251,790,464 bits | 31.47 MB
pitch_extractor size: 168,037,024 bits | 21.00 MB
mpd size: 1,315,384,640 bits | 164.42 MB
msd size: 8,988,864 bits | 1.12 MB
wd size: 37,556,288 bits | 4.69 MB
Total model size: 6,939,910,560 bits | 867.49 MB
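For reference, the MB figures above follow directly from the bit counts: float32 weights are 32 bits per parameter, and the table uses decimal megabytes (bits / 8 / 1e6). A torch-free sketch of the arithmetic (with a real model you would instead sum `p.numel()` over `module.parameters()`):

```python
def size_in_bits(num_params, bits_per_param=32):
    # float32 weights: 32 bits per parameter
    return num_params * bits_per_param

def bits_to_mb(bits):
    # decimal megabytes, matching the table above
    return bits / 8 / 1e6

# e.g. the decoder's 1,737,263,744 bits correspond to ~54.3M parameters:
decoder_params = 1_737_263_744 // 32
print(f"decoder: {size_in_bits(decoder_params)} bits | "
      f"{bits_to_mb(size_in_bits(decoder_params)):.2f} MB")
```

Identical sizes here only prove the architectures match; they say nothing about runtime conditions such as generated audio length or hardware state.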
Ananya21162 commented 1 month ago

Also, one model was trained from scratch and the other was fine-tuned. Would that make any difference? The number of parameters and the model size are the same for both :/

Respaired commented 2 weeks ago

Unless you change the decoder, or use very short samples with LFInference, there shouldn't be a whole lot of latency overhead.
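When comparing the latency of two checkpoints with identical architectures, it is also worth ruling out measurement noise: the first call pays one-time initialization costs, and single-run timings vary. A hedged sketch of a fairer A/B comparison (`synthesize(text)` is a hypothetical callable wrapping the full inference pipeline; the name is not part of the repo):

```python
import statistics
import time

def benchmark(synthesize, text, n_runs=20, warmup=3):
    """Median end-to-end latency in seconds for synthesize(text), after warmup runs."""
    for _ in range(warmup):
        synthesize(text)  # absorb one-time initialization costs
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        synthesize(text)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

# Usage (hypothetical): run both checkpoints on the same text and compare
# benchmark(model_a_synthesize, text) vs benchmark(model_b_synthesize, text)
```

Using the median rather than the mean keeps a single slow outlier run from skewing the comparison.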