yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

SLM adversarial training: 3 - 6 seconds in duration? #23

Closed stevenhillis closed 12 months ago

stevenhillis commented 12 months ago

I'm having trouble reconciling the paper and the code when it comes to the min_len and max_len for the slmadv_params. They are set to 400 and 500 respectively here, but the paper states: "For SLM adversarial training, both the ground truth and generated samples were ensured to be 3 to 6 seconds in duration". I'm not sure how exactly to interpret the units on min_len and max_len, but that ratio definitely doesn't line up with 3-6 seconds. Those parameters get used here, where they're halved and compared against the number of mel frames. If that's the correct interpretation, then with the mel transform here giving us ~80 frames per second of 24k audio, I think min_len would be set to 480 and max_len would be set to 960 to match the paper. Is that correct? Can you help clear this up for me?

yl4579 commented 12 months ago

I think your calculations are not correct. Each second is 80 frames, so 6 seconds would be 480 frames instead of 960, and 3 seconds would be 240 frames. Setting it to 400 to 500 corresponds to roughly 5 to 6 seconds, but you can definitely use 3 to 6 seconds too; I have personally found no difference between them.
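For reference, the frames-per-second figure follows directly from the mel hop length. A quick sanity check, assuming the 24 kHz sample rate and a hop length of 300 samples (which gives the 80 frames per second discussed above; adjust if your mel transform differs):

```python
# Convert a duration in seconds to mel-frame counts for slmadv_params.
# Assumes 24 kHz audio and hop_length=300 (80 mel frames per second),
# as discussed in this thread.

SAMPLE_RATE = 24_000
HOP_LENGTH = 300
FRAMES_PER_SECOND = SAMPLE_RATE // HOP_LENGTH  # 80

def seconds_to_frames(seconds: float) -> int:
    """Number of mel frames covering `seconds` of audio."""
    return int(seconds * FRAMES_PER_SECOND)

print(seconds_to_frames(3))  # 240 -> a 3-second min_len
print(seconds_to_frames(6))  # 480 -> a 6-second max_len
print(seconds_to_frames(5))  # 400 -> the repo's default min_len
```

Under these assumptions, the repo default of 400 to 500 frames indeed corresponds to roughly 5 to 6.25 seconds.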

stevenhillis commented 12 months ago

That's very helpful, thank you! I'm still a little confused about why min_len and max_len are being divided by 2 here, though: https://github.com/yl4579/StyleTTS2/blob/3e300817a2433e2c3832703e7973183cbf98effe/Modules/slmadv.py#L95 Can you help me interpret that?

yl4579 commented 12 months ago

That 2 is actually hardcoded. It comes from the n_down of the text aligner being 2, so the length of the aligned phoneme representation is halved relative to the mel sequence. I will fix all the hardcoding later when I get time.
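So min_len and max_len are specified in mel frames, but the length check happens at the aligner's downsampled time resolution, which is why both bounds are divided by 2 first. A rough sketch of that relationship (hypothetical names; the actual logic lives in Modules/slmadv.py):

```python
# Sketch of the slmadv length check (hypothetical helper names).
# The text aligner downsamples the mel sequence by a factor of 2
# (n_down = 2), so min_len/max_len, given in mel frames, are halved
# before comparing against lengths at the aligned resolution.

N_DOWN_FACTOR = 2  # aligner's temporal downsampling factor

def in_length_range(mel_len: int, min_len: int = 400, max_len: int = 500) -> bool:
    """True if a sample's aligned length falls within the configured bounds."""
    aligned_len = mel_len // N_DOWN_FACTOR
    return (min_len // N_DOWN_FACTOR) <= aligned_len <= (max_len // N_DOWN_FACTOR)

# A 5.5-second clip at 80 frames/sec: 440 mel frames -> 220 aligned frames,
# which sits inside the halved [200, 250] window.
print(in_length_range(440))  # True
print(in_length_range(240))  # False: 3 s is below the 400-frame default
```

With the defaults matching the paper's 3-to-6-second range (min_len=240, max_len=480), the same check would pass for the 3-second clip.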