yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

max_len doesn't crop samples properly #290

Open FormMe opened 4 weeks ago

FormMe commented 4 weeks ago

Hi. It seems that max_len doesn't work properly.

mel_len should be computed from mel_input_length_all.max(), not mel_input_length_all.min(). As written, the minimum length in the batch effectively becomes the crop length, so max_len only takes effect when the minimum length in the batch is greater than max_len:

mel_input_length_all = accelerator.gather(mel_input_length)  # for balanced load
mel_len = min([int(mel_input_length_all.min().item() / 2 - 1), max_len // 2])
mel_len_st = int(mel_input_length.min().item() / 2 - 1)

For example, if max_len == 400 and the mel lengths in the batch range from 92 to 600, this formula gives mel_len = min(92, 400) = 92 (roughly, ignoring the / 2 - 1 scaling). Thus, all samples in the clipped batch end up with a maximum length of 92, because we do

gt.append(mels[bib, :, (random_start * 2) : ((random_start + mel_len) * 2)])

It means that we always train on samples cropped to the minimum length in the batch. Here are some example shapes:

print(mels.shape, gt.shape, st.shape, wav.shape)
torch.Size([32, 80, 662]) torch.Size([32, 80, 92]) torch.Size([32, 80, 96]) torch.Size([32, 27600])
torch.Size([32, 80, 434]) torch.Size([32, 80, 92]) torch.Size([32, 80, 92]) torch.Size([32, 27600])
torch.Size([32, 80, 844]) torch.Size([32, 80, 92]) torch.Size([32, 80, 92]) torch.Size([32, 27600])

27600 / 300 = 92 (300 is the hop length)
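
To make the arithmetic concrete, here is a minimal sketch of the current formula with hypothetical batch lengths (94 is picked so the result lines up with the gt length of 92 printed above); the / 2 - 1 scaling comes from the quoted training code:

import torch

max_len = 400
mel_input_length_all = torch.tensor([662, 434, 94])  # hypothetical per-sample mel lengths

# current formula: the batch *minimum* caps the crop length, not max_len
mel_len = min(int(mel_input_length_all.min().item() / 2 - 1), max_len // 2)
print(mel_len * 2)  # 92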

Also, for samples shorter than max_len, random_start crops off the beginning and padding is used instead. Moreover, we skip many samples:

if gt.shape[-1] < 80:
   continue

To fix it, we should crop only the samples whose length is greater than max_len.
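
One way to implement that, as a rough per-sample sketch (hypothetical helper, not the repository's code):

import random
import torch

def crop_to_max_len(mel: torch.Tensor, max_len: int) -> torch.Tensor:
    # mel: (n_mels, T); keep short samples whole, randomly crop only the long ones
    T = mel.shape[-1]
    if T <= max_len:
        return mel
    start = random.randint(0, T - max_len)
    return mel[:, start:start + max_len]

Short samples would then only need padding up to the batch length instead of being cropped down to the batch minimum.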

Have I found a bug, or am I misunderstanding something?

FormMe commented 3 weeks ago

Hello @yl4579. Could you explain it, please?

martinambrus commented 2 weeks ago

Hello @yl4579. Could you explain it, please?

I'm afraid @yl4579 left the community around the time this repository was last updated. He, and most of the initial contributors, no longer respond to questions. You might find some answers if you also post this in Discussions; however, the community largely seems to have moved on to their own versions of StyleTTS2, including some commercial forks that don't contribute back, which is a shame really. But that's the state of things right now.

Respaired commented 2 weeks ago

I think there's nothing wrong with the code itself; it's working as intended. The purpose of that line is probably not to take the biggest sample in the batch, but rather to ensure no sample in your batch goes beyond that threshold. The author's previous works also behave in a similar way.

I've tried it the other way, padding/trimming all samples so they're always at max_len. As one would expect, this drastically increases memory consumption if you use a max_len close to 10 seconds of audio. Unless I'm confused about what you're trying to say, it's not a good idea to do that.
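
For reference, a minimal sketch of what I mean by padding/trimming (hypothetical helper, not the repository's code):

import torch
import torch.nn.functional as F

def pad_or_trim(mel: torch.Tensor, max_len: int) -> torch.Tensor:
    # mel: (n_mels, T) -> (n_mels, max_len): trim long samples, zero-pad short ones
    T = mel.shape[-1]
    if T >= max_len:
        return mel[:, :max_len]
    return F.pad(mel, (0, max_len - T))

Every batch then carries max_len frames per sample regardless of content, so memory use downstream grows accordingly.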