yl4579 / StyleTTS

Official Implementation of StyleTTS
MIT License

Why isn't "attention_weight" used in train_first.py? #30

Closed dy2009 closed 1 year ago

dy2009 commented 1 year ago

Why isn't "attention_weight" used in train_first.py? The code uses "alignment" instead. In train_first.py, line 153: ppgs, s2s_pred, s2s_attn_feat = model.text_aligner(mels, mask, texts)

in layer.py: attention_weights = F.softmax(alignment, dim=1)

attention_weights is the result after the softmax, but alignment is not.

yl4579 commented 1 year ago

See https://github.com/yl4579/StyleTTS/issues/9#issuecomment-1543269100

The change was made so the alignment is normalized over the phoneme axis instead of the mel axis. The AuxiliaryASR was trained to align melspectrograms with texts (i.e., its input is melspectrograms and its output is text) because it is an ASR model, while in TTS we want the input to be text and the output to be the melspectrogram. TTS is therefore the reverse problem, but you cannot simply transpose the attention matrix, because the attention in AuxiliaryASR was normalized across the melspectrogram frames instead of the phoneme tokens. The raw alignment is used specifically so it can be renormalized over the correct axis.