Closed — dy2009 closed this issue 1 year ago
See https://github.com/yl4579/StyleTTS/issues/9#issuecomment-1543269100
The change was to normalize the alignment over the phoneme axis instead of the mel axis. The AuxiliaryASR was trained to align mel-spectrograms with texts (i.e., the input is mel-spectrograms and the output is text) because it is an ASR model, whereas in TTS the input is text and the output is the mel-spectrogram. The latter is the reverse problem, but you cannot simply transpose the attention matrix, because the attention in AuxiliaryASR was normalized across the mel-spectrogram frames rather than across the phoneme tokens. I changed it specifically to renormalize along the correct axis.
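A minimal sketch of that point, with hypothetical shapes (4 phoneme tokens, 6 mel frames) rather than the actual StyleTTS tensors: transposing the ASR-normalized attention does not yield a valid distribution over phonemes, so the normalization axis has to be changed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical raw alignment scores between 4 phoneme tokens and 6 mel frames.
logits = torch.randn(4, 6)  # (phonemes, mel_frames)

# ASR-style attention: each phoneme token attends over mel frames,
# so softmax runs along the mel axis and each row sums to 1.
asr_attn = F.softmax(logits, dim=1)

# For TTS we want the reverse: for each mel frame, a distribution over
# phonemes. Transposing asr_attn alone is wrong, because its columns
# (one per mel frame) do not sum to 1 in general.
col_sums = asr_attn.sum(dim=0)  # generally not all ones

# Normalizing along the phoneme axis instead gives a proper per-frame
# distribution over phonemes (each column of tts_attn sums to 1).
tts_attn = F.softmax(logits, dim=0)
```

This only illustrates the axis issue; the actual renormalization in the repo may operate on the softmax output directly rather than on raw scores.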
Why isn't `attention_weights` used in train_first.py? The code uses `alignment`. In train_first.py, line 153: `ppgs, s2s_pred, s2s_attn_feat = model.text_aligner(mels, mask, texts)`
In layer.py: `attention_weights = F.softmax(alignment, dim=1)`
`attention_weights` is the result after the softmax, but `alignment` is not.
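The distinction being asked about can be sketched as follows: `alignment` holds raw, unbounded attention scores (logits), while `attention_weights` is the same tensor after softmax, so it sums to 1 along the softmax axis. Presumably the training code takes the raw feature so it can apply its own normalization over the phoneme axis, as described above; the shapes here are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical raw attention scores: (batch, mel_frames, phonemes).
alignment = torch.randn(2, 3, 4)

# attention_weights is just the softmax of alignment along dim=1;
# the raw scores are unbounded, the weights sum to 1 along that axis.
attention_weights = F.softmax(alignment, dim=1)

# Starting from the raw scores, one can instead normalize over the
# other axis, which is not recoverable from attention_weights alone.
other_axis = F.softmax(alignment, dim=2)
```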