Closed — dy2009 closed this issue 1 year ago
See https://github.com/yl4579/StyleTTS/issues/9#issuecomment-1543269100
The change was to normalize the alignment over the phoneme axis instead of the mel axis. The AuxiliaryASR was trained to align mel-spectrograms with texts (i.e., the input is mel-spectrograms and the output is text) because it is an ASR model, whereas in TTS the input is text and the output is the mel-spectrogram. The latter is the reverse problem, but you cannot simply transpose the attention matrix, because the attention in AuxiliaryASR was normalized across the mel-spectrogram frames rather than across the phoneme tokens. I changed it specifically to renormalize along the correct axis.
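A minimal sketch of that point, with hypothetical shapes (4 phoneme tokens, 6 mel frames) rather than the actual StyleTTS tensors: transposing the ASR-normalized attention does not yield a valid distribution over phonemes, so the normalization axis has to be changed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical raw alignment scores between 4 phoneme tokens and 6 mel frames.
logits = torch.randn(4, 6)  # (phonemes, mel_frames)

# ASR-style attention: each phoneme token attends over mel frames,
# so softmax runs along the mel axis and each row sums to 1.
asr_attn = F.softmax(logits, dim=1)

# For TTS we want the reverse: for each mel frame, a distribution over
# phonemes. Transposing asr_attn alone is wrong, because its columns
# (one per mel frame) do not sum to 1 in general.
col_sums = asr_attn.sum(dim=0)  # generally not all ones

# Normalizing along the phoneme axis instead gives a proper per-frame
# distribution over phonemes (each column of tts_attn sums to 1).
tts_attn = F.softmax(logits, dim=0)
```

This only illustrates the axis issue; the actual renormalization in the repo may operate on the softmax output directly rather than on raw scores.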
Why isn't `attention_weights` used in train_first.py? The code uses `alignment`. In train_first.py, line 153: `ppgs, s2s_pred, s2s_attn_feat = model.text_aligner(mels, mask, texts)`
In layer.py: `attention_weights = F.softmax(alignment, dim=1)`
`attention_weights` is the result after the softmax, but `alignment` is not.
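The distinction being asked about can be sketched as follows: `alignment` holds raw, unbounded attention scores (logits), while `attention_weights` is the same tensor after softmax, so it sums to 1 along the softmax axis. Presumably the training code takes the raw feature so it can apply its own normalization over the phoneme axis, as described above; the shapes here are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical raw attention scores: (batch, mel_frames, phonemes).
alignment = torch.randn(2, 3, 4)

# attention_weights is just the softmax of alignment along dim=1;
# the raw scores are unbounded, the weights sum to 1 along that axis.
attention_weights = F.softmax(alignment, dim=1)

# Starting from the raw scores, one can instead normalize over the
# other axis, which is not recoverable from attention_weights alone.
other_axis = F.softmax(alignment, dim=2)
```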