yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Questions about Differentiable Duration Modeling #264

Open RoversCode opened 3 weeks ago

RoversCode commented 3 weeks ago

In the paper, the final formula for the model is given as an equation (posted here as an image, which is not preserved). However, the Gaussian kernel is written as follows in the code:

loc = torch.cumsum(_dur_pred, dim=0) - _dur_pred / 2
h = torch.exp(
    -0.5 * torch.square(t - (l - loc.unsqueeze(-1))) / (1.5) ** 2
)

Why is it t - (l - loc.unsqueeze(-1)) and not t - l - loc.unsqueeze(-1)? The code does not seem to match the formula shown in the paper. I would like to understand the reason and would appreciate any replies. Thank you.
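To make the difference between the two expressions concrete, here is a minimal sketch in plain Python (no torch, so it runs without dependencies). It assumes that t is a frame index running from 0 to l - 1 and that l is the total number of frames; neither definition appears in the snippet above, so both are assumptions. The durations are made-up values, not output from the model. Since t - (l - loc) equals t - l + loc, the kernel as coded is centered at t = l - loc, whereas t - l - loc would center it at t = l + loc.

```python
import math

# Hypothetical predicted durations (in frames) for three tokens; illustrative only.
dur_pred = [2.0, 3.0, 5.0]
l = int(round(sum(dur_pred)))  # ASSUMPTION: l is the total number of frames (10 here)
sigma = 1.5                    # matches the (1.5) ** 2 in the quoted code

# Token centers, mirroring: loc = torch.cumsum(_dur_pred, dim=0) - _dur_pred / 2
loc = []
running = 0.0
for d in dur_pred:
    running += d
    loc.append(running - d / 2)

def kernel(offset):
    """Unnormalized Gaussian with the same form as the quoted code."""
    return math.exp(-0.5 * offset ** 2 / sigma ** 2)

def max_over_frames(center):
    """Largest kernel value over the assumed frame grid t = 0 .. l - 1."""
    return max(kernel(t - center) for t in range(l))

# t - (l - loc) is zero at t = l - loc; t - l - loc is zero at t = l + loc.
centers_as_coded = [l - c for c in loc]
centers_alternative = [l + c for c in loc]

for c_code, c_alt in zip(centers_as_coded, centers_alternative):
    print(
        f"as coded: center={c_code:5.1f} max={max_over_frames(c_code):.3f} | "
        f"alternative: center={c_alt:5.1f} max={max_over_frames(c_alt):.3g}"
    )
```

Under these assumptions, every center of the coded expression falls inside the frame range, while every center of t - l - loc falls beyond the last frame, so that kernel would be small over most of the range. This only shows how the two expressions differ numerically; it does not establish which one the paper's formula intends.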

brthor commented 3 weeks ago

I have also found a few places where the code does not match the paper. This appears to be a research codebase that is likely evolving quickly.