yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Cannot Convert float NaN to integer #234

Closed SimonDemarty closed 6 months ago

SimonDemarty commented 6 months ago

Hello

Thank you for this great model!

Here is an issue I faced when running inference with a model I fine-tuned from LibriTTS:

Error Message

----> [8](vscode-notebook-cell:?execution_count=15&line=8) wav = inference(text2read, ref_s, alpha=0.3, beta=0.7, diffusion_steps=5, embedding_scale=1)
---> [41](vscode-notebook-cell:?execution_count=12&line=41) pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))

ValueError: cannot convert float NaN to integer
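For context, this ValueError is plain Python behavior: `int()` refuses to convert a NaN float, which is exactly what `int(pred_dur.sum().data)` runs into once `pred_dur` contains NaN. A minimal reproduction, independent of the notebook:

```python
import math

# int() raises ValueError on NaN, which is what
# torch.zeros(input_lengths, int(pred_dur.sum().data)) hits when pred_dur is all NaN.
try:
    int(float("nan"))
except ValueError as err:
    print(err)  # cannot convert float NaN to integer

# NaN also propagates through sums, so a single NaN element poisons pred_dur.sum():
assert math.isnan(float("nan") + 1.0)
```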

How did I end up there

  1. I fine-tuned the LibriTTS model with my own data and this config.yml.
  2. I used the notebook to run inference on some checkpoints of the fine-tuned model.
  3. The line wav = inference(text, ref_s, alpha=0.3, beta=0.7, diffusion_steps=5, embedding_scale=1) throws the error above.

Trying to find the error

The error comes from the inference function in the notebook, at the line pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))

pred_dur is:

tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])

Then I went further:

pred_dur is computed from duration, which also looks wrong:

tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]])

duration is computed from x, which is computed from d; both are:

tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]])

d is computed from d_en, s, input_length and text_mask. Some of those variables have suspicious values:

d_en = tensor([[[-1.1971, -1.1575, -1.1993,  ..., -1.5842, -1.8575,  1.3819],
         [ 0.7255,  0.6195,  0.6080,  ...,  1.4167,  0.7423, -1.4919],
         [ 0.8665,  0.7836,  0.5722,  ..., -0.1133, -1.4566, -0.2431],
         ...,
         [ 0.5265,  0.3918,  0.4285,  ...,  0.6058, -1.0996,  0.4579],
         [-0.5145, -0.5324, -0.4254,  ..., -0.4495, -2.1733, -0.9024],
         [ 1.7131,  1.6266,  1.4982,  ...,  0.0907, -0.5698,  0.0803]]])

s = tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan]])

input_length = tensor([92])

text_mask = tensor([[False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False]])

I assumed the error comes either from text_mask or from s. text_mask comes from input_length, which seems fine, so I checked s_pred (used to compute s):

s_pred = tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])

And finally, the input used to compute s_pred is bert_dur, which seems fine:

bert_dur = tensor([[[-1.1328,  1.8791,  0.9180,  ..., -0.7228,  0.4993, -1.1337],
         [-1.1204,  1.6642,  0.9200,  ..., -0.5799,  0.4335, -1.3064],
         [-1.4370,  2.0107,  0.9015,  ..., -0.7810,  0.8929, -1.3102],
         ...,
         [-1.0730,  0.3842, -2.6536,  ..., -0.3925,  0.1857, -0.9404],
         [-1.6558, -1.8690,  1.9781,  ..., -2.3912, -1.8083, -2.5318],
         [-1.2736, -0.9525,  1.6579,  ..., -1.1642, -1.0868, -2.6429]]])
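So the NaNs first appear between bert_dur and s_pred, i.e. inside whatever module computes the style vector. A generic way to find the first layer that emits NaN is to register forward hooks on every leaf module; a sketch under that assumption, not tied to the StyleTTS2 code (the toy model just simulates weights gone bad):

```python
import torch
import torch.nn as nn

def install_nan_hooks(model):
    """Register forward hooks that report any leaf module emitting NaN outputs."""
    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and torch.isnan(out).any():
                print(f"NaN produced by: {name} ({module.__class__.__name__})")
        return hook
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            module.register_forward_hook(make_hook(name))

# Toy model with NaN weights, simulating a badly loaded checkpoint:
toy = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
with torch.no_grad():
    toy[0].weight.fill_(float("nan"))
install_nan_hooks(toy)
_ = toy(torch.ones(1, 4))
```

Running the real model with such hooks would name the first offending submodule directly instead of requiring manual tensor dumps.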

Questions

I wanted to know where the error was coming from, since the values seem fine at the start of inference but become NaN seemingly out of nowhere.

I will continue investigating and will post here if I find the cause. In the meantime, if you spot what I did wrong, please let me know.

Thanks in advance

SimonDemarty commented 6 months ago

I found the issue over the weekend:

I was not loading the model correctly. On the line params_whole = torch.load("path/to/my/checkpoint.pth", map_location='cpu'), the path "path/to/my/checkpoint.pth" was incorrect...
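For anyone hitting the same symptom: a cheap safeguard is to load the checkpoint defensively, failing fast on a missing file and flagging NaN weights before inference ever runs. A sketch (the `load_checkpoint` helper is hypothetical, and a real StyleTTS2 checkpoint may nest its state dicts under extra keys):

```python
import os
import torch

def load_checkpoint(path):
    """Load a checkpoint defensively: fail fast on a bad path, warn about NaN weights."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"checkpoint not found: {path}")
    params_whole = torch.load(path, map_location="cpu")
    # Flag NaN parameters, which would propagate to s_pred / pred_dur at inference time.
    # (Only checks top-level tensors; nested state dicts would need a recursive walk.)
    for key, value in (params_whole.items() if isinstance(params_whole, dict) else []):
        if torch.is_tensor(value) and torch.isnan(value).any():
            print(f"warning: NaN values in parameter {key}")
    return params_whole
```

This turns a silent NaN cascade deep in inference into an immediate, readable error at load time.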