yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Finetune Error Message #48

Closed GUUser91 closed 12 months ago

GUUser91 commented 12 months ago

I get this error message when I try to finetune. I set batch_size to 12 and max_len to 14. I'm using torch-2.1.1, torchaudio-2.1.1, and torchvision-0.16.1, if that matters.

```
python train_finetune.py --config_path ./Configs/config_ft.yml
Some weights of the model checkpoint at microsoft/wavlm-base-plus were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-base-plus and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
bert loaded
bert_encoder loaded
predictor loaded
decoder loaded
text_encoder loaded
predictor_encoder loaded
style_encoder loaded
text_aligner loaded
pitch_extractor loaded
mpd loaded
msd loaded
wd loaded
BERT AdamW (
Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.9, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 1e-05
    lr: 1e-05
    max_lr: 2e-05
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.01
)
decoder AdamW (
Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.0, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 0.0001
    lr: 0.0001
    max_lr: 0.0002
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.0001
)
```

```
Traceback (most recent call last):
  File "/home/user/StyleTTS2/train_finetune.py", line 714, in <module>
    main()
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/StyleTTS2/train_finetune.py", line 302, in main
    s = model.predictor_encoder(mel.unsqueeze(0).unsqueeze(1))
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 183, in forward
    return self.module(*inputs[0], **module_kwargs[0])
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/StyleTTS2/models.py", line 160, in forward
    h = self.shared(x)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (5 x 4). Kernel size: (5 x 5). Kernel size can't be greater than actual input size
```

yl4579 commented 12 months ago

max_len = 14 means you are only training with 14 * 300 / 24000 = 0.175 seconds of audio, which is not feasible at all. You will need at least max_len = 80, which corresponds to one second of audio, for it to work. Try increasing max_len to at least 80 and decreasing the batch size instead; as long as your batch size is greater than 1 you should be fine. A sketch of the relevant config fields is shown below.
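
For reference, here is a minimal sketch of the two fields in `Configs/config_ft.yml` that this suggestion touches, assuming the 300-sample hop length and 24 kHz sampling rate used in the arithmetic above; all other keys in the file are left as they are, and the exact batch size you can afford depends on your GPU memory:

```yaml
# Configs/config_ft.yml (excerpt) -- only the fields discussed in this thread
batch_size: 2   # keep this greater than 1; lower it if you hit out-of-memory errors
max_len: 80     # frames per training segment; 80 * 300 / 24000 = 1.0 second of audio
```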