yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License
4.96k stars · 417 forks

FineTuning under Windows Issue #79

Closed: FlareP1 closed this issue 12 months ago

FlareP1 commented 12 months ago

Hi, thanks for this amazing TTS system. The inference is the best-quality open-source system I have heard, and it works well and very fast under Windows. However, the fine-tuning script does not appear to work unmodified in the Windows environment. I am trying to get `train_finetune.py` to run locally under Windows. I have made a couple of fixes (below) that have resolved some errors:

1. Python needs to be called with `-Xutf8` to force a UTF-8 locale.
2. In `_load_tensor(self, data)` (around line 142), the path join needs the following update: `osp.join(self.root_path, wave_path).replace("\\", "/")`, to ensure the correct slash is used within the file path when loading wav files.
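For reference, a minimal sketch of the path normalization described in fix 2. The names `root_path` and `wave_path` follow the issue; the helper function itself is illustrative, not the repository's actual code:

```python
import os.path as osp

def to_posix_path(root_path, wave_path):
    """Join a root directory and a relative wav path, then normalize
    Windows backslashes to forward slashes so downstream audio loaders
    that expect POSIX-style separators still resolve the file."""
    return osp.join(root_path, wave_path).replace("\\", "/")

# e.g. to_posix_path("Data", "wavs/LJ045-0051.wav") -> "Data/wavs/LJ045-0051.wav"
```

On Windows, `os.path.join` inserts a backslash separator, so the trailing `.replace` is what keeps the emitted paths consistent across platforms.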

However, now I am stuck with the error below. Does anyone know what this might indicate? I can run the code in a debugger, but I am not familiar enough with Python to understand what is causing this error or what the correct behaviour should be.

Thanks in advance

(venv) C:\Users\xxxx\Documents\StyleTTS2>python -Xutf8 train_finetune.py --config_path ./Configs/config_ft.yml
Some weights of the model checkpoint at microsoft/wavlm-base-plus were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-base-plus and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
bert loaded
bert_encoder loaded
predictor loaded
decoder loaded
text_encoder loaded
predictor_encoder loaded
style_encoder loaded
diffusion loaded
text_aligner loaded
pitch_extractor loaded
mpd loaded
msd loaded
wd loaded
BERT AdamW (
Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.9, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 1e-05
    lr: 1e-05
    max_lr: 2e-05
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.01
)
decoder AdamW (
Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.0, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 0.0001
    lr: 0.0001
    max_lr: 0.0002
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.0001
)
Data/wavs/LJ045-0051.wav
Data/wavs/LJ034-0213.wav
Data/wavs/LJ038-0268.wav
Data/wavs/LJ004-0067.wav
Data/wavs/LJ049-0084.wav
Data/wavs/LJ003-0198.wav
Data/wavs/LJ022-0011.wav
Data/wavs/LJ028-0352.wav
Data/wavs/LJ047-0047.wav
Data/wavs/LJ008-0175.wav
Data/wavs/LJ015-0273.wav
Data/wavs/LJ004-0067.wav
Data/wavs/LJ015-0100.wav
Data/wavs/LJ032-0052.wav
Data/wavs/LJ011-0105.wav
Data/wavs/LJ012-0036.wav
Data/wavs/LJ049-0118.wav
Data/wavs/LJ028-0352.wav
Data/wavs/LJ006-0132.wav
Data/wavs/LJ034-0114.wav
Traceback (most recent call last):
  File "C:\Users\Chris\Documents\StyleTTS2\train_finetune.py", line 707, in <module>
    main()
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\Chris\Documents\StyleTTS2\train_finetune.py", line 396, in main
    y_rec_gt_pred = model.decoder(en, F0_real, N_real, s)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\parallel\data_parallel.py", line 185, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\parallel\data_parallel.py", line 200, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 110, in parallel_apply
    output.reraise()
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\_utils.py", line 694, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 85, in _worker
    output = module(*input, **kwargs)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Chris\Documents\StyleTTS2\Modules\hifigan.py", line 458, in forward
    F0 = self.F0_conv(F0_curve.unsqueeze(1))
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\modules\module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\modules\conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\Chris\Documents\StyleTTS2\venv\lib\site-packages\torch\nn\modules\conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [1, 1, 3], expected input[1, 100, 1] to have 1 channels, but got 100 channels instead
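For readers decoding the final RuntimeError: a Conv1d with weight shape `[1, 1, 3]` (out_channels=1, in_channels=1, kernel_size=3) expects input shaped `[batch, 1, length]`, but this replica received `[1, 100, 1]`, i.e. the channel and length dimensions appear swapped, which can happen when DataParallel splits a tensor across GPUs along an unexpected dimension. A small pure-Python sketch of the channel check PyTorch performs (torch itself is omitted so the example is self-contained):

```python
def conv1d_input_ok(input_shape, weight_shape, groups=1):
    """Mimic PyTorch's Conv1d channel check.

    input_shape:  (batch, in_channels, length)
    weight_shape: (out_channels, in_channels // groups, kernel_size)
    Returns True when the input's channel count matches the weight's.
    """
    _, in_channels, _ = input_shape
    _, weight_in_channels, _ = weight_shape
    return in_channels == weight_in_channels * groups

# The failing call from the traceback: weight [1, 1, 3] vs input [1, 100, 1]
conv1d_input_ok((1, 100, 1), (1, 1, 3))   # False: got 100 channels, expected 1
# The shape the layer wants, with the F0 curve along the length dimension:
conv1d_input_ok((1, 1, 100), (1, 1, 3))   # True
```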
FlareP1 commented 12 months ago

Update: I fixed this issue, which is actually due to the fine-tuning only supporting a single GPU. If I run after `SET CUDA_VISIBLE_DEVICES=0` to force only one GPU, then the fine-tuning does run.
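For anyone hitting the same thing, the equivalent workaround can also be applied from inside the script rather than the shell. `CUDA_VISIBLE_DEVICES` is a standard CUDA environment variable; the important detail (an assumption worth verifying in your own setup) is that it must be set before torch initializes CUDA:

```python
import os

# Must run before the first `import torch` (or at least before any CUDA
# call); once the process has enumerated the GPUs, changing this variable
# has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

This hides all but the first GPU from the process, so `nn.DataParallel` has nothing to split across and the replica shape mismatch above cannot occur.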

yl4579 commented 12 months ago

Isn’t it caused by the batch size, though? The default setting should work for multiple GPUs. I tested the script with 4 A100s.