yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

About the pretrained model on LibriTTS #20

Closed zerovcv closed 8 months ago

zerovcv commented 9 months ago

Hello, thank you for sharing such interesting work! May I know what your plans are for sharing the pretrained model trained on LibriTTS? Thanks! :)

sh-lee-prml commented 9 months ago

@yl4579

Thanks for the nice work.

I'm looking for good zero-shot TTS models, and I hope to use StyleTTS 2 (LibriTTS version) as a baseline model.

I think your model is much better than YourTTS or any recent LLM-based model in zero-shot TTS, so I hope to compare against it as the state-of-the-art model.

I would appreciate it if you could share your plan for the LibriTTS model 😃

WendongGan commented 9 months ago

@yl4579 Hi, yl4579. I'm looking forward to the pretrained model on LibriTTS. I'd be grateful for it!

yl4579 commented 9 months ago

Thank you for your interest in this work. I’m currently attending a few workshops and I’ll be busy with midterm exams after that, so the model release will be delayed a little bit. Expect it to arrive some time in early November.

gigadunk commented 8 months ago

Hey there @yl4579 , I'm hoping to test out the LibriTTS-trained StyleTTS 2 as well. Would it be possible to release the training config for the multi-speaker version so I can try and train it on my own machines before you release the pre-trained models?

P.S. Thanks for the work so far; the LJ version sounds very good.

yl4579 commented 8 months ago

@gigadunk Here's the configuration that I am currently using to train the LibriTTS model. The dataset is very big so the epochs need to be adjusted according to the quality of the model.

log_dir: "Models/LibriTTS"
first_stage_path: "first_stage.pth"
save_freq: 1
log_interval: 10
device: "cuda"
epochs_1st: 50 # number of epochs for first stage training (pre-training)
epochs_2nd: 30 # number of epochs for second stage training (joint training)
batch_size: 16
max_len: 300 # maximum number of frames
pretrained_model: "Models/LibriTTS/epoch_2nd_00005.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'

data_params:
  train_data: "Data/train_list.txt"
  val_data: "Data/val_list.txt"
  root_path: ""
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true

  dim_in: 64 
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size

  dropout: 0.2

  # config for decoder
  decoder: 
      type: 'hifigan' # either hifigan or istftnet
      resblock_kernel_sizes: [3,7,11]
      upsample_rates :  [10,5,3,2]
      upsample_initial_channel: 512
      resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
      upsample_kernel_sizes: [20,10,6,4]

  # speech language model config
  slm:
      model: 'microsoft/wavlm-base-plus'
      sr: 16000 # sampling rate of SLM
      hidden: 768 # hidden size of SLM
      nlayers: 13 # number of layers of SLM
      initial_channel: 64 # initial channels of SLM discriminator head

  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2

    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0

loss_params:
    lambda_mel: 5. # mel reconstruction loss
    lambda_gen: 1. # generator loss
    lambda_slm: 1. # slm feature matching loss

    lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
    lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
    TMA_epoch: 4 # TMA starting epoch (1st stage)

    lambda_F0: 1. # F0 reconstruction loss (2nd stage)
    lambda_norm: 1. # norm reconstruction loss (2nd stage)
    lambda_dur: 1. # duration loss (2nd stage)
    lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
    lambda_sty: 1. # style reconstruction loss (2nd stage)
    lambda_diff: 1. # score matching loss (2nd stage)

    diff_epoch: 10 # style diffusion starting epoch (2nd stage)
    joint_epoch: 15 # joint training starting epoch (2nd stage)

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.00001 # learning rate for acoustic modules

slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 20 # update the discriminator every this iterations of generator update
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling
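
One relation in the configuration above is easy to verify by hand: the product of the decoder's upsample_rates (10 x 5 x 3 x 2) equals the spectrogram hop_length (300). Whether the repository enforces this is not stated in the thread, so the following is only a sketch of a sanity check one might run before editing these values. The helper name check_upsampling is hypothetical and not part of StyleTTS2.

```python
from math import prod

# Values copied from the config above.
hop_length = 300
upsample_rates = [10, 5, 3, 2]


def check_upsampling(rates, hop):
    """Return True if the product of the upsampling rates equals the hop length."""
    return prod(rates) == hop


total = prod(upsample_rates)
print(total, check_upsampling(upsample_rates, hop_length))
```

If upsample_rates or hop_length is changed, rerunning a check like this is a cheap way to notice that the two no longer agree.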

yl4579 commented 8 months ago

Unfortunately, somebody found a mistake in the training code and informed me via email. I checked the quality of the model, and it sounds worse than the demo because of the mistake (wrong reference audio). I have fixed the mistake but I have to retrain the model from scratch. Now expect the model to be released by mid-November. Sorry for the delay. I believe the current code should produce working models now.

yl4579 commented 8 months ago

The current model quality is not bad though, so if you need the model now, you can download it here: https://drive.google.com/drive/folders/1ApqjyugCzr4EN2NFXa5Opfr3qcoapUPV?usp=sharing. I can probably get a better model in a couple of weeks.

You only need to change the following code to run the inference:


def compute_style(path):
    wave, sr = librosa.load(path, sr=24000)  # librosa resamples to 24 kHz on load
    audio, index = librosa.effects.trim(wave, top_db=30)  # strip leading/trailing silence
    if sr != 24000:
        # keyword arguments are required by recent librosa releases
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(device)

    with torch.no_grad():
        ref_s = model.style_encoder(mel_tensor.unsqueeze(1))
        ref_p = model.predictor_encoder(mel_tensor.unsqueeze(1))

    return torch.cat([ref_s, ref_p], dim=1)

reference = "Demo/1221-135767-0014.wav"
ref_s = compute_style(reference)

with torch.no_grad():
    input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
    text_mask = length_to_mask(input_lengths).to(device)

    t_en = model.text_encoder(tokens, input_lengths, text_mask)
    bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
    d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

    s_pred = sampler(noise=torch.randn((1, 256)).unsqueeze(1).to(device),
                     embedding=bert_dur,
                     embedding_scale=1,
                     features=ref_s,  # reference from the same speaker as the embedding
                     num_steps=10).squeeze(1)

    s = s_pred[:, 128:]
    ref = s_pred[:, :128]

    alpha = 0.3 # how much you want to mix the sampled style with the original style (acoustic part)
    beta = 0.7 # how much you want to mix the sampled style with the original style (prosodic part)
    ref = alpha * ref + (1 - alpha)  * ref_s[:, :128] 
    s = beta * s + (1 - beta)  * ref_s[:, 128:]

    d = model.predictor.text_encoder(d_en, 
                                     s, input_lengths, text_mask)

    x, _ = model.predictor.lstm(d)
    duration = model.predictor.duration_proj(x)

    duration = torch.sigmoid(duration).sum(axis=-1)
    pred_dur = torch.round(duration.squeeze()).clamp(min=1)

    pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
    c_frame = 0
    for i in range(pred_aln_trg.size(0)):
        pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
        c_frame += int(pred_dur[i].data)

    # encode prosody
    en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
    if model_params.decoder.type == "hifigan": # fix weird misalignment for hifigan decoder
        asr_new = torch.zeros_like(en)
        asr_new[:, :, 0] = en[:, :, 0]
        asr_new[:, :, 1:] = en[:, :, 0:-1]
        en = asr_new

    F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

    asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
    if model_params.decoder.type == "hifigan": # fix weird misalignment for hifigan decoder
        asr_new = torch.zeros_like(asr)
        asr_new[:, :, 0] = asr[:, :, 0]
        asr_new[:, :, 1:] = asr[:, :, 0:-1]
        asr = asr_new

    out = model.decoder(asr, 
                            F0_pred, N_pred, ref.squeeze().unsqueeze(0))

A formal inference demo, including reproduction of the audio on the demo page, will come later once the better model is done.

yl4579 commented 8 months ago

I tested it on the colab and it works, so if you want to try it now you can use this link: https://colab.research.google.com/drive/1VENAg_TeKj5a1NYMJTSrbNLDlcIT30Sh

GUUser91 commented 8 months ago

@yl4579 Since you're retraining the model from scratch, have you thought of implementing the Descript Audio Codec? With the Descript Audio Codec, you can compress 44.1 kHz audio into discrete codes at a low 8 kbps bitrate. This universal model works on all domains (speech, environment, music, etc.), making it widely applicable to generative modeling of all audio. It can be used as a drop-in replacement for EnCodec for all audio language modeling applications (such as AudioLMs, MusicLMs, MusicGen, etc.). Repo: https://github.com/descriptinc/descript-audio-codec Demo: https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a68f814d5

yl4579 commented 8 months ago

@GUUser91 Since StyleTTS2 is already an end-to-end model, meaning it generates waveforms directly from the text, I don’t see any use of this codec anywhere unless we don’t do end-to-end training, which may degrade the quality (though it could be faster in training).

sh-lee-prml commented 8 months ago

@yl4579 Thanks for sharing the checkpoint. Now, I'm synthesizing the speech with your model! 😀

However, I have some problems when I feed a very short reference audio to the style encoder, because of the fixed filter size of your style encoder. I have a simple trick for inference with short reference audio: replicating the audio before feeding it to the style encoder. This may resolve the issue.

Could you recommend a proper value of alpha and beta for LibriTTS samples?

            # the closer the alpha is to 0, the less diversity, but the more similar it is to the reference speaker in timbre
            alpha = 0.3 # how much you want to mix the sampled style with the original style (acoustic part)

            # the closer the beta is to 0, the less diversity, but the more similar it is to the reference speaker in prosody
            beta = 0.5 # how much you want to mix the sampled style with the original style (prosodic part)

            ref = alpha * ref + (1 - alpha)  * ref_s[:, :128]
            s = beta * s + (1 - beta)  * ref_s[:, 128:]

In addition, I entirely agree that an audio codec is not required for your model. The audio quality of StyleTTS 2 is already better than recently proposed two-stage models such as Vall-E or NaturalSpeech 2 in terms of naturalness. Using an audio codec would decrease the audio quality.

I also found a typo on your demo page:

https://styletts2.github.io/#libri

I could not find these samples in LibriTTS. It seems that they are from LibriSpeech, not LibriTTS. 😉

Thanks again

yl4579 commented 8 months ago

@sh-lee-prml Thanks for your appreciation of this work! As for your problem of inference using very short clips (less than one second), you probably have to repeat the reference until it reaches the minimum length. This could lead to problems, as there was no such data during training (clips shorter than 1 second were excluded). If you do need to do inference with very short references, you may have to retrain or fine-tune the model with shorter clips, possibly repeating them to accommodate the receptive field of the style encoder.
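
The repetition trick can be sketched as follows. This is only an illustration on a plain Python list: the function name repeat_to_min_length is hypothetical, and the one-second target at 24 kHz is an assumption based on the training cutoff mentioned above, not a value taken from the repository.

```python
def repeat_to_min_length(samples, min_len):
    """Return samples unchanged if already long enough; otherwise tile them
    end to end and truncate to exactly min_len items."""
    if not samples:
        raise ValueError("reference audio is empty")
    if len(samples) >= min_len:
        return list(samples)
    repeats = -(-min_len // len(samples))  # ceiling division
    return (list(samples) * repeats)[:min_len]


SAMPLE_RATE = 24000  # sampling rate from the config (assumed target: one second)
short_clip = [0.1, -0.2, 0.3]  # toy stand-in for a clip shorter than one second
padded = repeat_to_min_length(short_clip, min_len=SAMPLE_RATE)
print(len(padded), padded[:6])
```

With a real waveform array the same idea applies, but as noted above, repeated references were not seen during training, so quality may still suffer.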

The alpha and beta are just factors that control diversity and similarity. The higher the alpha and beta, the closer it is to the sampled style (and thus less similar to the actual reference style), and vice versa. It depends on the use case, i.e., do you want more diverse samples with the same text, or do you want more similar samples to the reference? Values ranging from 0.3 to 0.5 balance diversity and similarity.
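
The mixing itself is a linear interpolation, which a toy example makes concrete. The sketch below uses plain Python lists instead of tensors, and mix_style is a hypothetical helper that mirrors the expression `alpha * ref + (1 - alpha) * ref_s[:, :128]` from the inference code; the two-element vectors are made-up stand-ins for real style vectors.

```python
def mix_style(sampled, reference, weight):
    """Linear interpolation: weight=0 returns the reference style,
    weight=1 returns the sampled style."""
    return [weight * s + (1 - weight) * r for s, r in zip(sampled, reference)]


sampled = [1.0, 2.0]    # toy stand-in for the diffusion-sampled style
reference = [0.0, 0.0]  # toy stand-in for the style encoded from the reference audio

print(mix_style(sampled, reference, 0.0))  # identical to the reference style
print(mix_style(sampled, reference, 0.5))  # halfway between the two
print(mix_style(sampled, reference, 1.0))  # identical to the sampled style
```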

The demo page indeed shows samples from LibriSpeech, because these were reference samples taken from the Vall-E and NaturalSpeech 2 demo pages. LibriTTS here refers to the model (i.e., the model trained on LibriTTS), not the testing dataset. I have marked this difference in the paper: Table 1 shows that the testing set for zero-shot experiments was LibriSpeech instead of LibriTTS.

pawngrubber commented 8 months ago

Since you are retraining... Would you be open to sharing the model weights in checkpoints instead of waiting for it to be fully trained? @yl4579

yl4579 commented 8 months ago

@pawngrubber There are multiple stages, and it is quite inconvenient to upload the checkpoints, as each one of them is around 2 GB.

yeeyou commented 8 months ago

Hi, I just tried your new Colab, and it works great.

But I ran into a problem when I changed the text to Chinese:

text = "如果这不是您看到的号码,请检查计算机上的代理设置。"

It shows: phonemizer:words count mismatch on 100.0% of the lines (1/1)

I did some searching on Google and tried switching to pinyin. It can read the text, but not well; I can't understand it without looking at the text.

text = "rú guǒ zhè bú shì nín kàn dào de hào mǎ, qǐng jiǎnchá jìsuànjī shàng de dàilǐ shèzhì."

This is the generated audio: https://soundcloud.com/wooden-tank/chinese

Can you give me some tips for getting a better result? Thank you.

GUUser91 commented 8 months ago

@yl4579 How come I get this error message if I try to fine-tune the LibriTTS model?

accelerate launch train_first.py --config_path ./Configs/config.yml
The following values were not passed to accelerate launch and had defaults used instead:
    --num_processes was set to a value of 1
    --num_machines was set to a value of 1
    --mixed_precision was set to a value of 'no'
    --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
bert loaded
Traceback (most recent call last):
  File "/home/user/StyleTTS2/train_first.py", line 444, in <module>
    main()
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/StyleTTS2/train_first.py", line 152, in main
    model, optimizer, start_epoch, iters = load_checkpoint(model, optimizer, config['pretrained_model'],
  File "/home/user/StyleTTS2/models.py", line 702, in load_checkpoint
    model[key].load_state_dict(params[key])
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CustomAlbert:
Missing key(s) in state_dict: "embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "embeddings.token_type_embeddings.weight", "embeddings.LayerNorm.weight", "embeddings.LayerNorm.bias", "encoder.embedding_hidden_mapping_in.weight", "encoder.embedding_hidden_mapping_in.bias", "encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.weight",
"encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.query.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.query.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.key.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.key.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.value.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.value.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.dense.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.dense.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.bias", "encoder.albert_layer_groups.0.albert_layers.0.ffn.weight", "encoder.albert_layer_groups.0.albert_layers.0.ffn.bias", "encoder.albert_layer_groups.0.albert_layers.0.ffn_output.weight", "encoder.albert_layer_groups.0.albert_layers.0.ffn_output.bias", "pooler.weight", "pooler.bias". 
Unexpected key(s) in state_dict: "module.embeddings.word_embeddings.weight", "module.embeddings.position_embeddings.weight", "module.embeddings.token_type_embeddings.weight", "module.embeddings.LayerNorm.weight", "module.embeddings.LayerNorm.bias", "module.encoder.embedding_hidden_mapping_in.weight", "module.encoder.embedding_hidden_mapping_in.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.query.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.query.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.key.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.key.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.value.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.value.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.dense.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.dense.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn_output.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn_output.bias", "module.pooler.weight", "module.pooler.bias". 
Traceback (most recent call last):
  File "/home/user/StyleTTS2/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 994, in launch_command
    simple_launcher(args)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 636, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/StyleTTS2/venv/bin/python3.10', 'train_first.py', '--config_path', './Configs/config.yml']' returned non-zero exit status 1.

WendongGan commented 8 months ago

@yeeyou As far as I know, this pretrained model does not support Chinese or pinyin; it only supports English phonemes. I hope this helps.

WendongGan commented 8 months ago

@GUUser91 I have started training by loading the second-stage pretrained model. I found your mistake: this pretrained model is a second-stage checkpoint, but your command runs the first stage (train_first.py). I hope this helps.
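
In GUUser91's traceback, the missing and unexpected keys are the same parameter names, differing only by a "module." prefix on the checkpoint side. That prefix is the naming produced when a model is saved from inside a wrapper such as torch.nn.DataParallel. In this thread the fix was to run the second-stage script. Purely to illustrate the mismatch, and not as something the repository does, a hypothetical helper that strips the prefix from a dictionary of keys could look like this:

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key.
    Keys that do not start with the prefix are kept unchanged."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy stand-in for one checkpoint entry; real values would be tensors.
saved = {"module.pooler.weight": 1, "module.pooler.bias": 2}
print(strip_module_prefix(saved))
```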

GUUser91 commented 8 months ago

@WendongGan Thank you. Now I get an OOM error message. I set the batch size to 4 and lowered batch_percentage to 0.3. I have a GeForce RTX 4060 Ti 16GB. Edit: Never mind. I fixed it by lowering max_len to 60.
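
For a sense of scale, max_len can be converted to seconds of audio. This sketch assumes that max_len counts mel frames at the hop_length and sr from the config shared earlier in the thread (the config comment only says "maximum number of frames"). frames_to_seconds is a hypothetical helper, not a repository function.

```python
SAMPLE_RATE = 24000  # preprocess_params.sr from the config
HOP_LENGTH = 300     # preprocess_params.spect_params.hop_length from the config


def frames_to_seconds(n_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Convert a mel-frame count to seconds of audio."""
    return n_frames * hop_length / sample_rate


for name, frames in [("max_len (shared config)", 300), ("max_len (lowered)", 60)]:
    print(f"{name}: {frames} frames = {frames_to_seconds(frames):.2f} s")
```

Under that assumption, lowering max_len from 300 to 60 shortens each training segment from 3.75 s to 0.75 s, which reduces memory use but may also leave the model with much less context per sample.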

GUUser91 commented 8 months ago

I'm using the fine-tuned model with yl4579's StyleTTS2_libritts_debug.ipynb file. I get this error message after setting the notebook to use the fine-tuned model:

ValueError                                Traceback (most recent call last)
Cell In[55], line 40
     36 duration = torch.sigmoid(duration).sum(axis=-1)
     37 pred_dur = torch.round(duration.squeeze()).clamp(min=1)
---> 40 pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
     41 c_frame = 0
     42 for i in range(pred_aln_trg.size(0)):

ValueError: cannot convert float NaN to integer

I set the finetuned model settings to this:

params_whole = torch.load("LibriTTS_debug/epoch_2nd_00004.pth", map_location='cpu')
params = params_whole['net']

Edit: Never mind again. I fixed the problem by reinstalling StyleTTS2 and fine-tuning the model again. The only thing I edited in yl4579's StyleTTS2_libritts_debug.ipynb file was the location of the .pth file. After that, I no longer got the error message.

params_whole = torch.load("/home/user/StyleTTS2/Models/LibriTTS/epoch_2nd_00004.pth", map_location='cpu')
params = params_whole['net']

eschmidbauer commented 8 months ago

@GUUser91 I'm having the same issue trying to fine-tune. Can you share the StyleTTS2_libritts_debug.ipynb notebook? I can't seem to find it.

GUUser91 commented 8 months ago

@eschmidbauer https://github.com/yl4579/StyleTTS2/issues/20#issuecomment-1791916800

eschmidbauer commented 8 months ago

I was just looking at that. It appears to demonstrate inference using reference audio, not fine-tuning of the model.

GUUser91 commented 8 months ago

@eschmidbauer I edited the config file (https://github.com/yl4579/StyleTTS2/issues/20#issuecomment-1785877594) and then fine-tuned the model with:

python train_second.py --config_path ./Configs/config.yml

eschmidbauer commented 8 months ago

thanks @GUUser91 that is what i was looking for!! appreciate the help

yl4579 commented 8 months ago

I think I now have a better model, and I will upload it to the repo. The quality is very close to the demo now. There are still some small weird issues at the end of the speech for some samples (I'm not sure what causes them). I'm investigating, and maybe I can have a better model without these problems later on.

gigadunk commented 8 months ago

Heya @yl4579. I'm a little confused, I thought you trained the model used for the StyleTTS2 Demo.

Why are you retraining it if you already have the model used for the demo?

I'm probably missing context or something.

Thanks :)

yl4579 commented 8 months ago

@gigadunk The reason is that I want to test whether the code in the repo is working. I want to reproduce the models I used for the paper with the cleaned code, as it can be a little different from the code I had for the experiments (with Jupyter notebooks). See #1 for more context. The quality is very similar now, except for a weird pulse (only for some references) at the end of the speech, which can be easily fixed with [:-50] (removing the last 50 samples). I believe this is a minor issue and may be caused by some preprocessing in meldataset.py that might be a little different from what I used for the LibriTTS dataset in the paper.
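
The [:-50] fix is ordinary slicing on the synthesized waveform. The sketch below uses a plain list as a stand-in for the output array and assumes the output is sampled at 24 kHz, as in the config shared earlier in the thread; trim_tail is a hypothetical name, not a repository function.

```python
SAMPLE_RATE = 24000  # assumed output sampling rate, taken from the shared config


def trim_tail(wave, n_samples=50):
    """Drop the last n_samples items; equivalent to wave[:-n_samples] for n_samples > 0."""
    return wave[:-n_samples] if n_samples > 0 else wave


wave = list(range(1000))  # toy stand-in for a synthesized waveform
trimmed = trim_tail(wave)
print(len(trimmed), 50 / SAMPLE_RATE * 1000)  # remaining samples, trimmed duration in ms
```

At 24 kHz, 50 samples correspond to roughly 2 ms, so the trim should be inaudible apart from removing the pulse.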

gigadunk commented 8 months ago

@yl4579 Thanks for the clarification :)

I'm hyped to play around with the new model, when will it be on the repo?

yl4579 commented 8 months ago

@gigadunk I'm making the demo now. It should be up today.

yl4579 commented 8 months ago

I have pushed the demo notebook and uploaded the model. This issue should now be complete. If you find other problems with the model, please open new issues.

eschmidbauer commented 8 months ago

could you share the checkpoint for fine-tuning?

yl4579 commented 8 months ago

@eschmidbauer It’s in the README now.

eschmidbauer commented 8 months ago

OK, I tried fine-tuning with the LibriTTS model and I get a missing-state error. Perhaps the config I'm using no longer works with that pretrained model.