Closed zero-a-z closed 1 year ago
@yl4579
Thanks for the nice work.
I'm looking for good zero-shot TTS models, and I hope to use StyleTTS 2 (LibriTTS version) as a baseline model.
I think your model is much better than YourTTS or any recent LLM-based model in zero-shot TTS, so I hope to compare against it as the state-of-the-art model.
I would appreciate it if you could share your plan for releasing the LibriTTS model 😃
@yl4579 Hi, yl4579. I'm looking forward to the pretrained model on LibriTTS. Many thanks!
Thank you for your interest in this work. I’m currently attending a few workshops and I’ll be busy with midterm exams after that, so the model release will be delayed a little bit. Expect it to arrive some time in early November.
Hey there @yl4579 , I'm hoping to test out the LibriTTS-trained StyleTTS 2 as well. Would it be possible to release the training config for the multi-speaker version so I can try and train it on my own machines before you release the pre-trained models?
P.S. Thanks for the work so far; the LJ version sounds very good.
@gigadunk Here's the configuration that I am currently using to train the LibriTTS model. The dataset is very big so the epochs need to be adjusted according to the quality of the model.
log_dir: "Models/LibriTTS"
first_stage_path: "first_stage.pth"
save_freq: 1
log_interval: 10
device: "cuda"
epochs_1st: 50 # number of epochs for first stage training (pre-training)
epochs_2nd: 30 # number of epochs for second stage training (joint training)
batch_size: 16
max_len: 300 # maximum number of frames
pretrained_model: "Models/LibriTTS/epoch_2nd_00005.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters
F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'
data_params:
  train_data: "Data/train_list.txt"
  val_data: "Data/val_list.txt"
  root_path: ""
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true

  dim_in: 64
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size

  dropout: 0.2

  # config for decoder
  decoder:
    type: 'hifigan' # either hifigan or istftnet
    resblock_kernel_sizes: [3,7,11]
    upsample_rates : [10,5,3,2]
    upsample_initial_channel: 512
    resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
    upsample_kernel_sizes: [20,10,6,4]

  # speech language model config
  slm:
    model: 'microsoft/wavlm-base-plus'
    sr: 16000 # sampling rate of SLM
    hidden: 768 # hidden size of SLM
    nlayers: 13 # number of layers of SLM
    initial_channel: 64 # initial channels of SLM discriminator head

  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2

    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0

loss_params:
  lambda_mel: 5. # mel reconstruction loss
  lambda_gen: 1. # generator loss
  lambda_slm: 1. # slm feature matching loss

  lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
  lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
  TMA_epoch: 4 # TMA starting epoch (1st stage)

  lambda_F0: 1. # F0 reconstruction loss (2nd stage)
  lambda_norm: 1. # norm reconstruction loss (2nd stage)
  lambda_dur: 1. # duration loss (2nd stage)
  lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
  lambda_sty: 1. # style reconstruction loss (2nd stage)
  lambda_diff: 1. # score matching loss (2nd stage)

  diff_epoch: 10 # style diffusion starting epoch (2nd stage)
  joint_epoch: 15 # joint training starting epoch (2nd stage)

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.00001 # learning rate for acoustic modules

slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 20 # update the discriminator every this iterations of generator update
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling
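As a rough aid for tuning the length settings in this config, the sketch below converts frame counts to seconds. It assumes that max_len and the slmadv min_len/max_len count mel frames spaced hop_length samples apart at the sr given in preprocess_params; the helper name frames_to_seconds is made up for illustration and is not part of the repo.

```python
def frames_to_seconds(n_frames: int, hop_length: int = 300, sr: int = 24000) -> float:
    """Approximate audio duration covered by n_frames mel frames.

    Assumes one frame per hop_length samples, as in preprocess_params above.
    """
    return n_frames * hop_length / sr


# Values taken from the config above.
for name, frames in [("max_len", 300), ("slmadv min_len", 400), ("slmadv max_len", 500)]:
    print(f"{name}: {frames} frames ~ {frames_to_seconds(frames):.2f} s")
```

Under that assumption, lowering max_len shortens the training clips proportionally, which is one way to trade context length for memory.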
Unfortunately, somebody found a mistake in the training code and informed me via email. I checked the quality of the model, and it sounds worse than the demo because of the mistake (wrong reference audio). I have fixed the mistake but I have to retrain the model from scratch. Now expect the model to be released by mid-November. Sorry for the delay. I believe the current code should produce working models now.
The current model quality is not bad though, so if you need the model now, you can download it here: https://drive.google.com/drive/folders/1ApqjyugCzr4EN2NFXa5Opfr3qcoapUPV?usp=sharing, but I can probably get a better model a couple of weeks later.
You only need to change the following code to run the inference:
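# librosa and torch must be imported; preprocess, model and device come from the existing notebook.
def compute_style(path):
    # load the reference at 24 kHz and trim leading/trailing silence
    wave, sr = librosa.load(path, sr=24000)
    audio, index = librosa.effects.trim(wave, top_db=30)
    if sr != 24000:
        # recent librosa versions require keyword arguments here
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(device)

    with torch.no_grad():
        ref_s = model.style_encoder(mel_tensor.unsqueeze(1))      # acoustic style
        ref_p = model.predictor_encoder(mel_tensor.unsqueeze(1))  # prosodic style

    # concatenated style vector: [acoustic (128), prosodic (128)]
    return torch.cat([ref_s, ref_p], dim=1)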
reference = "Demo/1221-135767-0014.wav"
ref_s = compute_style(reference)

with torch.no_grad():
    input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
    text_mask = length_to_mask(input_lengths).to(device)

    t_en = model.text_encoder(tokens, input_lengths, text_mask)
    bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
    d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

    s_pred = sampler(noise = torch.randn((1, 256)).unsqueeze(1).to(device),
                     embedding=bert_dur,
                     embedding_scale=1,
                     features=ref_s, # reference from the same speaker as the embedding
                     num_steps=10).squeeze(1)

    s = s_pred[:, 128:]
    ref = s_pred[:, :128]

    alpha = 0.3 # how much you want to mix the sampled style with the original style (acoustic part)
    beta = 0.7 # how much you want to mix the sampled style with the original style (prosodic part)

    ref = alpha * ref + (1 - alpha) * ref_s[:, :128]
    s = beta * s + (1 - beta) * ref_s[:, 128:]

    d = model.predictor.text_encoder(d_en,
                                     s, input_lengths, text_mask)

    x, _ = model.predictor.lstm(d)
    duration = model.predictor.duration_proj(x)

    duration = torch.sigmoid(duration).sum(axis=-1)
    pred_dur = torch.round(duration.squeeze()).clamp(min=1)

    pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
    c_frame = 0
    for i in range(pred_aln_trg.size(0)):
        pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
        c_frame += int(pred_dur[i].data)

    # encode prosody
    en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
    if model_params.decoder.type == "hifigan": # fix weird misalignment for hifigan decoder
        asr_new = torch.zeros_like(en)
        asr_new[:, :, 0] = en[:, :, 0]
        asr_new[:, :, 1:] = en[:, :, 0:-1]
        en = asr_new

    F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

    asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
    if model_params.decoder.type == "hifigan": # fix weird misalignment for hifigan decoder
        asr_new = torch.zeros_like(asr)
        asr_new[:, :, 0] = asr[:, :, 0]
        asr_new[:, :, 1:] = asr[:, :, 0:-1]
        asr = asr_new

    out = model.decoder(asr,
                        F0_pred, N_pred, ref.squeeze().unsqueeze(0))
Formal inference demo including reproducing audio on the demo page will come later once the better model is done.
I tested it on the colab and it works, so if you want to try it now you can use this link: https://colab.research.google.com/drive/1VENAg_TeKj5a1NYMJTSrbNLDlcIT30Sh
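For readers following the inference code above, here is a small NumPy sketch of two steps it performs: turning predicted per-token durations into a 0/1 alignment matrix, and the one-frame shift applied for the hifigan decoder. It is a toy illustration only; the function names durations_to_alignment and shift_right_one_frame are hypothetical and the real code operates on torch tensors with a batch dimension.

```python
import numpy as np


def durations_to_alignment(durations):
    """Build a (num_tokens, total_frames) 0/1 matrix from per-token frame counts."""
    durations = np.asarray(durations, dtype=int)
    total = int(durations.sum())
    aln = np.zeros((len(durations), total), dtype=np.float32)
    start = 0
    for i, d in enumerate(durations):
        aln[i, start:start + d] = 1.0  # token i owns frames [start, start + d)
        start += d
    return aln


def shift_right_one_frame(x):
    """Delay features by one frame along the last axis, keeping frame 0 in place."""
    out = np.zeros_like(x)
    out[..., 0] = x[..., 0]
    out[..., 1:] = x[..., :-1]
    return out


# Toy example: 3 tokens with 2-dimensional features, lasting 2, 1 and 3 frames.
feats = np.array([[1.0, 2.0, 3.0],
                  [10.0, 20.0, 30.0]], dtype=np.float32)  # (dim, tokens)
aln = durations_to_alignment([2, 1, 3])                   # (tokens, frames)
expanded = feats @ aln                                    # (dim, frames)
shifted = shift_right_one_frame(expanded)
print(expanded[0])
print(shifted[0])
```

Multiplying token-level features by the alignment matrix repeats each token's feature vector for as many frames as its predicted duration, which is what the two matrix products in the snippet above do for the prosody and text encodings.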
@yl4579 Since you're restarting the model from scratch, have you thought of implementing Descript Audio Codec? With Descript Audio Codec, you can compress 44.1 KHz audio into discrete codes at a low 8 kbps bitrate. This universal model works on all domains (speech, environment, music, etc.), making it widely applicable to generative modeling of all audio. It can be used as a drop-in replacement for EnCodec for all audio language modeling applications (such as AudioLMs, MusicLMs, MusicGen, etc.) https://github.com/descriptinc/descript-audio-codec Demo: https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a68f814d5
@GUUser91 Since StyleTTS2 is already an end-to-end model, meaning it generates waveforms directly from the text, I don’t see any use of this codec anywhere unless we don’t do end-to-end training, which may degrade the quality (though it could be faster in training).
@yl4579 Thanks for sharing the checkpoint. Now, I'm synthesizing the speech with your model! 😀
However, I run into problems when I feed a very short reference audio to the style encoder, because of the fixed filter size of your style encoder. A simple trick I use to infer with short reference audio is to replicate the audio before feeding it to the style encoder. This may resolve the issue.
Could you recommend a proper value of alpha and beta for LibriTTS samples?
# the closer the alpha is to 0, the less diversity, but the more similar it is to the reference speaker in timbre
alpha = 0.3 # how much you want to mix the sampled style with the original style (acoustic part)
# the closer the beta is to 0, the less diversity, but the more similar it is to the reference speaker in prosody
beta = 0.5 # how much you want to mix the sampled style with the original style (prosodic part)
ref = alpha * ref + (1 - alpha) * ref_s[:, :128]
s = beta * s + (1 - beta) * ref_s[:, 128:]
In addition, I entirely agree that an audio codec is not required for your model. The audio quality of StyleTTS 2 is already better than recently proposed two-stage models such as Vall-E or NaturalSpeech 2 in terms of naturalness. Using an audio codec would decrease the audio quality.
I also found a typo on your demo page.
https://styletts2.github.io/#libri
I could not find these samples in LibriTTS. It seems that these samples are from LibriSpeech, not LibriTTS.😉
Thanks again
@sh-lee-prml Thanks for your appreciation of this work! As for your problem of inference using very short clips (less than one second), you probably have to repeat the reference until it reaches the minimum length, and it could lead to potential problems as there is no such data during training (clips shorter than 1 second were excluded during training). If you do need to do inference with very short references, you may have to retrain or fine-tune the model with shorter clips, possibly with repeating to accommodate the receptive field of the style encoder.
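The repetition trick discussed above could look roughly like the sketch below. The function name repeat_to_min_length and the one-second default are assumptions for illustration (one second is the training cutoff mentioned in this thread; the true minimum depends on the style encoder's receptive field), and as noted above the model never saw such repeated clips in training, so quality may suffer.

```python
import numpy as np


def repeat_to_min_length(wave, sr=24000, min_seconds=1.0):
    """Tile a short reference clip until it reaches min_seconds of audio.

    min_seconds=1.0 is an assumed threshold, not a value taken from the code base.
    """
    wave = np.asarray(wave)
    if len(wave) == 0:
        raise ValueError("reference audio is empty")
    min_samples = int(round(min_seconds * sr))
    if len(wave) >= min_samples:
        return wave  # already long enough, leave untouched
    reps = -(-min_samples // len(wave))  # ceiling division
    return np.tile(wave, reps)[:min_samples]
```

The result could then be passed through the same mel preprocessing as any other reference clip before calling the style encoder.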
The alpha and beta are just factors that control diversity and similarity. The higher the alpha and beta, the closer it is to the sampled style (and thus less similar to the actual reference style), and vice versa. It depends on the use case, i.e., do you want more diverse samples with the same text, or do you want more similar samples to the reference? Values ranging from 0.3 to 0.5 balance diversity and similarity.
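To make the role of alpha and beta concrete, here is a small NumPy sketch of the same linear mix used in the inference snippet, with the limiting cases checked numerically. The function name mix_style is hypothetical; the 128/128 split follows the style_dim in the config and the slicing in the code above.

```python
import numpy as np


def mix_style(sampled, reference, alpha=0.3, beta=0.7, split=128):
    """Blend sampled and reference style vectors of shape (batch, 2 * split).

    The first `split` dims are treated as the acoustic part (weighted by alpha),
    the rest as the prosodic part (weighted by beta).
    """
    sampled = np.asarray(sampled, dtype=np.float32)
    reference = np.asarray(reference, dtype=np.float32)
    acoustic = alpha * sampled[:, :split] + (1 - alpha) * reference[:, :split]
    prosodic = beta * sampled[:, split:] + (1 - beta) * reference[:, split:]
    return acoustic, prosodic


sampled = np.ones((1, 256))     # stand-in for the diffusion sampler output
reference = np.zeros((1, 256))  # stand-in for compute_style(reference)

ac0, pr0 = mix_style(sampled, reference, alpha=0.0, beta=0.0)  # pure reference style
ac1, pr1 = mix_style(sampled, reference, alpha=1.0, beta=1.0)  # pure sampled style
```

With alpha = beta = 0 the output equals the reference style, and with alpha = beta = 1 it equals the sampled style; values in between interpolate linearly.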
The demo page indeed shows samples from LibriSpeech, because these were reference samples taken from the Vall-E and NaturalSpeech 2 demo pages. LibriTTS here refers to the model (i.e., the model trained on LibriTTS), not the testing dataset. I have marked this difference in the paper. Table 1 shows that the testing set for zero-shot experiments was LibriSpeech instead of LibriTTS.
Since you are retraining... Would you be open to sharing the model weights in checkpoints instead of waiting for it to be fully trained? @yl4579
@pawngrubber There are multiple stages, and it is quite inconvenient to upload the checkpoints as each one of them is around 2 GB.
Hi, I just tried your new Colab and it works great.
But I ran into a problem when I tried to change the text to Chinese:
text = "如果这不是您看到的号码,请检查计算机上的代理设置。"
It shows
phonemizer:words count mismatch on 100.0% of the lines (1/1)
I did some searching on Google and tried switching to pinyin. It can read the text, but not well; I can't understand the audio without looking at the text.
text = "rú guǒ zhè bú shì nín kàn dào de hào mǎ, qǐng jiǎnchá jìsuànjī shàng de dàilǐ shèzhì."
This is the generated audio:
https://soundcloud.com/wooden-tank/chinese
Can you give me some tips for getting a better result? Thank you.
@yl4579 How come I get this error message if I try to finetune the LibriTTS model?
accelerate launch train_first.py --config_path ./Configs/config.yml
The following values were not passed to accelerate launch and had defaults used instead:
	--num_processes was set to a value of 1
	--num_machines was set to a value of 1
	--mixed_precision was set to a value of 'no'
	--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
bert loaded
Traceback (most recent call last):
  File "/home/user/StyleTTS2/train_first.py", line 444, in <module>
    main()
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/StyleTTS2/train_first.py", line 152, in main
    model, optimizer, start_epoch, iters = load_checkpoint(model, optimizer, config['pretrained_model'],
  File "/home/user/StyleTTS2/models.py", line 702, in load_checkpoint
    model[key].load_state_dict(params[key])
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CustomAlbert:
	Missing key(s) in state_dict: "embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "embeddings.token_type_embeddings.weight", "embeddings.LayerNorm.weight", "embeddings.LayerNorm.bias", "encoder.embedding_hidden_mapping_in.weight", "encoder.embedding_hidden_mapping_in.bias", "encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.weight", "encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.query.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.query.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.key.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.key.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.value.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.value.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.dense.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.dense.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.bias", "encoder.albert_layer_groups.0.albert_layers.0.ffn.weight", "encoder.albert_layer_groups.0.albert_layers.0.ffn.bias", "encoder.albert_layer_groups.0.albert_layers.0.ffn_output.weight", "encoder.albert_layer_groups.0.albert_layers.0.ffn_output.bias", "pooler.weight", "pooler.bias".
	Unexpected key(s) in state_dict: "module.embeddings.word_embeddings.weight", "module.embeddings.position_embeddings.weight", "module.embeddings.token_type_embeddings.weight", "module.embeddings.LayerNorm.weight", "module.embeddings.LayerNorm.bias", "module.encoder.embedding_hidden_mapping_in.weight", "module.encoder.embedding_hidden_mapping_in.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.query.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.query.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.key.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.key.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.value.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.value.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.dense.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.dense.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn_output.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn_output.bias", "module.pooler.weight", "module.pooler.bias".
Traceback (most recent call last):
  File "/home/user/StyleTTS2/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 994, in launch_command
    simple_launcher(args)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 636, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/StyleTTS2/venv/bin/python3.10', 'train_first.py', '--config_path', './Configs/config.yml']' returned non-zero exit status 1.
As far as I know, this pretrained model does not support Chinese or pinyin input; it only supports English phonemes. I hope this helps.
I have started training by loading the stage 2 pretrained model. I found your mistake: this pretrained model is from stage 2, but your command (train_first.py) runs stage 1. I hope this helps.
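Separately from the stage mismatch pointed out above, the error output shows that every unexpected key carries a "module." prefix, which is what wrappers such as torch.nn.DataParallel add when a wrapped model is saved. The sketch below shows a generic way to strip such a prefix from a state dict before loading. It is not the fix used in this thread, and the function name strip_module_prefix is made up for illustration.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Toy stand-in for a checkpoint entry; real values would be tensors.
saved = OrderedDict([("module.pooler.weight", 1), ("module.pooler.bias", 2)])
print(list(strip_module_prefix(saved).keys()))
```

Whether this is needed depends on how the loading script was launched and whether the model is wrapped at load time, so treat it as a diagnostic aid rather than a required step.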
@WendongGan Thank you. Now I get an OOM error message. I set the batch size to 4 and batch_percentage down to 0.3. I have a GeForce RTX 4060 Ti 16GB. Edit: Never mind. I figured it out by lowering max_len down to 60.
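The memory-related changes described above can be kept separate from the base config with a small override helper. This is only a sketch: it assumes the YAML config has already been loaded into a nested dict, the helper name apply_overrides is hypothetical, and the values are the ones reported in this comment for a 16 GB GPU, not a general recommendation.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with nested keys from overrides applied."""
    merged = copy.deepcopy(config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)  # recurse into sub-sections
        else:
            merged[key] = value
    return merged


# Subset of the config posted earlier in this thread.
base = {
    "batch_size": 16,
    "max_len": 300,
    "slmadv_params": {"batch_percentage": 0.5, "iter": 20},
}

# Settings reported above for avoiding OOM on a 16 GB card.
low_memory = apply_overrides(base, {
    "batch_size": 4,
    "max_len": 60,
    "slmadv_params": {"batch_percentage": 0.3},
})
print(low_memory)
```

Keeping the overrides in one place makes it easier to see which settings differ from the author's configuration when comparing results.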
I'm using the finetuned model from yl4579's StyleTTS2_libritts_debug.ipynb file. I get this error message after setting it to use the finetune model
ValueError                                Traceback (most recent call last)
Cell In[55], line 40
     36 duration = torch.sigmoid(duration).sum(axis=-1)
     37 pred_dur = torch.round(duration.squeeze()).clamp(min=1)
---> 40 pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
     41 c_frame = 0
     42 for i in range(pred_aln_trg.size(0)):

ValueError: cannot convert float NaN to integer
I set the finetuned model settings to this:
params_whole = torch.load("LibriTTS_debug/epoch_2nd_00004.pth", map_location='cpu')
params = params_whole['net']
Edit: Never mind again. I fixed the problem by reinstalling StyleTTS2 and finetuning the model again. The only thing I edited in yl4579's StyleTTS2_libritts_debug.ipynb file was the location of the .pth file. After that I no longer got the error message.
params_whole = torch.load("/home/user/StyleTTS2/Models/LibriTTS/epoch_2nd_00004.pth", map_location='cpu')
params = params_whole['net']
@GUUser91 I'm having the same issue trying to fine-tune. Can you share the StyleTTS2_libritts_debug.ipynb notebook? I can't seem to find it.
I was just looking at that; it appears to demonstrate inference using reference audio, not fine-tuning of the model.
@eschmidbauer I edited the config file (https://github.com/yl4579/StyleTTS2/issues/20#issuecomment-1785877594), then I finetuned the model:
python train_second.py --config_path ./Configs/config.yml
thanks @GUUser91 that is what i was looking for!! appreciate the help
I think I now have a better model, and I will upload it to the repo. The quality is very close to the demo now. There are still some small weird issues at the end of the output for some samples (not sure what causes these). I'm trying to investigate the issue, and maybe I can have a better model without these problems later on.
Heya @yl4579. I'm a little confused, I thought you trained the model used for the StyleTTS2 Demo.
Why are you retraining it if you already have the model used for the demo?
I'm probably missing context or something.
Thanks :)
@gigadunk The reason is that I want to test whether the code in the repo is working. I want to reproduce the models I used for the paper with the cleaned code, as it can be a little different from the one I had for the experiments (with Jupyter notebooks). See #1 for more context. The quality is very similar now, except that there's a weird pulse (only for some references) at the end of the speech, which can be easily fixed with [:-50] (removing the last 50 samples). I believe this is a minor issue and may be caused by some preprocessing in meldataset.py that might be a little different from the one I used for the LibriTTS dataset for the paper.
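The [:-50] workaround mentioned above amounts to dropping the last 50 samples of the synthesized waveform, which at the 24 kHz rate from the config is about 2 ms of audio. A minimal sketch, assuming the decoder output has been converted to a NumPy array with time on the last axis (the helper name trim_tail is hypothetical):

```python
import numpy as np


def trim_tail(wav, n_samples=50):
    """Drop the last n_samples along the time (last) axis of a waveform array."""
    wav = np.asarray(wav)
    if n_samples <= 0:
        return wav
    return wav[..., :-n_samples]


wav = np.zeros(24000, dtype=np.float32)  # one second of placeholder audio at 24 kHz
trimmed = trim_tail(wav)
print(trimmed.shape, f"{50 / 24000 * 1000:.2f} ms removed")
```

For a 1-D waveform this is equivalent to wav[:-50].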
@yl4579 Thanks for the clarification :)
I'm hyped to play around with the new model, when will it be on the repo?
@gigadunk I'm making the demo now. It should be up today.
I have pushed the demo notebook and uploaded the model. This issue should now be complete. If you find other problems with the model, please open new issues.
could you share the checkpoint for fine-tuning?
@eschmidbauer It’s in the README now.
OK, I tried finetuning with the LibriTTS model and I get a missing-state error. Perhaps the config I'm using no longer works with that pretrained model.
Hello, thank you for sharing such interesting work! May I know what your plans are for sharing the pretrained model trained on LibriTTS? Thanks! :)