Closed bobo-paopao closed 11 months ago
Can I see your dataset? train list or val list?
I have the same problem when I train from scratch with only one speaker.
@kingkong135 If you only have one speaker, you need to put a speaker number at the very end of each line, like file|text|0.
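For a single-speaker list in wav|text form, the fix described above can be scripted. This is a minimal sketch, assuming one utterance per line with `|` as the field separator and no `|` inside the text itself. The function name `add_speaker_id` and the sample lines are illustrative and not part of StyleTTS2.

```python
def add_speaker_id(lines, speaker_id=0):
    """Return list lines in `file|text|speaker` form.

    Lines that already have three fields are kept unchanged; lines with
    two fields get `speaker_id` appended. Blank lines are dropped.
    """
    fixed = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        if len(fields) == 2:
            fields.append(str(speaker_id))
        elif len(fields) != 3:
            raise ValueError(f"unexpected number of fields: {line!r}")
        fixed.append("|".join(fields))
    return fixed


# Made-up sample lines, for illustration only.
sample = ["a.wav|hello there", "b.wav|already tagged|0"]
print(add_speaker_id(sample))
```

To apply it to a real list, read the file into a list of lines, pass it through the function, and write the result back out with one entry per line.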
Sorry, I have been too busy recently and forgot to reply. The problem was indeed caused by my data format: I had been using data in wav|text format. In addition, I would like to ask whether I can fine-tune with one speaker who has four emotions. Or could I use a multi-speaker dataset in which each speaker has four emotions? Is fine-tuning also supported for such a dataset? @yl4579
@bobo-paopao Yes, you can finetune with multiple emotions.
@bobo-paopao Hi, how did it go after you trained on the ESD dataset? Do the results sound good?
I don’t know why this error is reported. Here are the full error output and the contents of my configuration file.
Traceback (most recent call last):
  File "train_finetune.py", line 709, in <module>
    main()
  File "/home/admin/.local/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/admin/.local/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/admin/.local/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/admin/.local/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_finetune.py", line 260, in main
    for i, batch in enumerate(train_dataloader):
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/admin/.local/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/admin/jinbo/styletts2/StyleTTS2/meldataset.py", line 119, in __getitem__
    ref_data = (self.df[self.df[2] == str(speaker_id)]).sample(n=1).iloc[0].tolist()
  File "/home/admin/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 5858, in sample
    sampled_indices = sample.sample(obj_len, size, replace, weights, rs)
  File "/home/admin/.local/lib/python3.8/site-packages/pandas/core/sample.py", line 151, in sample
    return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype(
  File "mtrand.pyx", line 909, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken
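The last frames indicate that the filter on column 2 matched no rows, so sample(n=1) was asked to draw from an empty frame. Below is a minimal reproduction, assuming (as the meldataset.py line in the traceback suggests) that the speaker column is compared against str(speaker_id). The two-row frame and the integer-valued speaker column are made up for illustration. They show one way the comparison can come up empty, and are not a confirmed description of what the loader does with a wav|text list.

```python
import pandas as pd

# Hypothetical data: the speaker column holds the integer 0,
# not the string "0".
df = pd.DataFrame([["a.wav", "hello", 0], ["b.wav", "world", 0]])

speaker_id = 0
# Comparing an integer column with the string "0" is False for every row,
# so the filtered frame is empty.
matches = df[df[2] == str(speaker_id)]
print(len(matches))

try:
    matches.sample(n=1)  # cannot draw one row from zero rows
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

With lines written as file|text|0, the speaker field is read as text, so a comparison against str(speaker_id) has something to match.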
config_ft.yml

# log_dir: "Models/LJSpeech"
log_dir: "Models/LibriTTS"
save_freq: 5
log_interval: 10
device: "cuda"
epochs: 50 # number of finetuning epoch (1 hour of data)
batch_size: 8
max_len: 400 # maximum number of frames
pretrained_model: "Models/LibriTTS/epochs_2nd_00020.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'

data_params:
  train_data: "Data/ESD_train.txt"
  val_data: "Data/ESD_val.txt"
  root_path: "ESD/wavs"
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true

  dim_in: 64
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size

  dropout: 0.2

  # config for decoder
  decoder:
    type: 'hifigan' # either hifigan or istftnet
    resblock_kernel_sizes: [3,7,11]
    upsample_rates: [10,5,3,2]
    upsample_initial_channel: 512
    resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
    upsample_kernel_sizes: [20,10,6,4]

  # speech language model config
  slm:
    model: 'microsoft/wavlm-base-plus'
    sr: 16000 # sampling rate of SLM
    hidden: 768 # hidden size of SLM
    nlayers: 13 # number of layers of SLM
    initial_channel: 64 # initial channels of SLM discriminator head

  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1

    # transformer config

loss_params:
  lambda_mel: 5. # mel reconstruction loss
  lambda_gen: 1. # generator loss
  lambda_slm: 1. # slm feature matching loss

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.0001 # learning rate for acoustic modules

slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 10 # update the discriminator every this iterations of generator update
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling