yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

ValueError: a must be greater than 0 unless no samples are taken #132

Closed bobo-paopao closed 11 months ago

bobo-paopao commented 11 months ago

I don’t know why this error is raised. Here is the full traceback and the contents of my configuration file.

Traceback (most recent call last):
  File "train_finetune.py", line 709, in <module>
    main()
  File "/home/admin/.local/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/admin/.local/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/admin/.local/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/admin/.local/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_finetune.py", line 260, in main
    for i, batch in enumerate(train_dataloader):
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/admin/.local/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/admin/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/admin/jinbo/styletts2/StyleTTS2/meldataset.py", line 119, in __getitem__
    ref_data = (self.df[self.df[2] == str(speaker_id)]).sample(n=1).iloc[0].tolist()
  File "/home/admin/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 5858, in sample
    sampled_indices = sample.sample(obj_len, size, replace, weights, rs)
  File "/home/admin/.local/lib/python3.8/site-packages/pandas/core/sample.py", line 151, in sample
    return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype(
  File "mtrand.pyx", line 909, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken

config_ft.yml

# log_dir: "Models/LJSpeech"
log_dir: "Models/LibriTTS"
save_freq: 5
log_interval: 10
device: "cuda"
epochs: 50 # number of finetuning epoch (1 hour of data)
batch_size: 8
max_len: 400 # maximum number of frames
pretrained_model: "Models/LibriTTS/epochs_2nd_00020.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'

data_params:
  train_data: "Data/ESD_train.txt"
  val_data: "Data/ESD_val.txt"
  root_path: "ESD/wavs"
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true

  dim_in: 64
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size

  dropout: 0.2

  # config for decoder
  decoder:
    type: 'hifigan' # either hifigan or istftnet
    resblock_kernel_sizes: [3,7,11]
    upsample_rates: [10,5,3,2]
    upsample_initial_channel: 512
    resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
    upsample_kernel_sizes: [20,10,6,4]

  # speech language model config
  slm:
    model: 'microsoft/wavlm-base-plus'
    sr: 16000 # sampling rate of SLM
    hidden: 768 # hidden size of SLM
    nlayers: 13 # number of layers of SLM
    initial_channel: 64 # initial channels of SLM discriminator head

  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1

    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2

    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0

loss_params:
  lambda_mel: 5. # mel reconstruction loss
  lambda_gen: 1. # generator loss
  lambda_slm: 1. # slm feature matching loss

  lambda_mono: 1. # monotonic alignment loss (TMA)
  lambda_s2s: 1. # sequence-to-sequence loss (TMA)

  lambda_F0: 1. # F0 reconstruction loss
  lambda_norm: 1. # norm reconstruction loss
  lambda_dur: 1. # duration loss
  lambda_ce: 20. # duration predictor probability output CE loss
  lambda_sty: 1. # style reconstruction loss
  lambda_diff: 1. # score matching loss

  diff_epoch: 10 # style diffusion starting epoch
  joint_epoch: 30 # joint training starting epoch

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.0001 # learning rate for acoustic modules

slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 10 # update the discriminator every this iterations of generator update
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling

yl4579 commented 11 months ago

Can I see your dataset? train list or val list?

kingkong135 commented 11 months ago

Can I see your dataset? train list or val list?

I have the same problem when I train from scratch with only one speaker.

yl4579 commented 11 months ago

@kingkong135 You need to put a speaker number at the very end, like file|text|0, if you only have one speaker.
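For anyone hitting the same error, here is a minimal sketch of how a list in wav|text form could be rewritten to the file|text|0 form described above. The helper names `normalize_line` and `normalize_list` are hypothetical and not part of the StyleTTS2 repository. The sketch also assumes that the text field itself never contains a `|` character.

```python
from typing import Iterable, List


def normalize_line(line: str, default_speaker: str = "0") -> str:
    """Return one list entry in file|text|speaker form.

    Hypothetical helper: appends `default_speaker` when the entry has only
    two fields (file|text) and leaves three-field entries unchanged.
    """
    fields = line.rstrip("\n").split("|")
    if len(fields) == 2:
        # Single-speaker data without a speaker column: add one.
        fields.append(default_speaker)
    elif len(fields) != 3 or not fields[2].strip():
        # Anything else is not a format this sketch knows how to repair.
        raise ValueError(f"unexpected list entry: {line!r}")
    return "|".join(fields)


def normalize_list(lines: Iterable[str]) -> List[str]:
    """Normalize every non-empty entry of a train or validation list."""
    return [normalize_line(line) for line in lines if line.strip()]


# In-memory example; real lists would be read from the files named under
# data_params (train_data and val_data) in the config.
example = ["0001.wav|hello world", "0002.wav|good morning|0"]
normalized = normalize_list(example)
```

Writing the normalized entries back to the train and validation list files, one per line, should give every row the speaker column that the data loader looks up.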

bobo-paopao commented 11 months ago

Sorry, I have been too busy recently and forgot to reply. The problem was indeed caused by my data format: I had been using data in wav|text format. In addition, I would like to ask whether fine-tuning works if I use one speaker with four emotions. Alternatively, I could use a multi-speaker dataset in which each speaker has four emotions. Is fine-tuning on such a dataset also supported? @yl4579
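A minimal sketch of how a wav|text list can lead to the ValueError in the traceback. It assumes the loader fills a missing speaker field with the integer 0; that default is an assumption and is not confirmed in this thread. The lookup expression mirrors line 119 of meldataset.py as shown in the traceback, and the file names are made up for illustration.

```python
import pandas as pd

# Rows parsed from a wav|text list, with the missing speaker field filled
# in as the integer 0 (assumed default, see the note above).
rows = [["0001.wav", "hello", 0], ["0002.wav", "world", 0]]
df = pd.DataFrame(rows)

speaker_id = 0
# Same comparison as in the traceback: an integer column against a string.
matches = df[df[2] == str(speaker_id)]

error_message = None
try:
    # Sampling one row from an empty selection raises ValueError.
    matches.sample(n=1)
except ValueError as exc:
    error_message = str(exc)

# Rows parsed from a file|text|0 list keep the speaker as the string "0",
# so the same comparison finds them.
rows_ok = [["0001.wav", "hello", "0"], ["0002.wav", "world", "0"]]
df_ok = pd.DataFrame(rows_ok)
matches_ok = df_ok[df_ok[2] == str(speaker_id)]
```

Under that assumption, the selection is empty because no integer equals the string "0", which would explain why adding an explicit speaker column resolves the error.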

yl4579 commented 11 months ago

@bobo-paopao Yes, you can finetune with multiple emotions.

SaltedSlark commented 5 months ago

@bobo-paopao Hi, how did it go after you trained on the ESD dataset? Do the results sound good?