RuntimeError cuDNN when training on custom dataset.

Thank you for this fantastic project!

I tried to train on my custom dataset, then got some weird runtime errors. All setups are the same as the original repo except a custom config file and a custom dataset. The size of the dataset is quite small so I have directly added it into my repo. You can clone my repo if you want: https://github.com/kynthesis/StarGANv2-VNVC (https://github.com/kynthesis/StarGANv2-VNVC/commit/bf969b63c7c7fe6a0b74f7ff20193e9111959641)

On the same Python environment (python 3.8 and3.9), errors only occur on my custom dataset, original dataset is totally fine. num_domains is set 4, and the custom dataset is converted from 16000 Hz to 24000 Hz using Voxengo r8brain Free.

My custom config: config_vivos.yml

log_dir: "Models/VIVOS"
save_freq: 2
device: "cuda"
epochs: 150
batch_size: 5
pretrained_model: ""
load_only_params: false
fp16_run: true

train_data: "vivos/train_list.txt"
val_data: "vivos/val_list.txt"

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00100.pth"

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  dim_in: 64
  style_dim: 64
  latent_dim: 16
  num_domains: 4
  max_conv_dim: 512
  n_repeat: 4
  w_hpf: 0
  F0_channel: 256

loss_params:
  g_loss:
    lambda_sty: 1.
    lambda_cyc: 5.
    lambda_ds: 1.
    lambda_norm: 1.
    lambda_asr: 10.
    lambda_f0: 5.
    lambda_f0_sty: 0.1
    lambda_adv: 2.
    lambda_adv_cls: 0.5
    norm_bias: 0.5
  d_loss:
    lambda_reg: 1.
    lambda_adv_cls: 0.1
    lambda_con_reg: 10.

  adv_cls_epoch: 50
  con_reg_epoch: 30

optimizer_params:
  lr: 0.0001

My custom dataset (converted from 16000 Hz to 24000 Hz) -> google drive ~ vivos lite ~ 200MB

Encountered runtime error

{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 144}
{'max_lr': 2e-06, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 144}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 144}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 144}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 144}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 144}
[train]:   0%|                                                                         | 0/144 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 156, in <module>
    main()
  File "/home/khoa/anaconda3/envs/VN38/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/khoa/anaconda3/envs/VN38/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/khoa/anaconda3/envs/VN38/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/khoa/anaconda3/envs/VN38/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "train.py", line 125, in main
    train_results = trainer._train_epoch()
  File "/home/khoa/dev/StarGANv2-VNVC/trainer.py", line 172, in _train_epoch
    d_loss, d_losses_latent = compute_d_loss(self.model, self.args.d_loss, x_real, y_org, y_trg, z_trg=z_trg, use_adv_cls=use_adv_cls, use_con_reg=use_con_reg)
  File "/home/khoa/dev/StarGANv2-VNVC/losses.py", line 24, in compute_d_loss
    loss_reg = r1_reg(out, x_real)
  File "/home/khoa/dev/StarGANv2-VNVC/losses.py", line 191, in r1_reg
    grad_dout = torch.autograd.grad(
  File "/home/khoa/anaconda3/envs/VN38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 300, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

I really hope that you can take a look at this issue! Khoa

yl4579 / StarGANv2-VC

RuntimeError cuDNN when training on custom dataset. #67