yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

How to improve multilingual singing voice conversion? #48

Closed: MMMMichaelzhang closed this issue 2 years ago

MMMMichaelzhang commented 2 years ago

The performance of speech conversion is good, but the singing conversion is not ideal. If I do singing voice conversion, can you teach me how to use HiFi-GAN? HiFi-GAN also has a pre-trained model with the same parameters. Do you have any plans to upgrade the singing conversion next?

yl4579 commented 2 years ago

Sorry for the late reply; I was pretty busy at the end of my semester. I'm interested in singing conversion with the same architecture, especially with HiFi-GAN, and I already have some ideas about how to improve the results.

HiFi-GAN does significantly improve the results, especially if you fine-tune it. The pre-trained HiFi-GAN likely won't work because the preprocessing is different. I will make my pre-trained HiFi-GAN available soon with the same preprocessing as this repo.

Other ways of improving the results include some tricks for incorporating the F0 features and some architecture changes. I will work more on it and maybe write a new paper later.
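
For context, StarGANv2-VC already conditions on F0 features extracted with a pre-trained JDC network, and singing stresses F0 far more than speech does. As a purely illustrative sketch of inspecting the F0 range of singing material with a generic extractor (pyworld is an assumption here, not this repo's F0 pipeline):

# Illustrative only: this repo extracts F0 with a JDC network, not pyworld.
# Assumes: pip install pyworld soundfile
import numpy as np
import pyworld as pw
import soundfile as sf

wav, fs = sf.read("sample.wav")      # hypothetical mono input file
wav = wav.astype(np.float64)         # pyworld expects float64
f0, t = pw.dio(wav, fs)              # coarse F0 contour
f0 = pw.stonemask(wav, f0, t, fs)    # refined F0
voiced = f0[f0 > 0]
print(f"F0 range: {voiced.min():.1f} to {voiced.max():.1f} Hz")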

MMMMichaelzhang commented 2 years ago

If I want to train HiFi-GAN, the config file is like this:

allow_cache: true
batch_max_steps: 8400
batch_size: 16
config: conf/hifigan.v1.yaml
dev_dumpdir: dump/dev/norm
dev_feats_scp: null
dev_segments: null
dev_wav_scp: null
discriminator_adv_loss_params:
  average_by_discriminators: false
discriminator_grad_norm: -1
discriminator_optimizer_params:
  betas:
  - 0.5
  - 0.9
  lr: 0.0002
  weight_decay: 0.0
discriminator_optimizer_type: Adam
discriminator_params:
  follow_official_norm: true
  period_discriminator_params:
    bias: true
    channels: 32
    downsample_scales:
    - 3
    - 3
    - 3
    - 3
    - 1
    in_channels: 1
    kernel_sizes:
    - 5
    - 3
    max_downsample_channels: 1024
    nonlinear_activation: LeakyReLU
    nonlinear_activation_params:
      negative_slope: 0.1
    out_channels: 1
    use_spectral_norm: false
    use_weight_norm: true
  periods:
  - 2
  - 3
  - 5
  - 7
  - 11
  scale_discriminator_params:
    bias: true
    channels: 128
    downsample_scales:
    - 4
    - 4
    - 4
    - 4
    - 1
    in_channels: 1
    kernel_sizes:
    - 15
    - 41
    - 5
    - 3
    max_downsample_channels: 1024
    max_groups: 16
    nonlinear_activation: LeakyReLU
    nonlinear_activation_params:
      negative_slope: 0.1
    out_channels: 1
  scale_downsample_pooling: AvgPool1d
  scale_downsample_pooling_params:
    kernel_size: 4
    padding: 2
    stride: 2
  scales: 3
discriminator_scheduler_params:
  gamma: 0.5
  milestones:
  - 200000
  - 400000
  - 600000
  - 800000
discriminator_scheduler_type: MultiStepLR
discriminator_train_start_steps: 0
discriminator_type: HiFiGANMultiScaleMultiPeriodDiscriminator
distributed: false
eval_interval_steps: 1000
feat_match_loss_params:
  average_by_discriminators: false
  average_by_layers: false
  include_final_outputs: false
fft_size: 2048
fmax: 7600
fmin: 80
format: hdf5
generator_adv_loss_params:
  average_by_discriminators: false
generator_grad_norm: -1
generator_optimizer_params:
  betas:
  - 0.5
  - 0.9
  lr: 0.0002
  weight_decay: 0.0
generator_optimizer_type: Adam
generator_params:
  bias: true
  channels: 512
  in_channels: 80
  kernel_size: 7
  nonlinear_activation: LeakyReLU
  nonlinear_activation_params:
    negative_slope: 0.1
  out_channels: 1
  resblock_dilations:
  - - 1
    - 3
    - 5
  - - 1
    - 3
    - 5
  - - 1
    - 3
    - 5
  resblock_kernel_sizes:
  - 3
  - 7
  - 11
  upsample_kernel_sizes:
  - 10
  - 10
  - 8
  - 6
  upsample_scales:
  - 5
  - 5
  - 4
  - 3
  use_additional_convs: true
  use_weight_norm: true
generator_scheduler_params:
  gamma: 0.5
  milestones:
  - 200000
  - 400000
  - 600000
  - 800000
generator_scheduler_type: MultiStepLR
generator_train_start_steps: 1
generator_type: HiFiGANGenerator
global_gain_scale: 1.0
hop_size: 300
lambda_adv: 1.0
lambda_aux: 45.0
lambda_feat_match: 2.0
log_interval_steps: 100
mel_loss_params:
  fft_size: 2048
  fmax: 12000
  fmin: 0
  fs: 24000
  hop_size: 300
  log_base: null
  num_mels: 80
  win_length: 1200
  window: hann
num_mels: 80
num_save_intermediate_results: 4
num_workers: 2
outdir: exp/train_nodev_csmsc_hifigan.v1
pin_memory: true
pretrain: ''
rank: 0
remove_short_samples: false
resume: exp/train_nodev_csmsc_hifigan.v1/checkpoint-2370000steps.pkl
sampling_rate: 24000
save_interval_steps: 10000
train_dumpdir: dump/train_nodev/norm
train_feats_scp: null
train_max_steps: 2500000
train_segments: null
train_wav_scp: null
trim_frame_size: 1024
trim_hop_size: 256
trim_silence: false
trim_threshold_in_db: 20
use_feat_match_loss: true
use_mel_loss: true
use_stft_loss: false
verbose: 1
version: 0.5.1
win_length: 1200
window: hann

What should I change to make it suitable for this project? @yl4579
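
For reference, a quick way to see whether the quoted vocoder config already lines up with this repo's preprocessing is to compare the feature-extraction keys against the StarGANv2-VC values given later in this thread (sr 24000, n_fft 2048, hop 300, win 1200, 80 mels). A minimal sketch; the config path is the one quoted above:

# Sketch: cross-check vocoder feature settings against StarGANv2-VC's
# preprocessing (target values taken from this thread).
import yaml

stargan = {"sampling_rate": 24000, "fft_size": 2048, "hop_size": 300,
           "win_length": 1200, "num_mels": 80}

with open("conf/hifigan.v1.yaml") as f:
    vocoder = yaml.safe_load(f)

for key, want in stargan.items():
    got = vocoder.get(key)
    print(f"{key}: vocoder={got} stargan={want}"
          f" [{'OK' if got == want else 'MISMATCH'}]")

On these keys the config above already matches; the remaining difference is the mel normalization discussed below.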

skol101 commented 2 years ago

I'm also using HiFi-GAN, but I decided to fine-tune from the 2.5M-step pre-trained model, which should be quicker, I think. But then there's this question I keep pondering: https://github.com/kan-bayashi/ParallelWaveGAN/issues/373

MMMMichaelzhang commented 2 years ago

The pre-trained model's preprocessing is different. Does it work? @skol101

skol101 commented 2 years ago

Pre-trained HiFi-GAN model: 24 kHz sampling | fmin 80 / fmax 7600 Hz | n_fft 2048 / hop 300 / win 1200

My StarGANv2 config preprocess params:

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

Or are you saying that we cannot use the pre-trained HiFi-GAN model because its dataset was normalized during preprocessing with a different algorithm, instead of how it's done here:

mel_tensor = (torch.log(1e-5 + mel_tensor) - mean) / std

MMMMichaelzhang commented 2 years ago

Yes, I mean the preprocessing is different:

import torch
import torchaudio

# Mel settings used by this repo's preprocessing
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    # mel_tensor = (torch.log(mel_tensor.unsqueeze(0)) - mean) / std
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

Does your HiFi-GAN model work? @skol101
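
Since the normalization above is just an affine map of the log-mel, its inverse is straightforward; as a minimal sketch (derived from the formula quoted above, not code from either repo), this is the de-normalization a vocoder trained on these features would implicitly assume:

import torch

mean, std = -4, 4

def denormalize(mel_tensor):
    # Invert (log(1e-5 + mel) - mean) / std back to a linear mel
    return torch.exp(mel_tensor * std + mean) - 1e-5

A HiFi-GAN checkpoint trained on ParallelWaveGAN-style features (normalized with dataset statistics, as the dump/*/norm directories in the config above suggest) therefore sees inputs on a different scale, which is why fine-tuning or retraining with matched features is needed.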

skol101 commented 2 years ago

Actually, after fine-tuning for 50k steps from the pre-trained model I realised something was amiss and decided to train HiFi-GAN from scratch just on my dataset.
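
One way to keep features consistent when training from scratch is to dump mels with the exact preprocess() quoted above and train the vocoder on those. A minimal sketch, assuming 24 kHz mono wavs and a trainer that can read .npy features (the directory layout and file naming here are assumptions):

# Sketch: dump StarGANv2-VC-style mels for vocoder training.
# preprocess() is the function quoted earlier in this thread.
import glob
import numpy as np
import soundfile as sf

for path in glob.glob("wavs/*.wav"):              # hypothetical layout
    wave, sr = sf.read(path)
    assert sr == 24000, f"expected 24 kHz, got {sr} in {path}"
    mel = preprocess(wave)                        # (1, 80, frames)
    np.save(path.replace(".wav", ".mel.npy"),
            mel.squeeze(0).numpy().T)             # (frames, 80)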

MMMMichaelzhang commented 2 years ago

What changes have you made to train HiFi-GAN, such as the config and the preprocessing? @skol101

Kreevoz commented 2 years ago

Guys, it would be best if you'd discuss matters about vocoder training on the appropriate repo, which is https://github.com/kan-bayashi/ParallelWaveGAN, and not repurpose closed issues for this. Or, you know, message each other on Discord.

Regarding preprocessing changes, you can read some of the older issues such as #8, as this question has cropped up a few times.