Closed · MMMMichaelzhang closed this issue 2 years ago
Sorry for the late reply. I was pretty busy at the end of my semester. I'm interested in singing conversion with the same architecture, especially with HiFi-GAN. Actually, I already have some ideas about how to improve the results.
HiFi-GAN does significantly improve the results, especially if you fine-tune it. The pre-trained HiFi-GAN likely won't work because the preprocessing is different. I will make my pre-trained HiFi-GAN available soon with the same preprocessing as this repo.
Other ways of improving the results include some tricks for incorporating the F0 features and some architecture changes. I will work more on it and maybe write a new paper later.
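One such trick (a hypothetical sketch, not this repo's actual code — the function name and normalization scheme are my own) is to condition the model on F0 by concatenating a normalized log-F0 contour with the mel-spectrogram along the channel axis:

```python
import numpy as np

def add_f0_channel(mel, f0, eps=1e-5):
    """Concatenate a normalized log-F0 contour to a mel-spectrogram.

    mel: (n_mels, n_frames) log-mel features
    f0:  (n_frames,) F0 in Hz, 0 for unvoiced frames
    Returns an (n_mels + 1, n_frames) conditioning input.
    """
    log_f0 = np.log(f0 + eps)
    voiced = f0 > 0
    if voiced.any():
        # Normalize using statistics of voiced frames only
        log_f0 = (log_f0 - log_f0[voiced].mean()) / (log_f0[voiced].std() + eps)
    log_f0[~voiced] = 0.0  # zero out unvoiced frames after normalization
    return np.concatenate([mel, log_f0[None, :]], axis=0)

mel = np.random.randn(80, 120)            # fake 80-band mel, 120 frames
f0 = np.abs(np.random.randn(120)) * 220.0  # fake F0 contour in Hz
cond = add_f0_channel(mel, f0)
print(cond.shape)  # (81, 120)
```

Whether the extra channel goes to the generator, the F0 network, or both is a design choice; the point is only that F0 must be normalized per utterance so speaker pitch range does not leak in.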
If I want to train HiFi-GAN, the config file is like this:
```yaml
allow_cache: true
batch_max_steps: 8400
batch_size: 16
config: conf/hifigan.v1.yaml
dev_dumpdir: dump/dev/norm
dev_feats_scp: null
dev_segments: null
dev_wav_scp: null
discriminator_adv_loss_params:
    average_by_discriminators: false
discriminator_grad_norm: -1
discriminator_optimizer_params:
    betas:
        - 0.5
        - 0.9
    lr: 0.0002
    weight_decay: 0.0
discriminator_optimizer_type: Adam
discriminator_params:
    follow_official_norm: true
    period_discriminator_params:
        bias: true
        channels: 32
        downsample_scales:
            - 3
            - 3
            - 3
            - 3
            - 1
        in_channels: 1
        kernel_sizes:
            - 5
            - 3
        max_downsample_channels: 1024
        nonlinear_activation: LeakyReLU
        nonlinear_activation_params:
            negative_slope: 0.1
        out_channels: 1
        use_spectral_norm: false
        use_weight_norm: true
    periods:
        - 2
        - 3
        - 5
        - 7
        - 11
    scale_discriminator_params:
        bias: true
        channels: 128
        downsample_scales:
            - 4
            - 4
            - 4
            - 4
            - 1
        in_channels: 1
        kernel_sizes:
            - 15
            - 41
            - 5
            - 3
        max_downsample_channels: 1024
        max_groups: 16
        nonlinear_activation: LeakyReLU
        nonlinear_activation_params:
            negative_slope: 0.1
        out_channels: 1
    scale_downsample_pooling: AvgPool1d
    scale_downsample_pooling_params:
        kernel_size: 4
        padding: 2
        stride: 2
    scales: 3
discriminator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_scheduler_type: MultiStepLR
discriminator_train_start_steps: 0
discriminator_type: HiFiGANMultiScaleMultiPeriodDiscriminator
distributed: false
eval_interval_steps: 1000
feat_match_loss_params:
    average_by_discriminators: false
    average_by_layers: false
    include_final_outputs: false
fft_size: 2048
fmax: 7600
fmin: 80
format: hdf5
generator_adv_loss_params:
    average_by_discriminators: false
generator_grad_norm: -1
generator_optimizer_params:
    betas:
        - 0.5
        - 0.9
    lr: 0.0002
    weight_decay: 0.0
generator_optimizer_type: Adam
generator_params:
    bias: true
    channels: 512
    in_channels: 80
    kernel_size: 7
    nonlinear_activation: LeakyReLU
    nonlinear_activation_params:
        negative_slope: 0.1
    out_channels: 1
    resblock_dilations:
        - - 1
          - 3
          - 5
        - - 1
          - 3
          - 5
        - - 1
          - 3
          - 5
    resblock_kernel_sizes:
        - 3
        - 7
        - 11
    upsample_kernel_sizes:
        - 10
        - 10
        - 8
        - 6
    upsample_scales:
        - 5
        - 5
        - 4
        - 3
    use_additional_convs: true
    use_weight_norm: true
generator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
generator_scheduler_type: MultiStepLR
generator_train_start_steps: 1
generator_type: HiFiGANGenerator
global_gain_scale: 1.0
hop_size: 300
lambda_adv: 1.0
lambda_aux: 45.0
lambda_feat_match: 2.0
log_interval_steps: 100
mel_loss_params:
    fft_size: 2048
    fmax: 12000
    fmin: 0
    fs: 24000
    hop_size: 300
    log_base: null
    num_mels: 80
    win_length: 1200
    window: hann
num_mels: 80
num_save_intermediate_results: 4
num_workers: 2
outdir: exp/train_nodev_csmsc_hifigan.v1
pin_memory: true
pretrain: ''
rank: 0
remove_short_samples: false
resume: exp/train_nodev_csmsc_hifigan.v1/checkpoint-2370000steps.pkl
sampling_rate: 24000
save_interval_steps: 10000
train_dumpdir: dump/train_nodev/norm
train_feats_scp: null
train_max_steps: 2500000
train_segments: null
train_wav_scp: null
trim_frame_size: 1024
trim_hop_size: 256
trim_silence: false
trim_threshold_in_db: 20
use_feat_match_loss: true
use_mel_loss: true
use_stft_loss: false
verbose: 1
version: 0.5.1
win_length: 1200
window: hann
```
What should I change to make it suitable for this project? @yl4579
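For what it's worth, the analysis settings that would have to agree with StarGANv2-VC's preprocessing (judging from the snippets posted in this thread; treat this as a sketch, not a verified recipe) are:

```yaml
# Frame analysis must match StarGANv2-VC's MelSpectrogram settings
sampling_rate: 24000
fft_size: 2048
hop_size: 300
win_length: 1200
num_mels: 80
# StarGANv2-VC uses torchaudio's defaults for the mel filterbank
# (fmin 0, fmax sr/2), not the 80-7600 Hz band in this config, and
# normalizes log-mels with fixed mean/std (-4, 4) rather than
# per-dimension dataset statistics.
fmin: 0
fmax: null
```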
I'm also using HiFi-GAN, but I decided to fine-tune from the 2.5M-step pre-trained model, which should be quicker, I think. But then there's this question I've been pondering: https://github.com/kan-bayashi/ParallelWaveGAN/issues/373
The pre-trained model's preprocessing is different; does it still work? @skol101
Pre-trained HiFi-GAN model: 24 kHz | fmin 80 / fmax 7600 | FFT 2048 / hop 300 / win 1200

My StarGANv2 config preprocess params:

```yaml
preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300
```
Or are you saying that we cannot use the pre-trained HiFi-GAN model because its dataset was normalized during preprocessing with a different algorithm, instead of the one proposed here:

```python
mel_tensor = (torch.log(1e-5 + mel_tensor) - mean) / std
```
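One thing worth noting about this normalization (a sketch in NumPy with the fixed `mean=-4`, `std=4` values from this repo; the function names are mine) is that it is exactly invertible, so converted features can at least be mapped back to raw mel energies before handing them to a vocoder trained on a different feature space:

```python
import numpy as np

MEAN, STD = -4.0, 4.0  # fixed statistics used by this repo's preprocessing

def normalize(mel, eps=1e-5):
    # Same formula as the repo: log of (eps + mel), then affine map
    return (np.log(eps + mel) - MEAN) / STD

def denormalize(norm_mel, eps=1e-5):
    # Inverse: undo the affine map, then the log
    return np.exp(norm_mel * STD + MEAN) - eps

mel = np.random.rand(80, 100) + 0.01  # fake non-negative mel energies
roundtrip = denormalize(normalize(mel))
print(np.allclose(roundtrip, mel))  # True
```

Of course, this only undoes the scaling; it does not fix a mismatched filterbank (fmin/fmax, log base), which is the harder part of the incompatibility.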
Yes, I mean the preprocessing is different:

```python
import torch
import torchaudio

to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    # mel_tensor = (torch.log(mel_tensor.unsqueeze(0)) - mean) / std
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor
```
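To make the mismatch concrete, here is a minimal illustration (with fake mel values) of the two conventions: ParallelWaveGAN's preprocessing takes log10-mels and standardizes them with per-dimension dataset statistics, while this repo takes the natural log and applies a fixed global mean/std of (-4, 4). The same audio therefore lands in very different feature spaces:

```python
import numpy as np

mel = np.random.rand(80, 100) + 1e-3  # fake linear mel energies

# ParallelWaveGAN-style: log10, then per-dimension statistics
# (in practice the stats come from the whole training set, not one clip)
pwg = np.log10(np.maximum(1e-10, mel))
pwg = (pwg - pwg.mean(axis=1, keepdims=True)) / pwg.std(axis=1, keepdims=True)

# This repo's style: natural log with fixed global statistics
repo = (np.log(1e-5 + mel) - (-4.0)) / 4.0

# Same input, clearly different feature values
print(np.abs(pwg - repo).mean())
```

So feeding this repo's normalized mels into a vocoder trained on ParallelWaveGAN-style features (or vice versa) fails even when the STFT parameters match, unless one converts between the two representations first.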
Does your HiFi-GAN model work? @skol101
Actually, after fine-tuning for 50k steps from the pre-trained model, I realised something was amiss and decided to train HiFi-GAN from scratch on just my dataset.
What changes have you made to train HiFi-GAN, such as the config and the preprocessing? @skol101
Guys, it would be best if you discussed vocoder-training matters on the appropriate repo, which is https://github.com/kan-bayashi/ParallelWaveGAN, rather than repurposing closed issues for this. Or, you know, Discord each other.
Regarding preprocessing changes, you can read some of the older issues such as #8, as this question has cropped up a few times.
The performance of speech conversion is good, but singing conversion is not ideal. If I do singing voice conversion, can you teach me how to use HiFi-GAN? HiFi-GAN also has a pre-trained model with the same parameters. Do you have any plans to improve singing conversion next?