Closed MMMMichaelzhang closed 2 years ago
Sorry for the late reply. I was pretty busy at the end of my semester. I'm interested in singing conversion with the same architecture, especially with HiFi-GAN. Actually, I already have some ideas for how to improve the results.
HiFi-GAN does significantly improve the results, especially if you fine-tune it. The publicly available pre-trained HiFi-GAN likely won't work because its preprocessing is different. I will make my pre-trained HiFi-GAN available soon, with the same preprocessing as this repo.
Other ways of improving the results include some tricks for incorporating the F0 features and some architecture changes. I will work more on it and maybe write a new paper later.
If I want to train HiFi-GAN, the config file looks like this:
allow_cache: true
batch_max_steps: 8400
batch_size: 16
config: conf/hifigan.v1.yaml
dev_dumpdir: dump/dev/norm
dev_feats_scp: null
dev_segments: null
dev_wav_scp: null
discriminator_adv_loss_params:
  average_by_discriminators: false
discriminator_grad_norm: -1
discriminator_optimizer_params:
  betas:
  - 0.5
  - 0.9
  lr: 0.0002
  weight_decay: 0.0
discriminator_optimizer_type: Adam
discriminator_params:
  follow_official_norm: true
  period_discriminator_params:
    bias: true
    channels: 32
    downsample_scales:
    - 3
    - 3
    - 3
    - 3
    - 1
    in_channels: 1
    kernel_sizes:
    - 5
    - 3
    max_downsample_channels: 1024
    nonlinear_activation: LeakyReLU
    nonlinear_activation_params:
      negative_slope: 0.1
    out_channels: 1
    use_spectral_norm: false
    use_weight_norm: true
  periods:
  - 2
  - 3
  - 5
  - 7
  - 11
  scale_discriminator_params:
    bias: true
    channels: 128
    downsample_scales:
    - 4
    - 4
    - 4
    - 4
    - 1
    in_channels: 1
    kernel_sizes:
    - 15
    - 41
    - 5
    - 3
    max_downsample_channels: 1024
    max_groups: 16
    nonlinear_activation: LeakyReLU
    nonlinear_activation_params:
      negative_slope: 0.1
    out_channels: 1
  scale_downsample_pooling: AvgPool1d
  scale_downsample_pooling_params:
    kernel_size: 4
    padding: 2
    stride: 2
  scales: 3
discriminator_scheduler_params:
  gamma: 0.5
  milestones:
  - 200000
  - 400000
  - 600000
  - 800000
discriminator_scheduler_type: MultiStepLR
discriminator_train_start_steps: 0
discriminator_type: HiFiGANMultiScaleMultiPeriodDiscriminator
distributed: false
eval_interval_steps: 1000
feat_match_loss_params:
  average_by_discriminators: false
  average_by_layers: false
  include_final_outputs: false
fft_size: 2048
fmax: 7600
fmin: 80
format: hdf5
generator_adv_loss_params:
  average_by_discriminators: false
generator_grad_norm: -1
generator_optimizer_params:
  betas:
  - 0.5
  - 0.9
  lr: 0.0002
  weight_decay: 0.0
generator_optimizer_type: Adam
generator_params:
  bias: true
  channels: 512
  in_channels: 80
  kernel_size: 7
  nonlinear_activation: LeakyReLU
  nonlinear_activation_params:
    negative_slope: 0.1
  out_channels: 1
  resblock_dilations:
  - - 1
    - 3
    - 5
  - - 1
    - 3
    - 5
  - - 1
    - 3
    - 5
  resblock_kernel_sizes:
  - 3
  - 7
  - 11
  upsample_kernel_sizes:
  - 10
  - 10
  - 8
  - 6
  upsample_scales:
  - 5
  - 5
  - 4
  - 3
  use_additional_convs: true
  use_weight_norm: true
generator_scheduler_params:
  gamma: 0.5
  milestones:
  - 200000
  - 400000
  - 600000
  - 800000
generator_scheduler_type: MultiStepLR
generator_train_start_steps: 1
generator_type: HiFiGANGenerator
global_gain_scale: 1.0
hop_size: 300
lambda_adv: 1.0
lambda_aux: 45.0
lambda_feat_match: 2.0
log_interval_steps: 100
mel_loss_params:
  fft_size: 2048
  fmax: 12000
  fmin: 0
  fs: 24000
  hop_size: 300
  log_base: null
  num_mels: 80
  win_length: 1200
  window: hann
num_mels: 80
num_save_intermediate_results: 4
num_workers: 2
outdir: exp/train_nodev_csmsc_hifigan.v1
pin_memory: true
pretrain: ''
rank: 0
remove_short_samples: false
resume: exp/train_nodev_csmsc_hifigan.v1/checkpoint-2370000steps.pkl
sampling_rate: 24000
save_interval_steps: 10000
train_dumpdir: dump/train_nodev/norm
train_feats_scp: null
train_max_steps: 2500000
train_segments: null
train_wav_scp: null
trim_frame_size: 1024
trim_hop_size: 256
trim_silence: false
trim_threshold_in_db: 20
use_feat_match_loss: true
use_mel_loss: true
use_stft_loss: false
verbose: 1
version: 0.5.1
win_length: 1200
window: hann
What should I change to make it suitable for this project? @yl4579
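Before changing anything, it may help to confirm that the numbers in the config agree with each other. The sketch below is only an illustration, not part of the vocoder toolkit: `check_vocoder_config` is a hypothetical helper, and it assumes that the last four list entries in the generator section (5, 5, 4, 3) are the generator's upsampling factors. A HiFi-GAN generator turns each mel frame into as many samples as the product of those factors, so that product should equal `hop_size`, and a training segment of `batch_max_steps` samples should hold a whole number of frames.

```python
import math

# Values copied from the config dump above.
hop_size = 300
batch_max_steps = 8400
upsample_scales = [5, 5, 4, 3]  # assumption: these are the generator's upsampling factors


def check_vocoder_config(hop_size, batch_max_steps, upsample_scales):
    """Hypothetical helper: return a list of problems; an empty list means consistent."""
    problems = []
    total_upsampling = math.prod(upsample_scales)
    if total_upsampling != hop_size:
        problems.append(
            f"upsampling product {total_upsampling} != hop_size {hop_size}"
        )
    if batch_max_steps % hop_size != 0:
        problems.append(
            f"batch_max_steps {batch_max_steps} is not a multiple of hop_size {hop_size}"
        )
    return problems


problems = check_vocoder_config(hop_size, batch_max_steps, upsample_scales)
frames_per_segment = batch_max_steps // hop_size
print("problems:", problems)
print("mel frames per training segment:", frames_per_segment)
```

If the hop length of the feature extractor is ever changed, the same check shows which of the other values would have to change with it.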
I'm also using HiFi-GAN, but I decided to fine-tune from the 2.5M-step pre-trained model, which should be quicker, I think. But then there's this question I keep pondering.
The pretrained model's preprocessing is different. Does it work? @skol101
pretrained hifigan model: 24k | 80-7600 | 2048 / 300 / 1200
my StarGANv2 config preprocess params
sr: 24000
n_fft: 2048
win_length: 1200
hop_length: 300
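The comparison above can be written down explicitly. This is a minimal sketch that uses only the numbers quoted in this thread; the dictionary keys and the `compare_params` helper are made up for illustration, and `None` marks values the StarGANv2 config snippet does not state. Matching STFT parameters are necessary for reusing a vocoder, but they say nothing about the mel frequency range or the normalization.

```python
# Parameters quoted in the thread; None marks values the thread does not state.
pretrained_vocoder = {
    "sr": 24000, "n_fft": 2048, "hop_length": 300, "win_length": 1200,
    "fmin": 80, "fmax": 7600,
}
stargan_config = {
    "sr": 24000, "n_fft": 2048, "hop_length": 300, "win_length": 1200,
    "fmin": None, "fmax": None,
}


def compare_params(a, b):
    """Hypothetical helper: label each parameter as match, mismatch, or unknown."""
    report = {}
    for key in sorted(set(a) | set(b)):
        value_a, value_b = a.get(key), b.get(key)
        if value_a is None or value_b is None:
            report[key] = "unknown"
        elif value_a == value_b:
            report[key] = "match"
        else:
            report[key] = "mismatch"
    return report


report = compare_params(pretrained_vocoder, stargan_config)
for key, status in report.items():
    print(f"{key}: {status}")
```

Anything reported as unknown still has to be checked in the actual feature-extraction code before the two models can be assumed compatible.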
Or are you saying that we cannot use the pre-trained HiFi-GAN model because its dataset was normalized during preprocessing with a different algorithm, rather than the way it is done here:
mel_tensor = (torch.log(1e-5 + mel_tensor) - mean) / std
Yes, I mean the preprocessing is different:
import torch
import torchaudio

to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def preprocess(wave):
    # wave: 1-D NumPy array of audio samples
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    # mel_tensor = (torch.log(mel_tensor.unsqueeze(0)) - mean) / std
    # Add 1e-5 before the log so silent frames do not produce log(0).
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor
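To see why the normalization matters, here is a scalar sketch of the formula above and its inverse, written with the standard library only. The constants come from the snippet in this thread; the function names are invented for illustration. `to_log10` only converts the log base: it assumes, without confirmation from this thread, that the other vocoder expects base-10 log-mels, and it does not account for a different mel filterbank or for dataset-level mean and variance statistics.

```python
import math

# Constants from the preprocess() snippet quoted in this thread.
MEAN, STD, EPS = -4.0, 4.0, 1e-5


def normalize(mel_value):
    """Scalar version of (log(1e-5 + mel) - mean) / std."""
    return (math.log(EPS + mel_value) - MEAN) / STD


def denormalize(normalized):
    """Invert normalize() and recover the raw (non-log) mel value."""
    return math.exp(normalized * STD + MEAN) - EPS


def to_log10(normalized):
    """Re-express the normalized value as a base-10 log-mel.

    Assumption: only the log base differs. Filterbank range and
    dataset statistics are not handled here.
    """
    return (normalized * STD + MEAN) / math.log(10.0)


raw = 0.5
roundtrip = denormalize(normalize(raw))
print("normalized:", normalize(raw))
print("roundtrip error:", abs(roundtrip - raw))
print("log10 view:", to_log10(normalize(raw)))
```

A vocoder trained on one scaling will see inputs in the wrong numeric range if it is fed features produced with the other, which is one plausible reason fine-tuning from a mismatched checkpoint can sound off.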
Does your HiFi-GAN model work? @skol101
Actually, after fine-tuning for 50k steps from the pretrained model, I realised something was amiss and decided to train HiFi-GAN from scratch on just my dataset.
What changes did you make to train HiFi-GAN, for example to the config and to the HiFi-GAN preprocessing? @skol101
Guys, it would be best to discuss vocoder training on the appropriate repo rather than repurposing closed issues for this. Or, you know, Discord each other.
Regarding preprocessing changes you can read some of the older issues such as #8 as this is a question that cropped up a few times.
Speech conversion works well, but singing conversion is not ideal. If I want to do singing voice conversion, can you show me how to use HiFi-GAN? HiFi-GAN also has a pre-trained model with the same parameters. Do you have any plans to improve singing conversion next?