Open Charlottecuc opened 3 years ago
The normalization may have slightly amplified the noise, but the point of the log mel spectrogram is actually the opposite: it tries to emphasize the speech instead of the noise. The arbitrary mean and standard deviation may have some side effects, but if you make the model noise-robust during training, it should have no problem taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, it should have no problem dealing with noisy input. Just make sure you separate your x_real and x_input and only make x_input noisy.
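To make that separation concrete, here is a minimal sketch (not the repo's code; it assumes audiomentations and the repo's 24 kHz / 80-bin mel setup):

import soundfile as sf
import torch
import torchaudio
from audiomentations import Compose, AddGaussianNoise

augment = Compose([AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.3)])
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=24000, n_mels=80)

wave, sr = sf.read("sample.wav")
wave = wave.astype("float32")
clean = torch.from_numpy(wave)
noisy = torch.from_numpy(augment(samples=wave, sample_rate=sr))

x_real = torch.log(1e-5 + to_mel(clean))   # clean mel: the reconstruction target
x_input = torch.log(1e-5 + to_mel(noisy))  # corrupted mel: the only tensor fed to the model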
@yl4579 Hi. Thank you for your reply.
Could you give any advice on the percentage of noisy training files? Or should all the x_input files be corrupted?
I did some experiments and the results are not good. I'm not quite sure whether I wrongly separated x_real and x_input in https://github.com/yl4579/StarGANv2-VC/blob/main/losses.py
Thank you very much.
@Charlottecuc Sorry for the late reply. I was pretty busy at the end of the year. You can make all x_input corrupted, but I'd recommend setting each transformation to a probability of 0.3, so there will be some samples that are not corrupted.
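With audiomentations, the per-transformation probability is just the p argument, for example (parameters are illustrative, not tuned):

from audiomentations import AddBackgroundNoise, AddGaussianNoise, Compose, RoomSimulator

# Each transform fires independently with p=0.3, so a good share of
# samples passes through completely clean.
augment = Compose([
    RoomSimulator(p=0.3, leave_length_unchanged=True),
    AddBackgroundNoise(sounds_path="noises/", min_snr_in_db=10, max_snr_in_db=30, p=0.3),
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.3),
])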
@yl4579 Thank you for your reply. Just to make sure: when you say "making the model noise-robust during training", do you mean only corrupting the inputs of the generator's cycle-consistency loss, or corrupting the inputs of the whole adversarial training process (e.g. adding something like a "denoising loss" to make the discriminator capable of classifying between clean and noisy inputs and force the generator to produce clean outputs)? Could you give more details?
Thank you very much.
@Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.
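In loss terms, that could look like the following sketch (the helper itself is hypothetical; names follow the snippets later in this thread):

import torch.nn.functional as F

def denoising_recon_loss(nets, x_real, x_input, y_org, F0):
    # Encode the corrupted copy, but penalize distance to the clean mel,
    # so the model has to reconstruct the uncorrupted version.
    s_org = nets.style_encoder(x_input, y_org)
    x_rec = nets.generator(x_input, s_org, masks=None, F0=F0)
    return F.l1_loss(x_rec, x_real)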
@Charlottecuc Regarding mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std: have you fixed the noise problem? When I change mean = 0 and std = 1, the noise is gone, but it is too loud.
@yl4579 Please help me here: where's x_input actually? There's only x_real in trainer.py:
x_real, y_org, x_ref, x_ref2, y_trg, z_trg, z_trg2 = batch
@skol101 You need to pass in a noisy version here; call it x_input. The x_input is processed in meldataset.py with noises and reverberations.
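Concretely, if meldataset.py returns the extra tensor (and the collater passes it through), the unpack in trainer.py would grow by one element, something like:

# Sketch: the dataset now yields a corrupted copy of x_real as well.
x_real, y_org, x_ref, x_ref2, y_trg, z_trg, z_trg2, x_input = batch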
I see, because I thought reverb and noise should be added right in StyleEncoder as per https://github.com/yl4579/StarGANv2-VC/issues/6#issuecomment-1103393827
@Charlottecuc @yl4579 Are the noisy inputs added only when training the generator, or both the generator and the discriminator? Thank you!
The style encoder is called several times in the generator, but the only time it's called with the x_real param is in the cycle-consistency loss. So I guess that's where x_input should be used (but only 30% of the time).
What do you think @Charlottecuc?
@yl4579
I'm either doing something wrong, or adding reverbs and background noises does nothing. When the source (like VCTK p303_013.wav) has breathing, the converted speech has distortions. Maybe the issue is with the HiFi-GAN vocoder, and I shall try a vocoder more tolerant of breathing/noises.
# cycle-consistency loss
s_org = nets.style_encoder(x_input, y_org)
x_rec = nets.generator(x_fake, s_org, masks=None, F0=GAN_F0_fake)
loss_cyc = torch.mean(torch.abs(x_rec - x_real))
In meldataset.py
def __getitem__(self, idx):
    data = self.data_list[idx]
    mel_tensor, label = self._load_data(data)
    ref_data = random.choice(self.data_list)
    ref_mel_tensor, ref_label = self._load_data(ref_data)
    ref2_data = random.choice(self.data_list_per_class[ref_label])
    ref2_mel_tensor, _ = self._load_data(ref2_data)
    # x_input is the same as mel_tensor (aka x_real) but with augmenter corruptions
    x_input, _ = self._load_data(data, True)
    return mel_tensor, label, ref_mel_tensor, ref2_mel_tensor, ref_label, x_input

def _load_tensor(self, data, corrupt_x_input=False):
    wave_path, label = data
    label = int(label)
    wave, sr = sf.read(wave_path)
    if corrupt_x_input and random.uniform(0, 1) <= 0.3:
        augmenter = Compose(
            [
                RoomSimulator(
                    p=0.8,
                    leave_length_unchanged=True,
                ),
                AddBackgroundNoise(
                    sounds_path=BACKGROUND_NOISE_FILES,
                    min_snr_in_db=20,
                    max_snr_in_db=35,
                    p=0.5,
                ),
            ]
        )
        try:
            wave = augmenter(samples=wave, sample_rate=sr)
        except IndexError:
            print('index error with wav file', wave_path)
        except ValueError:
            print('value error with wav file', wave_path)
    wave_tensor = torch.from_numpy(wave).float()
    return wave_tensor, label
@Charlottecuc this issue should be reopened to discuss further.
Wow, is this really a mystery, @yl4579?
I think it's not a good idea to add data augmentation in the style encoder, since the source audio stream does not flow into the style encoder at inference time. In fact, I tried almost all the denoising methods mentioned, whether suggested in this post or elsewhere. Some of them can reduce the artifacts to some extent, but overall the model is not stable with noisy source waves. It's not like PPG-based models, for which you can easily and clearly design a denoising loss.
The issue was closed by @yl4579, and I am not able to reopen it. I agree that it should be reopened.
Training a denoising HiFi-GAN cannot greatly improve the results for the current issue, because if you look at the mel-spectrograms, you can see that some parts are vague and unclear when the quality of the source wave is low.
Here it was reported that adding reverb/background noises did help: https://github.com/yl4579/StarGANv2-VC/issues/6#issuecomment-1108044051
Maybe the solution is to denoise the input wave before proceeding with inference, so something like the Facebook denoiser can be used, but this suggestion points to using a noise-trained vocoder: https://github.com/yl4579/StarGANv2-VC/issues/6#issuecomment-1099653031
https://github.com/facebookresearch/denoiser https://github.com/rishikksh20/hifigan-denoiser
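For reference, the denoiser README shows roughly this usage (check the repo for the current API), which could be run on the source wave before conversion:

import torch
import torchaudio
from denoiser import pretrained
from denoiser.dsp import convert_audio

model = pretrained.dns64()                                   # pretrained DNS64 model
wav, sr = torchaudio.load('noisy_source.wav')
wav = convert_audio(wav, sr, model.sample_rate, model.chin)  # match the model's format
with torch.no_grad():
    denoised = model(wav[None])[0]                           # enhanced waveform
torchaudio.save('denoised_source.wav', denoised, model.sample_rate)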
If you would like to train an any-to-any model, then adding data augmentation to the style encoder will help. If you denoise the source wave before inference, some distortions caused by noise will disappear, but a new issue will arise, since most denoising models weaken the voice while eliminating noise, leading to new VC distortions; this problem has been confirmed by many VC papers. Training a denoising HiFi-GAN might help, but may not achieve what you expect for VC issues, because when inferring from a noisy wave there might be some mispronunciations in the converted mels, which cannot be post-corrected by the vocoder.
@Charlottecuc Sorry, I'm pretty busy with my other paper submissions, so I can't join the discussion at this point, but I have reopened the issue for further discussion and will provide some feedback after I finish my work.
@Charlottecuc I do have some time now to discuss this problem. I have noticed similar problems with noisy input and have not yet come up with a good solution. The major problem with the GAN-based model is that it is difficult to design denoising loss functions, because the target is not as clear as in PPG- or TTS-based VC models (in those cases you have an L1 reconstruction loss directly). Not sure if you have found any good solution to this problem, but I would suggest adding some noises in the time-frequency domain by inverting the mel scale and recomputing it (or you can train a model end-to-end if you prefer).
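A rough librosa sketch of that suggestion (illustrative parameters; the mel inversion is lossy, and the repo's natural-log mels would need exp/log around it):

import numpy as np
import librosa

def corrupt_via_time_domain(mel_power, sr=24000, noise_std=0.005):
    # Invert the (power-scale) mel spectrogram to audio, add noise in the
    # time domain, then recompute the mel spectrogram.
    wave = librosa.feature.inverse.mel_to_audio(mel_power, sr=sr)
    wave = wave + noise_std * np.random.randn(len(wave))
    return librosa.feature.melspectrogram(y=wave, sr=sr, n_mels=mel_power.shape[0])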
The key here is to add noise to the converted speech and force the model to convert the converted speech back to the clean output, because one problem I noticed is that even if you add noise to the input during training, the model sometimes does not produce good converted examples. It somehow finds a way to trick the loss function so that the converted speech is not clear, yet the second conversion back to the source domain works quite well, so the cycle-consistency loss is still low. Adding noise to the converted results forces the model to denoise the noisy speech directly. Another way is to add a denoising loss directly, where the input is noisy speech with the source style vector and the output is clean speech. This might make the model overfit, however, so the converted speech might not sound similar to the target. This is in general a challenge in this field, and there is still a lot of work to be done.
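As a sketch, the first variant only changes the cycle branch; additive Gaussian noise stands in here for a proper corruption, and the variable names follow the earlier snippets:

# Hypothetical modification of the cycle-consistency branch.
x_fake = nets.generator(x_real, s_trg, masks=None, F0=GAN_F0_real)  # converted speech
x_fake_noisy = x_fake + 0.05 * torch.randn_like(x_fake)             # corrupt the conversion
x_rec = nets.generator(x_fake_noisy, s_org, masks=None, F0=GAN_F0_fake)
loss_cyc = torch.mean(torch.abs(x_rec - x_real))                    # must denoise to score well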
This does a pretty good job of removing noises from the speech https://github.com/Rikorose/DeepFilterNet
Another approach that works is to first train the model on a clean dataset and, once it is trained, freeze the model parameters and add two enhancement blocks, one to the encoder and one to the style encoder, to enhance the noisy voice in the feature domain using synthetically distorted data. We use the embeddings extracted from clean samples by the original frozen encoders as targets, and train the newly added enhancement blocks by minimizing the L1 distance between those targets and the outputs that the encoders with enhancement blocks produce from the distorted samples.
You can refer to our paper https://arxiv.org/pdf/2210.11096.pdf which shows the figures and results on distorted/noisy samples using StarGANv2-vc model architecture.
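A minimal sketch of that training loop (encoder, EnhancementBlock, and loader are hypothetical stand-ins; see the paper for the actual architecture):

import torch
import torch.nn.functional as F

for p in encoder.parameters():       # pretrained encoder stays frozen
    p.requires_grad = False

enhancer = EnhancementBlock()        # hypothetical block bolted onto the encoder
opt = torch.optim.Adam(enhancer.parameters(), lr=1e-4)

for x_clean, x_distorted in loader:  # paired clean / synthetically distorted samples
    with torch.no_grad():
        target = encoder(x_clean)            # clean embedding is the training target
    pred = enhancer(encoder(x_distorted))    # enhanced embedding of the distorted input
    loss = F.l1_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()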
Hi @mayank-git-hub, I have a similar application to your idea: I want to convert whispered or distorted speech. I do not have much knowledge of fine-tuning models; can you help me out?
Hi. I tested the model with various kinds of wave files as sources. I notice that at inference time the model performs well with clean source files, but for less clean audio files (e.g. 24 kHz speech recorded by a mobile phone, with air conditioning in the background, or heavy breathing, which is quite common in real-life applications), the converted speech is sometimes incomprehensible and usually has annoying noise.
I also tried denoising these noisy source files (e.g. using Audition or other speech enhancement tools), but the converted speech became even worse.
Besides, do you think this line
mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std
to some extent amplifies the noise? Could you please give some ideas for making the model more robust to noisy data? Thank you very much.
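One way to check whether the hard-coded mean/std are hurting is to estimate them from the training corpus instead; a rough one-off sketch (to_mel and train_waves are assumed stand-ins):

import torch

logmels = [torch.log(1e-5 + to_mel(w)) for w in train_waves]
all_vals = torch.cat([m.flatten() for m in logmels])
print(all_vals.mean().item(), all_vals.std().item())  # plug into (log_mel - mean) / std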