yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

Inference with noisy source #21

Open Charlottecuc opened 3 years ago

Charlottecuc commented 3 years ago

Hi. I tested the model with various kinds of wave files as the source. I noticed that at inference time the model performs well with clean source files, but for less clean audio files (e.g. 24 kHz speech recorded by a mobile phone, with air-conditioning background noise or heavy breathing, which is quite common in real-life applications), the converted speech is sometimes incomprehensible and usually carries annoying noise.

I also tried denoising these noisy source files (e.g. using Audition or other speech enhancement tools), but the converted speech became even worse.

Besides, do you think this line mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std amplifies the noise to some extent?

Could you please share some ideas for making the model more robust to noisy data? Thank you very much.

yl4579 commented 3 years ago

The normalization may have amplified the noise slightly, but the point of the log mel spectrogram is actually the opposite: it emphasizes the speech rather than the noise. The arbitrary mean and standard deviation may have some side effects, but if you make the model noise-robust during training, it should have no problem taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, the model should have no problem dealing with noisy input. Just make sure you separate your x_real and x_input and only make x_input noisy.
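
For illustration, a minimal sketch using the audiomentations library (the transforms and parameters here are placeholders, not from the repo):

from audiomentations import Compose, AddGaussianNoise, RoomSimulator

# x_real stays clean; only its copy x_input is corrupted.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    RoomSimulator(p=0.5, leave_length_unchanged=True),
])

x_input_wave = augment(samples=x_real_wave, sample_rate=24000)
# x_real_wave supervises the losses; x_input_wave is what the encoder sees.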

Charlottecuc commented 2 years ago

@yl4579 Hi. Thank you for your reply. Could you give any advice on the percentage of noisy training files? Or should all the x_input files be corrupted? I did some experiments and the results were not good. I'm not quite sure whether I wrongly separated x_real and x_input in https://github.com/yl4579/StarGANv2-VC/blob/main/losses.py. Thank you very much.

yl4579 commented 2 years ago

@Charlottecuc Sorry for the late reply. I was pretty busy at the end of the year. You can corrupt all of the x_input files, but I'd recommend setting each transformation to a probability of 0.3, so there will be some samples that are not corrupted.
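
Concretely, each transform in the earlier sketch would get p=0.3:

from audiomentations import Compose, AddGaussianNoise, RoomSimulator

# With p=0.3 per transform, a given x_input sample often escapes
# one or both corruptions entirely.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.3),
    RoomSimulator(p=0.3, leave_length_unchanged=True),
])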

Charlottecuc commented 2 years ago

The normalization may have amplified the noise slightly, but the point of the log mel spectrogram is actually the opposite: it emphasizes the speech rather than the noise. The arbitrary mean and standard deviation may have some side effects, but if you make the model noise-robust during training, it should have no problem taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, the model should have no problem dealing with noisy input. Just make sure you separate your x_real and x_input and only make x_input noisy.

@yl4579 Thank you for your reply. Just to make sure: when you say "making the model noise-robust during training", do you mean corrupting only the inputs of the generator's cycle-consistency loss, or corrupting the inputs of the whole adversarial training process (e.g. adding something like a "denoising loss" to make the discriminator capable of classifying between clean and noisy inputs and force the generator to produce clean outputs)? Could you give more details?

Thank you very much.

yl4579 commented 2 years ago

@Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.
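
For illustration, one way this could look in losses.py (variable names follow the repo's conventions; treating x_input as the corrupted copy of x_real, and the exact placement, are assumptions):

# The generator consumes the corrupted mel (x_input), while the
# reconstruction target stays the clean mel (x_real).
x_fake = nets.generator(x_input, s_trg, masks=None, F0=GAN_F0_real)
x_rec = nets.generator(x_fake, s_org, masks=None, F0=GAN_F0_fake)
loss_cyc = torch.mean(torch.abs(x_rec - x_real))  # reconstruct the clean source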

MMMMichaelzhang commented 2 years ago

Regarding mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std: have you fixed the noise problem? When I change mean = 0, std = 1, the noise is gone, but the output is too loud. @Charlottecuc
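
For reference, the two variants being compared (self.mean and self.std are the dataset statistics used in the original code):

mel_a = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std  # original normalization
mel_b = torch.log(1e-5 + mel_tensor)                           # with mean = 0, std = 1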

skol101 commented 2 years ago

The normalization may have amplified the noise slightly, but the point of the log mel spectrogram is actually the opposite: it emphasizes the speech rather than the noise. The arbitrary mean and standard deviation may have some side effects, but if you make the model noise-robust during training, it should have no problem taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, the model should have no problem dealing with noisy input. Just make sure you separate your x_real and x_input and only make x_input noisy.

@yl4579 Please help me here: where is x_input actually? There's only x_real in trainer.py:

x_real, y_org, x_ref, x_ref2, y_trg, z_trg, z_trg2 = batch

yl4579 commented 2 years ago

@skol101 You need to pass in a noisy version here; call it x_input. The x_input is processed in meldataset.py with added noise and reverberation.
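
For example, if the dataset returns the corrupted copy as an extra element (as in the snippets posted below), the unpacking in trainer.py could become:

# Hypothetical change: meldataset.py also yields x_input, the corrupted
# twin of x_real, as the last element of each batch.
x_real, y_org, x_ref, x_ref2, y_trg, z_trg, z_trg2, x_input = batch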

skol101 commented 2 years ago

I see. I thought reverb and noise should be added right in the StyleEncoder, as per https://github.com/yl4579/StarGANv2-VC/issues/6#issuecomment-1103393827

Kristopher-Chen commented 2 years ago

@Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.

@Charlottecuc @yl4579 Are the noisy inputs added only when training the generator, or when training both the generator and the discriminator? Thank you!

skol101 commented 2 years ago

@Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.

The style encoder is called several times in the generator's losses, but the only time it's called with the x_real param is in the cycle-consistency loss. So I guess that's where x_input should be used (but only 30% of the time).

What do you think, @Charlottecuc?

skol101 commented 2 years ago

@yl4579

I'm either doing something wrong, or adding reverb and background noise does nothing. When the source (like VCTK p303_013.wav) contains breathing, the converted speech has distortions. Maybe the issue is with the HiFi-GAN vocoder, and I should try a vocoder more tolerant of breathing/noise.

# cycle-consistency loss
s_org = nets.style_encoder(x_input, y_org)                         # style comes from the corrupted source
x_rec = nets.generator(x_fake, s_org, masks=None, F0=GAN_F0_fake)
loss_cyc = torch.mean(torch.abs(x_rec - x_real))                   # the clean mel is the target

In meldataset.py

# requires: random, soundfile as sf, torch, and
# from audiomentations import Compose, RoomSimulator, AddBackgroundNoise

def __getitem__(self, idx):
    data = self.data_list[idx]
    mel_tensor, label = self._load_data(data)
    ref_data = random.choice(self.data_list)
    ref_mel_tensor, ref_label = self._load_data(ref_data)
    ref2_data = random.choice(self.data_list_per_class[ref_label])
    ref2_mel_tensor, _ = self._load_data(ref2_data)
    # x_input is the same utterance as mel_tensor (aka x_real),
    # but with augmenter corruptions applied
    x_input, _ = self._load_data(data, True)
    return mel_tensor, label, ref_mel_tensor, ref2_mel_tensor, ref_label, x_input

def _load_tensor(self, data, corrupt_x_input=False):
    # _load_data forwards its second argument here as corrupt_x_input
    wave_path, label = data
    label = int(label)
    wave, sr = sf.read(wave_path)

    # corrupt only ~30% of the x_input samples
    if corrupt_x_input and random.uniform(0, 1) <= 0.3:
        augmenter = Compose(
            [
                RoomSimulator(
                    p=0.8,
                    leave_length_unchanged=True,
                ),
                AddBackgroundNoise(
                    sounds_path=BACKGROUND_NOISE_FILES,
                    min_snr_in_db=20,
                    max_snr_in_db=35,
                    p=0.5,
                ),
            ]
        )
        try:
            wave = augmenter(samples=wave, sample_rate=sr)
        except (IndexError, ValueError) as err:
            print('augmentation error with wav file', wave_path, err)
    wave_tensor = torch.from_numpy(wave).float()
    return wave_tensor, label

skol101 commented 2 years ago

@Charlottecuc this issue should be reopened for further discussion.

skol101 commented 2 years ago

Wow, is this really a mystery, @yl4579?

Charlottecuc commented 2 years ago

@Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.

The style encoder is called several times in the generator's losses, but the only time it's called with the x_real param is in the cycle-consistency loss. So I guess that's where x_input should be used (but only 30% of the time).

What do you think, @Charlottecuc?

I don't think it's a good idea to add data augmentation to the style encoder, since the source audio stream does not flow into the style encoder at inference time. In fact, I tried almost all of the mentioned denoising methods, whether suggested in this thread or elsewhere. Some of them can reduce the artifacts to some extent, but overall the model is not stable with noisy source waves. It's not like PPG-based models, for which you can easily and clearly design a denoising loss.

Charlottecuc commented 2 years ago

@Charlottecuc this issue should be reopened for further discussion.

The issue was closed by @yl4579, and I am not able to reopen it. I agree that it should be reopened.

Charlottecuc commented 2 years ago

@yl4579

I'm either doing something wrong, or adding reverb and background noise does nothing. When the source (like VCTK p303_013.wav) contains breathing, the converted speech has distortions. Maybe the issue is with the HiFi-GAN vocoder, and I should try a vocoder more tolerant of breathing/noise.


Training a denoising HiFi-GAN cannot substantially improve the results for this issue, because if you look at the mel-spectrograms, you can see that some parts are vague and unclear when the quality of the source wave is low.

skol101 commented 2 years ago

Here it was reported that adding reverb/background noise did help: https://github.com/yl4579/StarGANv2-VC/issues/6#issuecomment-1108044051

Maybe the solution is to denoise the input wave before proceeding with inference, so something like the Facebook denoiser can be used, but this suggestion points to using a noise-trained vocoder: https://github.com/yl4579/StarGANv2-VC/issues/6#issuecomment-1099653031

https://github.com/facebookresearch/denoiser
https://github.com/rishikksh20/hifigan-denoiser

Charlottecuc commented 2 years ago

Here it was reported that adding reverb/background noise did help #6 (comment)

Maybe the solution is to denoise the input wave before proceeding with inference, so something like the Facebook denoiser can be used, but this suggestion points to using a noise-trained vocoder #6 (comment)

If you would like to train an any-to-any model, then adding data augmentation to the style encoder will help. If you denoise the source wave before inference, some distortions caused by noise will disappear, but a new issue will arise, since most denoising models weaken the voice while eliminating noise and thereby introduce new VC distortions. This problem has been confirmed by many VC papers. Training a denoising HiFi-GAN might help, but may not achieve what you expect for VC issues, because when inferring from a noisy wave there may be mispronunciations in the converted mels, which cannot be corrected afterwards by the vocoder.

yl4579 commented 2 years ago

@Charlottecuc Sorry, I'm pretty busy with my other paper submissions, so I can't join the discussion at this point, but I have reopened the issue for further discussion and will provide some feedback after I finish my work.

yl4579 commented 2 years ago

@Charlottecuc I do have some time now to discuss this problem. I have noticed similar problems with noisy input and have not yet come up with a good solution. The major problem with the GAN-based model is that it is difficult to design denoising loss functions, because the target is not as clear as in PPG- or TTS-based VC models (where you directly have an L1 reconstruction loss). I'm not sure if you have found any good solution to this problem, but I would suggest adding noise in the time-frequency domain by inverting the mel scale and recomputing the mel spectrogram afterwards (or you can train a model end-to-end if you prefer).

The key here is to add noise to the converted speech and force the model to convert the converted speech back to the clean output, because one problem I noticed is that even if you add noise to the input during training, the model sometimes does not produce good converted samples. It somehow finds a way to trick the loss function so that the converted speech is not clear, yet the second conversion back to the source domain works quite well, so the cycle-consistency loss is still low. Adding noise to the converted results forces the model to denoise the noisy speech directly. Another way is to add a denoising loss directly, where the input is noisy speech with the source style vector and the output is clean speech. However, this might make the model overfit, so the converted speech might not sound similar to the target. This is in general a challenge in this field, and there's still a lot of work to be done.
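
For illustration, a rough sketch of both ideas, assuming a hypothetical corrupt() helper (the torchaudio round trip, the 24 kHz / 80-mel / n_fft=2048 setup, and the noise level are all assumptions, and the repo's log-normalization is ignored here; names like nets, s_trg, and GAN_F0_real follow the repo's loss code):

import torch
import torchaudio

# Hypothetical corrupt(): invert the mel scale, add noise in the
# time-frequency domain, and recompute the mel, as suggested above.
inv_mel = torchaudio.transforms.InverseMelScale(n_stft=1025, n_mels=80, sample_rate=24000)
mel_scale = torchaudio.transforms.MelScale(n_mels=80, sample_rate=24000, n_stft=1025)

def corrupt(mel):
    spec = inv_mel(mel)                         # linear-frequency spectrogram
    spec = spec + 0.01 * torch.rand_like(spec)  # illustrative additive noise
    return mel_scale(spec)

# 1) Corrupt the *converted* speech before cycling it back, so the model
#    cannot hide information in a noisy x_fake and must actually denoise.
x_fake = nets.generator(x_real, s_trg, masks=None, F0=GAN_F0_real)
x_rec = nets.generator(corrupt(x_fake), s_org, masks=None, F0=GAN_F0_fake)
loss_cyc = torch.mean(torch.abs(x_rec - x_real))

# 2) The explicit denoising loss: noisy source + source style -> clean source.
x_denoised = nets.generator(corrupt(x_real), s_org, masks=None, F0=GAN_F0_real)
loss_denoise = torch.mean(torch.abs(x_denoised - x_real))

If InverseMelScale proves too slow or blocks gradients in (1), adding noise directly to the mel tensor is a cheaper stand-in for corrupt().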

skol101 commented 2 years ago

This does a pretty good job of removing noise from speech: https://github.com/Rikorose/DeepFilterNet

mayank-git-hub commented 1 year ago

Another approach that works is to first train the model on a clean dataset and, once it is trained, freeze the model parameters and add two enhancement blocks to the encoder and the style encoder that enhance the noisy voice in the feature domain using synthetically distorted data. We use the embeddings extracted from clean samples by the original frozen encoders as targets, and train the newly added enhancement blocks by minimizing the L1 distance between these targets and the outputs obtained from the distorted samples by the encoders with enhancement blocks.

You can refer to our paper https://arxiv.org/pdf/2210.11096.pdf, which shows the figures and results on distorted/noisy samples using the StarGANv2-VC model architecture.
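
For illustration, a minimal sketch of that training step (EnhancementBlock, frozen_encoder, and all dimensions are hypothetical stand-ins; the paper above describes the actual blocks):

import torch
import torch.nn as nn

# Hypothetical EnhancementBlock: a small residual module appended to the
# frozen encoder to map distorted features back onto clean ones.
class EnhancementBlock(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, h):       # h: (batch, dim, time) features
        return h + self.net(h)  # residual refinement

# frozen_encoder stands in for the pretrained (and frozen) encoder.
for p in frozen_encoder.parameters():
    p.requires_grad = False

enhance = EnhancementBlock()
opt = torch.optim.Adam(enhance.parameters(), lr=1e-4)

def train_step(x_clean, x_distorted):
    with torch.no_grad():
        target = frozen_encoder(x_clean)         # clean embedding is the target
    pred = enhance(frozen_encoder(x_distorted))  # enhanced noisy embedding
    loss = torch.mean(torch.abs(pred - target))  # L1 distance
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()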

Patelraj8694 commented 6 months ago

Hi @mayank-git-hub, I have an application similar to your idea: I want to convert whispered or distorted speech. I don't have much knowledge of fine-tuning models; could you help me out?