yxlu-0102 / MP-SENet

MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

Failing to reproduce the paper results when training from scratch #15

Open didadida-r opened 6 months ago

didadida-r commented 6 months ago

Hi, I tried to train the model from scratch, but I failed to reproduce the results on VoiceBank+DEMAND; the training is unstable and the results are bad. Can you give some advice? Thanks.

The TensorBoard log is attached: [screenshot]

The only difference is the config; the diff is:

```diff
 {
-    "batch_size": 4,
+    "batch_size": 8,
@@ -13,12 +13,12 @@
-    "segment_size": 32000,
+    "segment_size": 16000,
-    "num_workers": 4,
+    "num_workers": 8,
```
yxlu-0102 commented 6 months ago

Since our model is based on self-attention, the segment size may have an impact on the overall performance. The TensorBoard log should look like this:

[TensorBoard screenshot]
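
For a rough sense of how `segment_size` maps to the self-attention sequence length along the time axis, here is a small sketch; the hop size of 100 samples is an assumption taken from the repo's default config, and the helper below is purely illustrative:

```python
# Hypothetical helper: estimate how many STFT frames (i.e., the time-attention
# sequence length) one training segment produces. hop_size=100 is an assumption
# based on the repo's default config.
def num_stft_frames(segment_size: int, hop_size: int = 100) -> int:
    # torch.stft with center=True yields segment_size // hop_size + 1 frames
    return segment_size // hop_size + 1

print(num_stft_frames(16000))  # 161 frames with the shortened segment
print(num_stft_frames(32000))  # 321 frames with the default segment
```

Halving the segment from 32000 to 16000 samples roughly halves the number of time frames each attention layer sees during training.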
didadida-r commented 6 months ago

Thank you for your response. I have adjusted the segment size to 32000; here are the results from TensorBoard. After downloading the dataset and running the script, training is still unstable, with a maximum PESQ of 3.3.

Was your TensorBoard result [PESQ 3.56] produced with the official setup config and code?

[TensorBoard screenshot]

yxlu-0102 commented 6 months ago

This TensorBoard result is from our subsequent improvements. I deleted the previous results, but there were no issues when I ran the code before. From the TensorBoard logs you provided, it seems there is a problem with the reduction of the magnitude-spectrum loss.

didadida-r commented 6 months ago

Is the multi-GPU training process important? I am training the model using two GPUs.

yxlu-0102 commented 6 months ago

The impact of multi-GPU training on the experimental results should be minimal.

JangyeonKim commented 2 months ago

[three loss-curve screenshots]

Hello. Thank you for sharing the excellent code.

I read the long and short versions of MP-SENet and am now trying to replicate the performance on the DNS dataset, but I am failing. I have matched all the configurations mentioned in the paper (2 s segments, optimizer, learning rate, batch size, etc.). Could you provide some advice on this?

Additionally, for now I am looking at the performance of the generator alone, without the Metric Discriminator. Could you share the loss graphs from your ablation study? All of the attached images are smoothed with a factor of 0.7.

Any advice would be greatly appreciated.

JangyeonKim commented 2 months ago

```python
import torch.nn as nn

# mag_pha_stft, mag_pha_istft and phase_losses are the helpers from the MP-SENet
# training code (imported from the repo and called here with the config object h).

class MPNetLoss(nn.Module):
    def __init__(self, h):
        super(MPNetLoss, self).__init__()
        self.mse_loss = nn.MSELoss()
        self.l1_loss = nn.L1Loss()
        self.h = h

    def forward(self, mag_pred, pha_pred, com_pred, S_true):
        clean_audio = S_true
        clean_mag, clean_pha, clean_com = mag_pha_stft(clean_audio, self.h)
        enhanced_mag, enhanced_pha, enhanced_com = mag_pred, pha_pred, com_pred
        enhanced_audio = mag_pha_istft(enhanced_mag, enhanced_pha, self.h)

        # magnitude spectrum loss
        loss_mag = self.mse_loss(clean_mag, enhanced_mag)

        # phase losses: instantaneous phase, group delay, instantaneous angular frequency
        loss_ip, loss_gd, loss_iaf = phase_losses(clean_pha, enhanced_pha, self.h)
        loss_pha = loss_ip + loss_gd + loss_iaf

        # complex spectrum loss
        loss_com = self.mse_loss(clean_com, enhanced_com) * 2

        # time-domain loss (computed but not included in the total below)
        loss_time = self.l1_loss(clean_audio, enhanced_audio)

        # STFT consistency loss: istft -> stft of the enhanced signal
        _, _, spec_for_consistency = mag_pha_stft(enhanced_audio, self.h)
        loss_con = self.mse_loss(enhanced_com, spec_for_consistency) * 2

        # loss_all = 0.9 * loss_mag + 0.3 * loss_pha + 0.1 * loss_com + 0.2 * loss_time
        loss_all = 0.9 * loss_mag + 0.3 * loss_pha + 0.1 * loss_com + 0.1 * loss_con

        return {
            'magnitude_loss': loss_mag,
            'phase_loss': loss_pha,
            'complex_loss': loss_com,
            # 'time_loss': loss_time,
            'consistency_loss': loss_con,
            'total_loss': loss_all
        }
```

To aid in seeking advice, I am also attaching the code used to calculate the loss.
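
For context, a minimal sketch of how a loss module like this might be driven in a training step; the generator call signature, optimizer, and batch tensors below are assumptions, not taken from the thread:

```python
# Hypothetical training-step sketch; `model`, `optimizer`, `h`, and the batch
# tensors (noisy_mag, noisy_pha, clean_audio) are placeholders.
criterion = MPNetLoss(h)

mag_pred, pha_pred, com_pred = model(noisy_mag, noisy_pha)  # generator forward pass
losses = criterion(mag_pred, pha_pred, com_pred, clean_audio)

optimizer.zero_grad()
losses['total_loss'].backward()
optimizer.step()
```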

yxlu-0102 commented 2 months ago

Sorry, I deleted the corresponding TensorBoard files after completing the ablation experiment.

I reviewed your code, and it seems to be fine. Could you share the loss curves of your results on the DNS dataset?

JangyeonKim commented 2 months ago

log.zip

Thank you for your answer. Are the loss curves you mentioned referring to the tensorboard logs? I have attached the file for your reference.

yxlu-0102 commented 2 months ago

The loss curve does look quite strange. The phase loss during training doesn’t seem to decrease significantly, and both the magnitude and phase losses during validation are very odd.

However, there is nothing wrong with the loss calculation code you provided. May I ask if the training was normal on the VoiceBank+DEMAND dataset?

JangyeonKim commented 2 months ago

I am currently training using only the DNS dataset.

I will download the 16kHz version of the VoiceBank+DEMAND dataset from your repository, apply it, and share the loss graph with you.

yxlu-0102 commented 2 months ago

ok

yunzqq commented 2 weeks ago

> I am currently training using only the DNS dataset.
>
> I will download the 16kHz version of the VoiceBank+DEMAND dataset from your repository, apply it, and share the loss graph with you.

Were you able to obtain the results reported in the paper on VoiceBank+DEMAND when using the g_best model?

JangyeonKim commented 2 weeks ago

Oh, I forgot to report the experimental results. I am really sorry about that. I only used the MP-SENet model and trained it in my training framework, and I have not used the g_best model.

Currently, the above issue has been resolved, and I have achieved a result of 3.41 WB-PESQ on the DNS dataset and 3.44 WB-PESQ on the VoiceBank+DEMAND dataset.

[results screenshot]
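
As a reference for these numbers, a minimal sketch of how a per-utterance WB-PESQ score is commonly computed with the `pesq` package; the file names are placeholders, and the repo's own evaluation script may differ:

```python
# Sketch only: wide-band PESQ for one clean/enhanced pair (both 16 kHz).
import soundfile as sf
from pesq import pesq

clean, sr = sf.read("clean.wav")       # placeholder path: reference waveform
enhanced, _ = sf.read("enhanced.wav")  # placeholder path: enhanced waveform

score = pesq(sr, clean, enhanced, "wb")  # ITU-T P.862.2 wide-band mode
print(f"WB-PESQ: {score:.2f}")
```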

jeffery-work commented 1 day ago

> Since our model is based on self-attention, the segment size may have an impact on the overall performance. The TensorBoard log should look like this: [TensorBoard screenshot]

Regarding "the segment size may have an impact on the overall performance": I checked the AttentionModule implementation, and it seems the time-attention sequence length is the segment length. Why is such a long length needed?