didadida-r opened this issue 6 months ago
Since our model is based on self-attention, the segment size may have an impact on the overall performance. The TensorBoard log should look like this:
Thank you for your response. I have adjusted the segment size to 32000. Here are the results from TensorBoard. After downloading the dataset and running the script, training is unstable, with a maximum PESQ of 3.3.
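As a quick sanity check on that value: assuming the 16 kHz sampling rate used by the datasets in this thread, a segment size of 32000 samples corresponds to the paper's 2-second segments:

```python
# Assumed 16 kHz sampling rate; segment_size taken from the comment above.
sampling_rate = 16000
segment_size = 32000

segment_seconds = segment_size / sampling_rate
print(segment_seconds)  # 2.0, i.e. the paper's 2-second segments
```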
Is your TensorBoard result [PESQ 3.56] reproduced from the official setup config and code?
This TensorBoard result is from our subsequent improvements. I deleted the previous results, but there were no issues when I ran it before. From the TensorBoard log you provided, it seems there is a problem with the reduction of the magnitude-spectrum loss.
Is the multi-GPU training process important? I am training the model using two GPUs.
The impact of multi-GPU training on the experimental results should be minimal.
Hello. Thank you for sharing the excellent code.
I am trying to replicate the performance on the DNS dataset after reading both the long and short versions of MP-SENet, but I am failing. I have matched all the configurations mentioned in the paper (2-sec segment, optimizer, lr, batch size, etc.). Could you provide some advice regarding this?
Additionally, for now, I am trying to see the performance of the generator without the Metric Discriminator. Could you share the loss graphs from the ablation study? All the images I attached are smoothed graphs with a 0.7 factor.
Any advice would be greatly appreciated.
```python
class MPNetLoss(nn.Module):
    def __init__(self, h):
        super(MPNetLoss, self).__init__()
        self.mse_loss = nn.MSELoss()
        self.l1_loss = nn.L1Loss()
        self.h = h

    def forward(self, mag_pred, pha_pred, com_pred, S_true):
        clean_audio = S_true
        clean_mag, clean_pha, clean_com = mag_pha_stft(clean_audio, self.h)
        enhanced_mag, enhanced_pha, enhanced_com = mag_pred, pha_pred, com_pred
        enhanced_audio = mag_pha_istft(enhanced_mag, enhanced_pha, self.h)

        loss_mag = self.mse_loss(clean_mag, enhanced_mag)
        loss_ip, loss_gd, loss_iaf = phase_losses(clean_pha, enhanced_pha, self.h)
        loss_pha = loss_ip + loss_gd + loss_iaf
        loss_com = self.mse_loss(clean_com, enhanced_com) * 2
        loss_time = self.l1_loss(clean_audio, enhanced_audio)

        # Re-analyze the enhanced waveform (iSTFT -> STFT) for the consistency loss.
        _, _, spec_for_consistency = mag_pha_stft(enhanced_audio, self.h)
        loss_con = self.mse_loss(enhanced_com, spec_for_consistency) * 2

        # loss_all = 0.9 * loss_mag + 0.3 * loss_pha + 0.1 * loss_com + 0.2 * loss_time
        loss_all = 0.9 * loss_mag + 0.3 * loss_pha + 0.1 * loss_com + 0.1 * loss_con

        return {
            'magnitude_loss': loss_mag,
            'phase_loss': loss_pha,
            'complex_loss': loss_com,
            # 'time_loss': loss_time,
            'consistency_loss': loss_con,
            'total_loss': loss_all,
        }
```
To aid in seeking advice, I am also attaching the code used to calculate the loss.
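As a side note for readers, the `istft -> stft` consistency term in the code above can be illustrated with plain `torch.stft`/`torch.istft` calls. This is a minimal self-contained sketch of the idea; the FFT size, hop, and window length are assumed values, not necessarily the repo's config, and it bypasses the repo's `mag_pha_stft` helpers:

```python
import torch

# Assumed STFT parameters for illustration only.
n_fft, hop, win = 400, 100, 400
window = torch.hann_window(win)

# A random "enhanced" complex spectrogram, which in general is not
# consistent, i.e. it need not correspond to any real waveform.
spec = torch.randn(1, n_fft // 2 + 1, 81, dtype=torch.complex64)

# The iSTFT -> STFT round trip projects the spectrogram onto the set of
# consistent spectrograms.
audio = torch.istft(spec, n_fft, hop_length=hop, win_length=win, window=window)
spec_rt = torch.stft(audio, n_fft, hop_length=hop, win_length=win,
                     window=window, return_complex=True)

# The consistency loss penalizes the gap between the two.
loss_con = torch.mean(torch.abs(spec - spec_rt) ** 2)
```

For a random spectrogram the round-trip gap is nonzero, which is exactly what the consistency loss drives down during training.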
Sorry, I deleted the corresponding TensorBoard files after completing the ablation experiment.
I reviewed your code, and it seems to be fine. Can you offer me the loss curves of your results on the DNS dataset?
Thank you for your answer. Are the loss curves you mentioned referring to the TensorBoard logs? I have attached the file for your reference.
The loss curve does look quite strange. The phase loss during training doesn’t seem to decrease significantly, and both the magnitude and phase losses during validation are very odd.
However, there is nothing wrong with the loss calculation code you provided. May I ask if the training was normal on the VoiceBank+DEMAND dataset?
I am currently training using only the DNS dataset.
I will download the 16kHz version of the VoiceBank+DEMAND dataset from your repository, apply it, and share the loss graph with you.
ok
Can you obtain the results reported in the paper for VoiceBank+DEMAND when you use the g_best model?
Oh, I forgot to report the experimental results. I am really sorry about that. I only used the MP-SENet model and trained it in my training framework, and I have not used the g_best model.
Currently, the above issue has been resolved, and I have achieved a result of 3.41 WB-PESQ on the DNS dataset and 3.44 WB-PESQ on the VoiceBank+DEMAND dataset.
Regarding “Since our model is based on self-attention, the segment size may have an impact on the overall performance”: I checked the AttentionModule implementation, and it seems the time-attention sequence length equals the number of frames in a segment. Why is such a long length needed?
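On the sequence-length point: with 2-second segments at 16 kHz, time attention runs over roughly `segment_size // hop_size` frames per segment. A quick sketch (the hop size and channel dimension here are assumed values; the real ones come from the repo's config):

```python
import torch
import torch.nn as nn

# Assumed config values (16 kHz audio, 2 s segments, hop of 100 samples).
segment_size, hop_size = 32000, 100
n_frames = segment_size // hop_size + 1  # time-attention sequence length
print(n_frames)  # 321

# Time attention: each frequency bin attends across all frames of the segment.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(8, n_frames, 64)  # (freq bins as batch, time, channels)
out, _ = attn(x, x, x)
print(out.shape)  # torch.Size([8, 321, 64])
```

So the attention length tracks the segment size directly, which is presumably why changing the segment size moves the results.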
Hi, I tried to train the model from scratch, but I failed to reproduce the results on VoiceBank+DEMAND: training is unstable and the results are poor. Could you give some advice? Thanks.
The TensorBoard log is: ![image](https://github.com/yxlu-0102/MP-SENet/assets/13691793/b5b4d094-42ac-4d3a-8a12-1f43c6a4f49e)
The only difference is the config, and the config diff is: