yxlu-0102 / MP-SENet

MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra
MIT License

Question on Bandwidth extension task formulation #40

Open HimaJyothi17 opened 2 weeks ago

HimaJyothi17 commented 2 weeks ago

I have a question regarding the BWE task. My apologies if my question doesn't make sense.

It was mentioned in the journal version that, "For BWE, we use the PReLU activation to predict an unbounded high-frequency magnitude mask".

Question: The input narrowband signal has no high frequencies, i.e., the high-frequency bins are zero. If we multiply the high-frequency mask predicted by the magnitude decoder with the input magnitude, the result will still be zero in the high frequencies. How, then, does this architecture achieve bandwidth extension?

Also, are these three tasks (denoising, dereverberation, and BWE) trained independently?

Thanks for your time and patience in advance !

yxlu-0102 commented 2 weeks ago

In our other work on speech bandwidth extension, we used narrowband log-magnitude spectra as input, predicted the high-frequency log-magnitude spectra, and added them together to obtain the wideband log-magnitude spectra. Since adding log-magnitude spectra is equivalent to multiplying magnitude spectra, we found that bandwidth extension can be achieved by applying an unbounded mask to the magnitude spectra.
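This log-domain/linear-domain equivalence can be checked numerically. A minimal NumPy sketch with toy values (not code from this repository): adding a predicted high-frequency log-magnitude is identical to multiplying by the corresponding unbounded mask.

```python
import numpy as np

# Toy narrowband magnitude spectrum: the high-frequency bins are near zero.
narrowband_mag = np.array([1.0, 0.5, 1e-3, 1e-3])
# Hypothetical predicted high-frequency log-magnitude increments.
hf_log_gain = np.array([0.0, 0.0, 5.0, 6.0])

# Adding in the log-magnitude domain ...
wideband_log = np.log(narrowband_mag) + hf_log_gain

# ... equals multiplying by an unbounded mask in the linear domain,
# since exp(log(m) + g) = m * exp(g).
mask = np.exp(hf_log_gain)
wideband_mag = narrowband_mag * mask

assert np.allclose(np.exp(wideband_log), wideband_mag)
```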

Regarding your question, the high-frequency part of the magnitude spectrum of a speech waveform is not exactly zero after upsampling, but takes very small values close to zero. Therefore, a large-value mask can be used to predict the high-frequency magnitude spectrum. Here, we also applied power-law compression to narrow the range of this mask, making it easier to predict.
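The effect of power-law compression on the mask's range can be illustrated with toy numbers (a sketch with an assumed exponent of 0.3; the exact exponent and values are illustrative, not taken from the paper):

```python
import numpy as np

# Toy values: a strong low-frequency bin and a near-zero high-frequency bin.
narrowband_mag = np.array([1.0, 1e-4])
# Desired wideband magnitudes after extension (hypothetical targets).
target_mag = np.array([1.0, 1e-1])

# Mask required on raw magnitudes: target / input -> huge dynamic range.
mask_linear = target_mag / narrowband_mag          # [1.0, 1000.0]

# Mask required on power-law-compressed magnitudes (x -> x**c):
c = 0.3
mask_compressed = (target_mag ** c) / (narrowband_mag ** c)  # [1.0, ~7.94]

# Compression shrinks the range the network's mask must cover.
assert mask_compressed[1] < mask_linear[1]
```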

Additionally, in the paper, the models for these three tasks were trained separately. We also tried training a general model using all the data to handle these three tasks simultaneously. We found that the performance of this model slightly decreased in the tasks of speech denoising and bandwidth extension, but it improved in the dereverberation task. This improvement might be due to the inclusion of noisy data, which acts as data augmentation.

JangyeonKim commented 1 week ago

I have some questions about BWE task.

Currently, I am trying to apply the MP-SENet model to the BWE task. As written in the long version of the paper, I am conducting experiments with the VCTK dataset.

  1. I changed the lsigmoid() of the mask decoder to prelu(), but the loss becomes NaN as soon as training starts. LeakyReLU() showed the same phenomenon, so I am currently training with ReLU. Can you offer any advice on this issue?

```python
class MaskDecoder(nn.Module):
    def __init__(self, h, out_channel=1):
        super(MaskDecoder, self).__init__()
        self.dense_block = DenseBlock(h, depth=4)
        self.mask_conv = nn.Sequential(
            nn.ConvTranspose2d(h.dense_channel, h.dense_channel, (1, 3), (1, 2)),
            nn.Conv2d(h.dense_channel, out_channel, (1, 1)),
            nn.InstanceNorm2d(out_channel, affine=True),
            nn.PReLU(out_channel),
            nn.Conv2d(out_channel, out_channel, (1, 1))
        )
        self.lsigmoid = LearnableSigmoid_2d(h.n_fft // 2 + 1, beta=h.beta)
        self.prelu = nn.PReLU()

    def forward(self, x):
        x = self.dense_block(x)
        x = self.mask_conv(x)
        x = x.permute(0, 3, 2, 1).squeeze(-1)

        # # lsigmoid for denoising, dereverberation
        # x = self.lsigmoid(x).permute(0, 2, 1).unsqueeze(1)

        # PReLU for bandwidth extension
        x = self.prelu(x).permute(0, 2, 1).unsqueeze(1)

        return x
```
  2. When conducting experiments, aside from the metric scores, I found that the output samples contain audible artifacts (a buzzing-like sound). I am curious whether you have encountered the same issue.
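One possible source of the NaN in point 1 (a hypothesis, not a confirmed diagnosis): unlike the learnable sigmoid, PReLU is unbounded below, so the mask can go negative; a negative "magnitude" raised to a fractional power (as in power-law decompression) is NaN. The sketch below mimics PyTorch's PReLU (default slope 0.25) in NumPy with an assumed compression exponent of 0.3:

```python
import numpy as np

def prelu(x, slope=0.25):
    """PReLU with PyTorch's default initial negative slope of 0.25."""
    return np.where(x > 0, x, slope * x)

mag = np.array([0.5])                       # compressed input magnitude
mask = prelu(np.array([-2.0]))              # -> [-0.5]: unbounded below
masked = mag * mask                         # negative "magnitude": [-0.25]

# Fractional power of a negative number is NaN (same behavior as torch.pow),
# so any downstream power-law decompression would propagate NaNs into the loss.
decompressed = np.power(masked, 1 / 0.3)
assert np.isnan(decompressed).all()
```

If this is the cause, clamping the masked magnitude to be non-negative before decompression would be one way to keep PReLU while avoiding NaNs.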
jeffery-work commented 2 days ago

> In our other work on speech bandwidth extension, we used narrowband log-magnitude spectra as input, predicted the high-frequency log-magnitude spectra, and added them together to obtain the wideband log-magnitude spectra. Since adding log-magnitude spectra is equivalent to multiplying magnitude spectra, we found that bandwidth extension can be achieved by applying an unbounded mask to the magnitude spectra.
>
> Regarding your question, the high-frequency part of the magnitude spectrum of a speech waveform is a very small decimal close to zero after upsampling. Therefore, a large-value mask can be used to predict the high-frequency magnitude spectrum. Here, we also applied power-law compression to narrow the range of this mask, making it easier to predict.
>
> Additionally, in the paper, the models for these three tasks were trained separately. We also tried training a general model using all the data to handle these three tasks simultaneously. We found that the performance of this model slightly decreased in the tasks of speech denoising and bandwidth extension, but it improved in the dereverberation task. This improvement might be due to the inclusion of noisy data, which acts as data augmentation.

Great job! As mentioned above, was the "g_best" file in "/best_ckpt" trained for denoising? I found that it is not able to do bandwidth extension. Could you share your general model for the three tasks? I am interested in its PESQ improvement.