yxlu-0102 / MP-SENet

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement
MIT License

Bandwidth Extension Data-flow #50

Open eliran-fm opened 2 months ago

eliran-fm commented 2 months ago

Hi, I have two questions about adapting MP-SENet to the BWE task:

  1. You mention replacing the learnable sigmoid with a PReLU as the final activation in the MaskedDecoder. Should it be initialized as a single-parameter PReLU, or should it mirror the way the sigmoid was initialized (with n_fft//2+1 parameters)?
  2. Does the input have to be naively upsampled to the target SR beforehand? If so, does this relate to the part of the paper that mentions spline upsampling (from 4/8 kHz back to 16 kHz)?

Thanks

yxlu-0102 commented 2 months ago
  1. I defined the PReLU as follows: self.prelu = nn.PReLU(h.n_fft//2+1, init=-0.25)

  2. Yes, all the inputs are upsampled to the target SR, which follows the implementation of https://github.com/kuleshov/audio-super-res.
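
For readers following along, here is a minimal sketch of what the per-frequency-bin PReLU in answer 1 looks like in isolation. The `n_fft = 400` value, the tensor shape, and the variable names are assumptions for illustration and are not taken from the BWE code. Note that `nn.PReLU` applies its `num_parameters` weights along dimension 1 of the input, so the frequency axis has to sit there.

```python
import torch
import torch.nn as nn

n_fft = 400                # assumed value, for illustration only
n_freq = n_fft // 2 + 1    # number of frequency bins (201 here)

# One learnable negative-side slope per frequency bin, all starting at -0.25.
prelu = nn.PReLU(num_parameters=n_freq, init=-0.25)

# nn.PReLU indexes its weights along dim 1, so use (batch, freq, frames).
x = torch.randn(2, n_freq, 50)
y = prelu(x)

# At initialization: y = x for x >= 0, and y = -0.25 * x for x < 0.
expected = torch.where(x >= 0, x, -0.25 * x)
```

With the negative initial slope, negative inputs are mapped to positive outputs at initialization, so the activation starts out non-negative. By contrast, a plain `nn.PReLU()` has a single shared slope initialized to 0.25.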

eliran-fm commented 2 months ago
Thanks @yxlu-0102. Following up on 2: is using audio-super-res for downsampling supposed to provide a more realistic narrowband version of the inputs?

yxlu-0102 commented 2 months ago
I think using audio-super-res for downsampling leads to aliasing of the high-frequency components, but we used it in this paper for a fair comparison.

In another of our BWE works, we switched to a sinc filter for the downsampling and interpolation operations to avoid this aliasing.
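
To illustrate the aliasing point, here is a minimal NumPy sketch of windowed-sinc downsampling and interpolation. It is not the filter used in either paper, and the unfiltered decimation it compares against is only a stand-in for an aliasing-prone pipeline, not a reproduction of audio-super-res. The function names, the 129-tap Hamming-windowed design, and the test tones are all assumptions chosen for the example.

```python
import numpy as np

def sinc_lowpass(cutoff, num_taps=129):
    """Hamming-windowed sinc low-pass FIR.
    cutoff is in cycles per sample (0 < cutoff < 0.5)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n)   # ideal low-pass response
    h *= np.hamming(num_taps)                  # taper to reduce ripple
    return h / h.sum()                         # unity gain at DC

def downsample_sinc(x, factor, num_taps=129):
    """Low-pass at the new Nyquist frequency, then keep every factor-th sample."""
    h = sinc_lowpass(0.5 / factor, num_taps)
    return np.convolve(x, h, mode="same")[::factor]

def upsample_sinc(x, factor, num_taps=129):
    """Zero-stuff, then low-pass to remove spectral images."""
    up = np.zeros(len(x) * factor)
    up[::factor] = x
    h = sinc_lowpass(0.5 / factor, num_taps) * factor  # restore amplitude
    return np.convolve(up, h, mode="same")

def rms(x):
    return float(np.sqrt(np.mean(x ** 2)))

sr, factor = 16000, 4            # 16 kHz -> 4 kHz (new Nyquist: 2 kHz)
t = np.arange(sr) / sr

# A 5 kHz tone lies above the new Nyquist frequency.
high = np.sin(2 * np.pi * 5000 * t)
naive = high[::factor]                    # no filter: tone folds down to 1 kHz
filtered = downsample_sinc(high, factor)  # tone is attenuated before decimation

edge = 64                                 # skip filter edge transients
rms_naive = rms(naive[edge:-edge])
rms_filtered = rms(filtered[edge:-edge])

# A 500 Hz tone lies in the passband and should survive a round trip.
low = np.sin(2 * np.pi * 500 * t)
roundtrip = upsample_sinc(downsample_sinc(low, factor), factor)
err = float(np.max(np.abs(roundtrip[512:-512] - low[512:-512])))
```

Comparing `rms_naive` with `rms_filtered` shows how much out-of-band energy the unfiltered decimation folds into the narrowband signal, while `err` checks that in-band content is preserved. In practice a library resampler would likely be preferable to a hand-rolled filter.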