xiph / rnnoise

Recurrent neural network for audio noise reduction
BSD 3-Clause "New" or "Revised" License
3.99k stars 889 forks source link

Better understanding the feature extraction process #13

Open alebzk opened 6 years ago

alebzk commented 6 years ago

Hi, first of all thank you for your amazing work!

I've been looking at the code and I am interested in better understanding the feature extraction process. I have some questions, I will do my best to be as detailed as possibe.

Let me start from compute_frame_features() in denoise.c. The first step is computing the energy in the Opus bands (done in frame_analysis()) and then the pitch is estimated/tracked. pitch_downsample() performs the 20ms frames downsampling by halving the samples using a [0.25, 0.5, 0.25] kernel around the even samples. Is this a less expensive way to perform downsampling and low-pass filtering jointly to avoid aliasing?

Then _celt_autocorr() is called with a lag of just 4 samples on the downsampled sequence (hence, 24kHz). Why just 4? What is achieved exactly with such a low lag? And if I look at _celt_autocorr(), I don't understand why the autocorrelation computed by celt_pitch_xcorr() is modified afterwards by summing the autocorrelation for different lags. After that, the autocorrelation is further modified once _celt_autocorr() returns (below the // Noise floor -40 dB comment). Why is that done? And finally _celt_lpc() is called and the LPC coefficients modified (I mean lpc2) and used to filter the downsampled sequence via celt_fir5(). This whole part is a bit obscure to me and it's also hard to understand where some constants come from (e.g., ac[0] *= 1.0001f; and ac[i] -= ac[i]*(.008f*i)*(.008f*i);) - I've found some possible mappings between the dB and the linear scale, but I'm not fully sure. Overall, pitch_downsample() looks like a pre-processing step before the pitch is sought in pitch_search(). It would be great if you can share details on what is done.

My apologies if I am asking something that may be obvious to others. I'm to some extent familiar with LPC and auto-correlation, a little bit with pitch tracking. That's probably why I can't grasp all the details in the code.

Cheers, Alessio

alebzk commented 6 years ago

Hi again,

It looks like the pitch estimation method you implemented is SIFT [1]. Right? If there are other relevant details on the exact method, could you explain and/or share a reference?

Alessio

[1] Markel, J. (1972). The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5), 367-377.

alebzk commented 6 years ago

While waiting for an answer, I've checked the code better and managed to answer some of my questions. Plus, I got new ones.

This is what I've learned:

My remaining questions are the following:

Alessio

jmvalin commented 6 years ago

@alebzk the pitch estimation code is mostly copied from Opus, so it's quite possible everything in there isn't optimal for rnnoise. The idea as you seem to have guessed is to compute the auto-correlation on the residual. The (.008fi)(.008f*i) term here is means to approximate a Gaussian for lag windowing (stabilizes the LPC analysis). As for the c1 term, it's meant to convolve the LPC filter with a slight low-pass filter to make the analysis better. I'm sure it would be possible to run the pitch even lower than 12 kHz, but that seemed fine for the use I had.

Why is pitch_search() called passing PITCH_MAX_PERIOD-3*PITCH_MIN_PERIOD instead of PITCH_MAX_PERIOD for the max_pitch argument?

This avoids searching for very short periods, which can sometimes cause false detection due to formants. The very short periods are searched through remove_doubling()

In find_best_pitch() the numerators are xcorr[i]^2, why not just xcorr[i]

Yes, because we want to maximize xy/sqrt(xx*yy) but rather than compute a sqrt(), we just square everything and since the xx term is constant, we only need to maximize xy^2/yy

remove_doubling() uses second_check[16] = {0, 0, 3, 2, 3, 2, 5, 2, 3, 2, 3, 2, 5, 2, 3, 2}, it looks like these are multipliers used to look for specific (sub)harmonics; where do those values come from?

They were hand-tuned to look for expected peaks that wouldn't be expected for a different pitch period. For example, if there's a peak at T/6, then we can expect one at 5*T/6, which is a position you wouldn't expect to find a peak if the period was T/3.

remove_doubling() looks at higher harmonics of the initial estimated pitch period, hence removing pitch period doubling errors; why there's no removal for halving errors?

Period doubling (or tripling) is a common error for auto-correlation-based pitch estimators since if there's a periodicity at T, then there's also going to be a periodicity at 2T and 3T, ... OTOH, there won't be a periodicity at T/2.

alebzk commented 6 years ago

Thanks so much for the detailed answer!

zhly0 commented 6 years ago

hi,@alebzk: from your issue,I learned a lot,I am new to pitch detection.If my training sample's sample rate is 16000,is the macro define PITCH_BUF_SIZE need to change?What is the relation between the sample rate and the PITCH_BUF_SIZE? expecting your reply,Thanks!

alebzk commented 6 years ago

Hi @zhly0,

@jmvalin can surely say more, but I'm happy to give a first answer.

Yes. PITCH_BUF_SIZE unit is number of samples at 48 kHz; hence, it has to be adapted if the sample rate changes. However, note that the whole pitch estimation algorithm is kind of specific for the 48k case (there are 2 downsampling steps and the estimated pitch is used to compute spectral the cross-correlation features). If you want to avoid resampling from 16k to 48k, then part of the code must be adapted.

Alessio

teslam commented 6 years ago

@alebzk thanks for your write up, very helpful (altough im just half way through). (and thanks to jmvalin ofcourse!)

I dont understand your comment: "The autocorrelation with a maximum lag of 4 is computed since the pitch is estimated not on the audio frames directly, but on their LP residuals." I dont understand where the residual is calculated (residual as in difference between x_lp and lpcestimated). Are you referring to: opus_val32 sum = SHL32(EXTEND32(x[i]), SIG_SHIFT)?

Thanks again

jmvalin commented 6 years ago

See this call: celt_fir5(x_lp, lpc2, x_lp, len>>1, mem); in pitch_downsample.c

teslam commented 6 years ago

@jmvalin thanks! I have gone through that function plenty of times but double checked now when you wrote, and I think my problem comes from the fact that lpc coefficients are defined with opposite sign compared to how I thought they were.

I now assume that you define them as e.g. Matlab do: https://se.mathworks.com/help/signal/ref/lpc.html.

Thanks.

shakingWaves commented 4 years ago

@jmvalin @alebzk I do not understand the pitch estimation method,especially, the "pitch_search" function in pitch.c. can you share me some details?

zuowanbushiwo commented 2 years ago

@shakingWaves @zhly0 @jmvalin @alebzk I also do not understand the pitch estimation method, If my training sample's sample rate is 16000, how to change pitch_downsample,pitch_search,remove_doubling those funtions? Thanks!