xiph / LPCNet

Efficient neural speech synthesis
BSD 3-Clause "New" or "Revised" License

The 18 Bark-scale frequency bins are not normalized, which introduces spectral tilt. #213

Closed brian-smith-github closed 2 months ago

brian-smith-github commented 2 months ago

Hello, thanks for making LPCNet open source. While experimenting with it, I noticed that the code that generates the 18 Bark-scale bands, lpcn_compute_band_energy() in freq.c, does not normalize by band width, so the wide top bands produce much higher output than the narrow first few bands. When processing white noise, this appears to produce a 3 dB tilt from the bottom band to the top band. This inherent tilt might be hindering higher compression ratios in the DCT.

In speech recognition land (e.g. Whisper), the bands (usually mel-scale rather than Bark) are scaled by area (width) by default to keep the frequency response as flat as possible ('Slaney' normalization).

See: https://librosa.org/doc/main/generated/librosa.filters.mel.html
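To illustrate the point, here is a minimal sketch of width normalization on an idealized flat (white-noise) power spectrum. It assumes rectangular, non-overlapping bands with made-up edges; BAND_EDGES and band_energies() are hypothetical names, and this is not the band layout or weighting that lpcn_compute_band_energy() actually uses.

```python
import math

# Hypothetical band edges in FFT-bin units (an assumption for illustration,
# not the table from freq.c): narrow bands at the bottom, wide at the top.
BAND_EDGES = [0, 1, 2, 3, 4, 6, 8, 12, 16]


def band_energies(power, edges, normalize=False):
    """Sum the power spectrum over each [lo, hi) band.

    With normalize=True, each band energy is divided by the band width,
    which is the width (area) scaling described above.
    """
    energies = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        energy = sum(power[lo:hi])
        if normalize:
            energy /= (hi - lo)
        energies.append(energy)
    return energies


def to_db(values):
    """Convert linear energies to decibels."""
    return [10.0 * math.log10(v) for v in values]


# Idealized white noise: the same power in every FFT bin.
flat = [1.0] * BAND_EDGES[-1]

raw_db = to_db(band_energies(flat, BAND_EDGES))
norm_db = to_db(band_energies(flat, BAND_EDGES, normalize=True))

print("tilt without normalization (dB):", raw_db[-1] - raw_db[0])
print("tilt with normalization (dB):   ", norm_db[-1] - norm_db[0])
```

Without normalization, the bottom-to-top difference is 10*log10 of the ratio between the widest and narrowest band; with normalization, a flat input gives flat band energies.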

(I might be wrong about the outcome; I just thought it was worth pointing out.)

brian-smith-github commented 2 months ago

Actually, most of the tilt (2 dB) comes from the 0.85 pre-emphasis. However, dividing the energies by the band width still gives a small improvement.
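The pre-emphasis contribution can be sketched from the filter's frequency response. This assumes the usual first-order form y[n] = x[n] - 0.85*x[n-1]; preemph_gain_db() is a hypothetical helper, not a function from LPCNet.

```python
import math

PREEMPHASIS = 0.85  # coefficient quoted in the comment above


def preemph_gain_db(omega, a=PREEMPHASIS):
    """Gain in dB of y[n] = x[n] - a*x[n-1] at frequency omega (rad/sample).

    Assumes the standard first-order pre-emphasis form, for which
    |H(e^{jw})|^2 = |1 - a*e^{-jw}|^2 = 1 + a^2 - 2*a*cos(w).
    """
    power = 1.0 + a * a - 2.0 * a * math.cos(omega)
    return 10.0 * math.log10(power)


# Evaluate the gain from DC (0) up to the Nyquist frequency (pi rad/sample).
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    gain = preemph_gain_db(frac * math.pi)
    print(f"{frac:.2f} * Nyquist: {gain:+.2f} dB")
```

The response rises monotonically with frequency, so low bands are attenuated relative to high bands. How this maps onto the 2 dB figure depends on how the band energies were averaged and measured, which the comment does not state.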

brian-smith-github commented 2 months ago

Actual speech data looks flatter without the normalization than with it, so I'll close this. I get the impression that normalizing would not help after all. (Has this idea already been tried and discarded?)