You could, but it wouldn't be a big improvement, and would more likely be a big degradation. Each frame must be decodable as-is with no previous frames. Which means no per-frame adaptation, unless you hack in keyframes and P-frames (please don't). So your adaptation would be limited to intra-frame only.

Since most data except random bits and the header is writing the PVQ pulses, that would be your main target. These don't even use a CDF, however, since the data there is already at a sufficiently high entropy to not benefit much from one. You'll need to rip out the entire code in encode_pulses and decode_pulses and replace it. The hard part would be figuring out what to put there that's more efficient than what we already have, since the current scheme can neatly encode a gigantic 172-entry vector of floats in a single 32-bit symbol if necessary (usually the vector is smaller), with zero prediction. Even if you do figure out a better way (the old CELT codec had a slightly better scheme than this, if slightly insane) and make it encode via a CDF, neural-network black-box magic wouldn't be of much help. The coefficients always follow a Laplacian distribution, so based on the position of the coefficient, the length of the vector, and a few other parameters, you could just build a static CDF yourself which fits the data reasonably well and is completely numerically stable and verifiable.

I think all of this is mostly a dead end though. An easier way to get better compression would be via applying Daala-style PVQ prediction to the vector for every PVQ split.

All of this is completely theoretical, by the way. Opus doesn't have a version field nor any reservations for future modifications - the codec is frozen solid. So any changes would have to live in a non-standard fork. And an even easier way would be to write a new encoder which adjusts the parameters better! You have no idea what hacks and heuristics libopus makes. And it wouldn't even result in a fork.
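To make the static-CDF idea above concrete, something along these lines would fit the data reasonably well. This is purely an illustrative sketch (the names, constants, and table layout are made up, not libopus code); libopus already does something in this spirit for the coarse energy coder in celt/laplace.c, but the pulse data itself is coded differently, as noted above.

```c
#include <math.h>
#include <stdint.h>

/* Illustrative only: build a static cumulative-frequency table for a
 * Laplacian-distributed coefficient magnitude.  'decay' would itself be
 * derived from the coefficient position, the vector length, and so on;
 * TOTAL stands in for the range coder's total frequency count. */
#define MAX_MAG 32
#define TOTAL   (1u << 15)

static void build_laplace_cdf(double decay, uint32_t cdf[MAX_MAG + 2])
{
    double p[MAX_MAG + 1];
    double sum = 0.0;
    int k;

    /* P(|x| = k) is proportional to exp(-decay * k). */
    for (k = 0; k <= MAX_MAG; k++) {
        p[k] = exp(-decay * k);
        sum += p[k];
    }

    /* Quantize to integer frequencies, giving every symbol at least 1 so the
     * range coder never sees a zero-probability symbol -- this is what makes
     * the table numerically safe and easy to verify. */
    cdf[0] = 0;
    for (k = 0; k <= MAX_MAG; k++) {
        uint32_t f = (uint32_t)(p[k] / sum * (TOTAL - (MAX_MAG + 1))) + 1u;
        cdf[k + 1] = cdf[k] + f;
    }
    /* Fold any rounding slack into the last symbol so the table sums to TOTAL. */
    cdf[MAX_MAG + 1] = TOTAL;
}
```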
Finally, the issue tracker isn't the place to post this at all. There's an IRC channel. I'm not on there, since it's 99.9% user questions rather than codec development. You could go to #daala on freenode; the channel is quiet enough, the original people behind Opus (they're not active these days) are there, and I'm there as well.
Thanks! I had no idea there was an IRC channel - it's not listed very prominently anywhere as a point of contact. Closing this issue, then.
Each frame must be decodable as-is with no previous frames. Which means no per-frame adaptation, unless you hack in keyframes and P-frames (please don't). So your adaptation would be limited to intra-frame only.
Yeah, this definitely makes it impossible to get much improvement from this approach; the whole and only point of using recurrent predictors à la WaveRNN is to have your CDF depend on a very long (multi-frame) context. If it's not possible to hack Opus to do that, then doing this work within Opus is certainly a dead end.
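To spell out (mostly for myself) why the standalone-packet requirement is fatal here, a toy sketch with made-up names - not libopus code:

```c
#include <string.h>

/* Toy illustration only -- hypothetical types and names.  'rnn_state' stands
 * in for the hidden state of a WaveRNN-style predictor. */
typedef struct { float h[256]; } rnn_state;

static void reset_state(rnn_state *st) { memset(st, 0, sizeof(*st)); }

/* Stand-in for "encode one frame, updating the predictor state as you go". */
static void encode_frame(const float *frame, int len, rnn_state *st)
{
    (void)frame; (void)len; (void)st; /* ... model + entropy coding ... */
}

static void encode_stream(const float *samples, int frame_len, int n_frames)
{
    rnn_state st;
    for (int f = 0; f < n_frames; f++) {
        /* Because every Opus packet must decode on its own, the probability
         * model cannot be conditioned on anything from earlier packets, so
         * the state has to be wiped here -- which throws away exactly the
         * multi-frame context a recurrent predictor is supposed to exploit. */
        reset_state(&st);
        encode_frame(samples + (size_t)f * frame_len, frame_len, &st);
    }
}
```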
Since most data except random bits and the header is writing the PVQ pulses, that would be your main target.
Thanks, that's valuable to know.
I think all of this is mostly a dead end though. An easier way to get better compression would be via applying Daala-style PVQ prediction to the vector for every PVQ split.
Thanks for the pointer.
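For my own notes, the reflection step behind Daala-style PVQ prediction looks roughly like this. The sketch is mine (not actual Daala or Opus code): given some unit-norm prediction r for the band, a Householder reflection maps r onto a coordinate axis, the same reflection is applied to the input, and then only an angle plus a smaller residual need to go through the usual PVQ machinery - the better the prediction, the fewer pulses the residual costs.

```c
#include <math.h>

#define MAX_N 176  /* illustrative upper bound on band size */

/* Reflect x across the hyperplane orthogonal to v = r + sign(r[m])*e_m,
 * where m is the largest-magnitude component of the prediction r.  This maps
 * r onto the m-th axis, so the part of x that the prediction explains
 * collapses into one coordinate (codable as an angle), leaving a smaller
 * residual for the PVQ search. */
static void householder_reflect(float *x, const float *r, int n)
{
    float v[MAX_N];
    float s, proj = 0.f, norm = 0.f;
    int i, m = 0;

    for (i = 1; i < n; i++)
        if (fabsf(r[i]) > fabsf(r[m])) m = i;
    s = (r[m] >= 0.f) ? 1.f : -1.f;

    for (i = 0; i < n; i++) {
        v[i] = r[i] + (i == m ? s : 0.f);
        norm += v[i] * v[i];
    }
    if (norm < 1e-15f) return;       /* degenerate prediction: skip */

    for (i = 0; i < n; i++) proj += x[i] * v[i];
    for (i = 0; i < n; i++) x[i] -= (2.f * proj / norm) * v[i];
}
```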
By the way, my comment was mainly targeting CELT. Doing this for the Silk part of the codec might be more beneficial, since Silk is a pretty ordinary-looking speech codec, and it has longer packets (60 ms). You'll still be limited by the fact that you can only have keyframes though, unless you're doing this as purely theoretical work, since pretty much all multimedia code assumes any audio packet is standalone decodable. Unfortunately, the only people I know of who understand Silk are Koen Vos (the Silk author) and Jean-Marc Valin (@jmvalin), so you'll have to find and ask one of them to get more info.
I am trying to understand a realistic path to a deployed audio codec with a neural network predictor in it. I am unfortunately fairly new to the internals of audio codecs, coming at this from the other side, and so am trying to understand more here.
Prior work on this has more or less entirely disregarded codecs and focused either on (a) using features extracted by a codec and feeding them to a neural vocoder or (b) using arithmetic coding on the audio itself. Neither of these feels like a practical solution to me as modeling audio at its frame rate is very computationally intensive and doesn't take advantage of any of the work done on developing modern audio codecs.
Deep neural nets are great at modeling long-term context in sequences and can produce accurate PDFs, which made me interested in the range coding step in Opus; however, as you point out, this is not cross-frame, and that more or less defeats the whole point of having a network that's capable of predicting 10s to 100s of ms of audio accurately. I believe I misinterpreted the graphic in the RFC and assumed that there is a single range coder applied to the stream right before it becomes a bitstream, rather than the range encoder being used separately by various parts of the code, which is what I observed when scanning the libopus C source.
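If I understand the code right, the hookup itself would be trivial either way - roughly the following, where the wrapper and the origin of the CDF are my own hypothetical sketch, but ec_encode() is the actual entry point of the libopus range coder (it just takes a cumulative-low/high/total triple and doesn't care where those numbers came from). The decoder side would use ec_decode()/ec_dec_update() against an identically reconstructed table, which is exactly why the model could only be conditioned on bits already coded earlier in the same frame.

```c
#include "entenc.h"   /* libopus entropy (range) encoder; declares ec_encode() */

/* Illustrative wrapper, not libopus code: 'cdf' is a cumulative frequency
 * table with cdf[0] == 0 and cdf[nsyms] == total, produced by whatever model
 * (a static table today, hypothetically a recurrent net) both encoder and
 * decoder can rebuild from already-coded data. */
static void encode_symbol_with_cdf(ec_enc *enc, int sym,
                                   const unsigned *cdf, unsigned total)
{
    ec_encode(enc, cdf[sym], cdf[sym + 1], total);
}
```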
Any pointers are of course appreciated, but I think it's unlikely that this is really an Opus question, since it doesn't sound like Opus is hackable (even on a fork!) to allow cross-frame modeling...
Have you looked at https://github.com/drowe67/codec2? It's designed for very-low-bitrate voice, where longer-term modelling might be more beneficial if you did an interframe version. I've no idea about the internals though, and the bitrate constraint is very punishing - at some point you might as well be doing speech-to-text! But you might, for example, be able to do better voicing estimation than the current heuristics.
The first question is whether your interest is academic or practical. Deploying a new codec isn't a simple thing. It requires a lot of effort and you'll need a codec that provides a significant improvement over the previous version. The neural probability modelling thing is mostly an academic exercise IMO. If you really do a good job at modelling the CDFs, I expect you should be able to save maybe 1 or 2 kb/s at most (possibly much less). We're far from what justifies a new (incompatible) format, but for an academic exercise you don't care about these sorts of things.
In terms of using deep learning in codecs, I think there's much more future in dealing with the signals themselves. See this 1.6 kb/s wideband vocoder I worked on some time ago for example: https://jmvalin.ca/demo/lpcnet_codec/
Hi! I am a researcher working on audio, including audio compression, using AI.
I am wondering:
Would it be possible / feasible to replace the static range coding PDFs used in Opus with dynamically predicted ones, using a deep recurrent net (similar to the VAD)? The concept is that using a deep net to estimate the probability distribution over the alphabet can result in a higher compression ratio.
I've read through the Opus spec and some of the source and found a few places that use the range coder, and it seems hypothetically possible. However, I am having a hard time finding anyone who is truly familiar with the internals of the Opus codec to help me determine whether this path is likely to be fruitful.
Thanks!