xiph / flac

Free Lossless Audio Codec
https://xiph.org/flac/
GNU Free Documentation License v1.3

[question] Forcing only `verbatim` and `constant` modes #656

Closed vadimkantorov closed 6 months ago

vadimkantorov commented 10 months ago

Is it possible to force the encoder to use only verbatim and constant modes?

The idea is to have a bitstream that encodes LPCM-like raw audio without any entropy/residual coding, but compresses long intervals of exact silence very efficiently using run-length encoding. So I thought it might be possible to (ab)use the FLAC bitstream format (and the flac program) as a standardized representation of raw LPCM audio, optimized for exact silence (multi-person conference call recordings are a typical example of such audio).

Thanks a lot :)

ktmf01 commented 10 months ago

There are undocumented testing options for that. You can use -l 0 to disable LPC, and --disable-fixed-subframes to disable fixed subframes. However, the latter is a testing option and may change or be removed at any time.

Why would you want to do this? FLAC's compression is very efficient and very, very fast. What do you hope to gain by disabling compression?

vadimkantorov commented 10 months ago

Thank you for your explanations!

Can we force flac to do individual channel encoding for the 2-channel case? Or, e.g., force the flac encoder to encode the input channels as separate Ogg streams?

> and --disable-fixed-subframes to disable fixed subframes.

What is the difference between SUBFRAME_FIXED and SUBFRAME_VERBATIM? Yeah, I'd like to have only verbatim subframes (for encoding the non-silence as PCM) and constant subframes (for encoding silence efficiently using RLE). Would --disable-fixed-subframes be enough for this?

> Why would you want to do this?

The use case is inference in an offline/batch-processing speech recognition pipeline (which runs server-side on GPUs). The inputs are Opus-encoded phone call files, 15-60 minutes long. To maximize the performance and utilisation of the GPUs, we need to feed them decoded audio fast. Since there are currently no GPU Opus decoders, we need to move Opus decoding to a separate set of big multicore CPU machines and then transfer the decoded audio to the GPU machines. Since these are phone call / conference call recordings, the decoded channels contain a lot of silence, which wastes network bandwidth.

This can be solved by using some RLE-enabled format. I thought that flac format can be used for this goal, as further processing of such format can be done fast and in parallel and be almost directly fed into the speech recognition model (which consumes wave forms).

Ideally, for this use case (saving bandwidth between the Opus-decoding machines and the speech-recognition machines) we just need decoding that is parallelized/non-sequential, vectorized (making use of AVX), and maximally simple and hackable (ideally supporting batches of multiple audio files), maybe as in http://users.cecs.anu.edu.au/~Eric.McCreath/papers/YeMcCreathISPA2018.pdf or https://github.com/Harinlen/GPUraku / https://github.com/peplin/pflac. Also, on a GPU we can run some iDCT (or integral iDCT) or some other linear transform like this in a very efficient way. As we don't need to play back the file in this use case, we can use vectorized scans over the whole file, and a format that allows this would be useful. It would be super interesting if some meaningful decoding could be achieved by some tricky usage of numpy/pytorch ops. Then one could use modern fusion features like torch.compile to make this code portable and fast.

H2Swine commented 10 months ago

As for FLAC encoding on a GPU, there is FLACCL in the https://github.com/gchudov/cuetools.net / http://cue.tools application. But then you cannot use the reference flac.exe's syntax.

> Can we force flac to do individual channel encoding for the 2-channel case?

--no-mid-side, as used in presets 0 and 4, will encode left and right separately. (If it isn't stereo, FLAC must encode all channels separately.)

> and --disable-fixed-subframes to disable fixed subframes.

> What is the difference between SUBFRAME_FIXED and SUBFRAME_VERBATIM?

FLAC can accommodate linear prediction + residual in one of two ways. The "LPC" subframes store the weights for past samples; the "FIXED" subframes use a predictor that is specified by the format, for example predicting the next sample as 2*current - previous (which corresponds to "current" + "rate of change"). One advantage, aside from speed, is that the signal can be reconstructed using only addition and subtraction, without multiplication.
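To illustrate the fixed predictors described above, here is a minimal Python sketch, not the reference implementation. The function names are made up for this example; the coefficient table lists the fixed predictors of orders 0 to 4, and the order-2 entry is the "2*current - previous" case. The decoder works on running differences, so it only ever adds and subtracts. It assumes the input has at least `order` samples.

```python
# Fixed predictor coefficients by order; applied to the most recent samples,
# newest first. Order 2 is "2*current - previous".
FIXED_COEFFS = {
    0: [],
    1: [1],
    2: [2, -1],
    3: [3, -3, 1],
    4: [4, -6, 4, -1],
}


def fixed_encode(samples, order):
    """Return (warm-up samples, residual) for a fixed predictor of `order`."""
    coeffs = FIXED_COEFFS[order]
    warmup = list(samples[:order])
    residual = []
    for n in range(order, len(samples)):
        prediction = sum(c * samples[n - 1 - j] for j, c in enumerate(coeffs))
        residual.append(samples[n] - prediction)
    return warmup, residual


def fixed_decode_add_only(warmup, residual, order):
    """Rebuild the samples using only additions and subtractions."""
    out = list(warmup)
    # state[i] holds the most recent i-th order difference of the output
    # (state[0] is the last sample, state[1] the last first difference, ...).
    state = []
    level = list(warmup)
    for _ in range(order):
        state.append(level[-1])
        level = [b - a for a, b in zip(level, level[1:])]
    for e in residual:
        # The residual is the order-th difference; fold it back up the chain.
        carry = e
        for i in range(order - 1, -1, -1):
            state[i] += carry
            carry = state[i]
        out.append(carry)
    return out


ramp = [0, 5, 10, 15, 20]
warmup, residual = fixed_encode(ramp, 2)
restored = fixed_decode_add_only(warmup, residual, 2)
```

A perfectly linear ramp like the one above leaves an all-zero residual at order 2, which is why even these format-defined predictors can compress smooth signals well.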

> Yeah, I'd like to have only verbatim subframes (for encoding the non-silence as PCM) and constant subframes (for encoding silence efficiently using RLE). Would --disable-fixed-subframes be enough for this?

In combination with -l0: yes, nearly. If there are wasted bits then I think those will be encoded separately. Say you store a 22-bit signal as 24-bit WAVE and encode it; then, if I remember correctly, passing these options will force it to store 22 bits as verbatim, plus for each subframe a flag saying "and pad up with (edit:) two zeroes". I don't think the reference encoder can disable that, but ktf has had to correct me a ton of times. (edited for unambiguity, I originally wrote "20", but 20 is one of the pre-defined bit depths in FLAC)
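The wasted-bits idea can be sketched in a few lines of Python. This is only an illustration of the concept with hypothetical helper names, not the encoder's actual code: it counts how many low-order zero bits every sample in a block shares, then strips them so that only the narrower samples would need to be stored.

```python
def wasted_bits(samples):
    """Number of low-order zero bits shared by every sample in the block."""
    combined = 0
    for s in samples:
        combined |= s  # a bit is set here if it is set in any sample
    if combined == 0:
        return 0  # an all-zero block is a constant block anyway
    shift = 0
    while combined & 1 == 0:
        combined >>= 1
        shift += 1
    return shift


def split_wasted_bits(samples):
    """Return (shift, samples with the shared zero bits removed)."""
    shift = wasted_bits(samples)
    return shift, [s >> shift for s in samples]


# A 22-bit signal padded into a 24-bit container: every value is shifted
# left by two, so the two lowest bits are always zero.
signal_22bit = [-5, 1000, 123456, -2097152]
container_24bit = [s << 2 for s in signal_22bit]
shift, narrowed = split_wasted_bits(container_24bit)
```

Here `shift` comes out as 2 and `narrowed` equals the original 22-bit values, matching the "pad up with two zeroes" description above.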

-l0 and --no-mid-side are invoked by the -0 preset, so you can pass -0 --disable-fixed-subframes.
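For batch use, that invocation could be wrapped from Python roughly as follows. This is a sketch under assumptions: the helper names and file paths are hypothetical, the `flac` binary is assumed to be on the PATH, and `--disable-fixed-subframes` is the undocumented testing option mentioned earlier in the thread, so it may change or be removed.

```python
import subprocess


def build_flac_command(input_path, output_path):
    """Command line for an encode restricted to constant/verbatim subframes."""
    return [
        "flac",
        "-0",                         # preset 0 implies -l0 and --no-mid-side
        "--disable-fixed-subframes",  # undocumented testing option
        "-o", output_path,
        input_path,
    ]


def encode_verbatim_only(input_path, output_path):
    """Run the encoder and raise CalledProcessError if it fails."""
    subprocess.run(build_flac_command(input_path, output_path), check=True)


# Hypothetical usage:
# encode_verbatim_only("call.wav", "call.flac")
```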

> Since these are phone call / conference call recordings, the decoded channels contain a lot of silence which wastes network bandwidth.

If "silence" means "digital silence", then CONSTANT will help. If there is even a tiny bit of variability in the LSB, then you won't get much compression with fixed subframes disabled. You are decoding Opus, and that is typically done to 32-bit floating point, which is both a lot of data and a format that FLAC cannot store. As long as you have control over volume (how low are the quietest useful whispers?), you can decimate it down to [suitable number of bits], without dithering. But you likely have to.
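The effect of LSB variability can be shown with a small Python sketch. Everything here is an assumption for illustration: the 16-bit target depth, the block size of 4096 samples, and the helper names. It rounds float samples to integers without dither and then labels each block as "constant" (all samples equal) or "verbatim", which mimics the choice left when LPC and fixed subframes are disabled. It does not reproduce the encoder's actual decision logic.

```python
def float_to_int16(samples):
    """Quantize floats in [-1.0, 1.0] to 16-bit integers by plain rounding."""
    out = []
    for x in samples:
        v = int(round(x * 32767))
        out.append(max(-32768, min(32767, v)))  # clip to the 16-bit range
    return out


def classify_blocks(samples, block_size=4096):
    """Label each block 'constant' if all its samples are equal, else 'verbatim'."""
    kinds = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        is_constant = all(s == block[0] for s in block)
        kinds.append("constant" if is_constant else "verbatim")
    return kinds


# Float noise far below half an LSB disappears when rounding without dither.
near_silence = float_to_int16([0.0, 1e-6] * 4096)
quiet_blocks = classify_blocks(near_silence)

# A single flickering LSB is enough to make every block non-constant.
flicker_blocks = classify_blocks([0, 1] * 4096)
```

With these inputs, both blocks of `quiet_blocks` are constant, while both blocks of `flicker_blocks` are verbatim and would be stored at full size.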

AnterCreeper commented 9 months ago

I am about to generate some test cases, so is there any way to generate some fixed subframes with zero order?

AnterCreeper commented 9 months ago

> There are undocumented testing options for that. You can use -l 0 to disable LPC, and --disable-fixed-subframes to disable fixed subframes. However, the latter is a testing option and may change or be removed at any time.

> Why would you want to do this? FLAC's compression is very efficient and very, very fast. What do you hope to gain by disabling compression?

I hope this testing option could be kept, although it is rarely used.

ktmf01 commented 6 months ago

> I am about to generate some test cases, so is there any way to generate some fixed subframes with zero order?

No, not with the stock encoder.

> I hope this testing option could be kept, although it is rarely used.

Yes, I currently see no reason to remove it. It is rather useful for the test suite.