weltkante / managed-lzma

C# implementation of LZMA and 7zip
MIT License

Get feedback on amount of unflushed data in buffers #20

Closed weltkante closed 7 years ago

weltkante commented 7 years ago

If you have a strict limit on archive size, you need a way to decide whether you can add another file without exceeding that limit. Because the LZMA buffers can be very large, this is hard to predict without help from the library: you have to assume the worst case, which is incompressible files. If the files do in fact compress, the worst-case estimate means closing the archive well before the strict limit is reached.

It would be nice if the library could expose a way to inspect the amount of data in the buffers and give lower and upper bounds on the output length under the assumption that no further input follows.
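To make the request concrete, here is a rough sketch of what such a bounds calculation could look like. Nothing in it exists in the library: `ArchiveSizeEstimate`, `Estimate` and `CanAddFile` are hypothetical names, the three byte counts are assumed to be supplied by the encoder, and `worstCaseSize` is a caller-supplied assumption about how large incompressible input can get after encoding. Container overhead (headers and the like) is ignored.

```csharp
using System;

// Hypothetical helper for illustration; not part of managed-lzma.
public static class ArchiveSizeEstimate
{
    // Bounds on the final output length, assuming no further input follows.
    //   bytesWritten:   bytes already handed to the output sink
    //   bufferedOutput: encoded bytes still held in the encoder
    //   bufferedInput:  input bytes read from the source but not yet encoded
    //   worstCaseSize:  assumed upper bound on the encoded size of n
    //                   incompressible input bytes (must be supplied by the caller)
    public static (long Lower, long Upper) Estimate(
        long bytesWritten, long bufferedOutput, long bufferedInput,
        Func<long, long> worstCaseSize)
    {
        if (bytesWritten < 0 || bufferedOutput < 0 || bufferedInput < 0)
            throw new ArgumentException("byte counts must be non-negative");

        // Output that already exists cannot shrink, so it is a safe lower bound.
        long committed = checked(bytesWritten + bufferedOutput);

        // Upper bound: the buffered input turns out to be incompressible.
        long upper = checked(committed + worstCaseSize(bufferedInput));
        return (committed, upper);
    }

    // Conservative decision: the next file fits only if even its worst case
    // stays within the limit.
    public static bool CanAddFile(
        long upperBound, long fileSize, long limit, Func<long, long> worstCaseSize)
    {
        return checked(upperBound + worstCaseSize(fileSize)) <= limit;
    }
}
```

A caller would compute the bounds after each file and stop adding files once `CanAddFile` returns false. The lower bound is deliberately loose, and the upper bound is only as good as the `worstCaseSize` assumption passed in.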

weltkante commented 7 years ago

After researching this in more depth I've decided to not implement this because it is too complicated. I'm leaving the results of what I found here in case I may come back to it later, or if someone else wants to look into it.

For LZMA you can inspect the following variables to determine (a) how much buffered input has been read from the input source but not yet processed and (b) how much buffered output has not yet been written to the output sink:

// mEncoder is `Master.LZMA.CLzmaEnc` - for example the member variable in `AsyncEncoder`

// the number of (unprocessed) input bytes in the LZMA encoder
var pMatchFinder = mEncoder.mMatchFinderBase;
var pMatchFinderCachedBytes = pMatchFinder.mStreamPos - pMatchFinder.mPos + fetched;

// the number of (unflushed) output bytes in the LZMA encoder
var pRangeCoder = mEncoder.mRC;
var pRangeCoderCachedBytes = pRangeCoder.mBuf - pRangeCoder.mBufBase;

At this level the problem is primarily that the variables are not synchronized. It may be possible to turn all writes to them into volatile writes, but it should first be measured whether volatile writes hurt performance when the feature is not used (i.e. when there are no volatile reads). If they do, it shouldn't be done. Reading the variables without synchronization is of course always available as a last-resort option.
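For reference, a minimal sketch of the volatile write/read idea using `System.Threading.Volatile`. The `BufferCounters` class and its members are hypothetical stand-ins, not the encoder's actual fields. The sketch assumes the encoder thread publishes already-computed byte counts instead of exposing raw positions, so a reader never has to subtract two values that were read at different moments.

```csharp
using System.Threading;

// Hypothetical stand-in for the encoder's counters; not the library's actual type.
public sealed class BufferCounters
{
    private long mUnprocessedInput;
    private long mUnflushedOutput;

    // Called by the encoder thread whenever its buffers change.
    public void Publish(long unprocessedInput, long unflushedOutput)
    {
        Volatile.Write(ref mUnprocessedInput, unprocessedInput);
        Volatile.Write(ref mUnflushedOutput, unflushedOutput);
    }

    // Called by a monitoring thread. Each value is read with a volatile read,
    // but the two values together are not a consistent snapshot: the encoder
    // may publish new values between the two reads.
    public (long UnprocessedInput, long UnflushedOutput) Read()
    {
        long input = Volatile.Read(ref mUnprocessedInput);
        long output = Volatile.Read(ref mUnflushedOutput);
        return (input, output);
    }
}
```

Whether the extra writes cost anything measurable in the encoder's inner loop is exactly what would need to be benchmarked before adopting something like this.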

The research above only covers the LZMA encoder. It can probably be extended to LZMA2 easily, but I have not tried. The real problem is extending it to 7z encoders, which is what the actual use case for this feature required. 7z encoders can be configured in arbitrary graphs, and there is no obvious way to report and interpret the buffers between nodes.