whatwg / compression

Compression Standard
https://compression.spec.whatwg.org/

Get unused data from end of DecompressionStream? #39

Open Tschrock opened 3 years ago

Tschrock commented 3 years ago

I'm not very familiar with streams and compression, but hopefully this is understandable.

For deflate, the spec states "It is an error if there is additional input data after the ADLER32 checksum." For gzip, the spec says "It is an error if there is additional input data after the end of the "member"."

As expected, Chrome's current implementation throws a TypeError ("Junk found after end of compressed data.") when extra data is written to a DecompressionStream.

This error can be caught and ignored, but there doesn't seem to be a way to retrieve the already-written-but-unused "junk" data. There seems to be an assumption here that developers already know the length of the compressed data and can provide exactly that data and nothing more. In reality, this "junk" data can be very important in cases where the compressed data is embedded in another stream and you don't know its compressed length.

A good example of this is Git's packfile format, which only tells you the size of the uncompressed data, not the compressed size. In such a case you must rely on the decompressor to tell you when it's done decompressing, and then handle the remaining data yourself.

My attempt at putting together an example:

// A stream with two compressed items
// deflate("Hello World") + deflate("FooBarBaz")
const data = new Uint8Array([
    0x78, 0x9c, 0xf3, 0x48, 0xcd, 0xc9, 0xc9, 0x57, 0x08, 0xcf, 0x2f, 0xca, 0x49, 0x01, 0x00, 0x18, 0x0b, 0x04, 0x1d,
    0x78, 0x9c, 0x73, 0xcb, 0xcf, 0x77, 0x4a, 0x2c, 0x72, 0x4a, 0xac, 0x02, 0x00, 0x10, 0x3b, 0x03, 0x57,
]);

// Decompress the first item
const item1Stream = new DecompressionStream('deflate');
item1Stream.writable.getWriter().write(data).catch(() => { /* Rejects with a TypeError: Junk found after end of compressed data. */ });
console.log(await item1Stream.readable.getReader().read()); // { value: <bytes of "Hello World">, done: false }

// How do I get the remaining data (the "junk") in order to decompress the second item?
// I've already written it to the previous stream, and there's nothing to tell me how much was used or what's left over.
const item2Stream = new DecompressionStream('deflate');
item2Stream.writable.getWriter().write(getRemainingDataSomehow());
console.log(await item2Stream.readable.getReader().read()); // { value: <bytes of "FooBarBaz">, done: false }

Now, as a workaround, I could write the data to my first stream one byte at a time, saving the most recently written byte and carrying it over when the writer throws that specific exception. But writing one byte at a time feels very inefficient and adds a lot of complexity, and checking for that specific error message seems fragile (it might change, and other implementations might use a different message.)
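
To make that concrete, here's a rough sketch of the workaround. splitFirstDeflateItem is just a name I made up, and it assumes the implementation enqueues all decompressed output before erroring on the first junk byte:

// Rough sketch of the byte-at-a-time workaround (splitFirstDeflateItem is a
// made-up name, not a real API).
async function splitFirstDeflateItem(data) {
  const ds = new DecompressionStream('deflate');
  const writer = ds.writable.getWriter();
  const reader = ds.readable.getReader();

  // Drain the readable concurrently so backpressure never stalls the writes.
  const output = [];
  const drained = (async () => {
    try {
      for (;;) {
        const { value, done } = await reader.read();
        if (done) return;
        output.push(value);
      }
    } catch { /* the readable errors once the junk byte is written */ }
  })();

  // Feed one byte at a time so we know exactly which byte was rejected.
  let used = data.length;
  for (let i = 0; i < data.length; i++) {
    try {
      await writer.write(data.subarray(i, i + 1));
    } catch {
      used = i; // byte i is the first byte of "junk"
      break;
    }
  }
  writer.close().catch(() => {}); // rejects if the stream already errored
  await drained;

  return { output, remaining: data.subarray(used) };
}

With the data above, remaining should come back as the 17 bytes of the second deflate stream, ready to feed into a fresh DecompressionStream. But needing all of this machinery just to find a stream boundary seems wrong.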

Zlib itself provides a way to know which bytes weren't used (though I don't know the details of how.) Python's zlib API provides an unused_data property that contains the unused bytes. Node's zlib API provides a bytesWritten property that can be used to calculate the unused data. It would be great to have something similar available in the DecompressionStream API.
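
For comparison, here's roughly what that looks like in Node, using the data array from my example above. This assumes my reading of the docs is right: bytesWritten counts the input bytes the engine actually consumed, and an Inflate stream ends quietly at the end of the deflate data rather than erroring on the trailing bytes:

// Sketch of the Node approach (not the web API).
const zlib = require('node:zlib');

const inflate = zlib.createInflate();
const out = [];
inflate.on('data', (chunk) => out.push(chunk));
inflate.on('end', () => {
  const used = inflate.bytesWritten;      // input bytes the engine consumed
  const remaining = data.subarray(used);  // the trailing "unused" bytes
  console.log(Buffer.concat(out).toString(), remaining);
});
inflate.end(data);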

ricea commented 3 years ago

This is a difficult problem.

In the special case of the packfile format, we could solve it with a mode which continues decompressing after it reaches the end.

In the general case, something like bytesWritten would work, although it would require the caller to keep the input chunk around until it knew whether it had been completely consumed or not, and keep track of how much data it had injected so far.
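
Purely as a sketch of that bookkeeping (the bytesWritten property and the names here are invented, not a real API):

// Hypothetical: ds.bytesWritten does not exist today.
const ds = new DecompressionStream('deflate');
const writer = ds.writable.getWriter();
let total = 0;         // bytes handed to the writer so far
let lastChunk = null;  // kept around until we know it was fully consumed
try {
  for await (const chunk of source) {  // source: some async iterable of Uint8Arrays
    lastChunk = chunk;
    total += chunk.byteLength;
    await writer.write(chunk);
  }
} catch {
  // Trailing junk: work out how much of the input went unused.
  const consumed = ds.bytesWritten;  // hypothetical property
  const unused = total - consumed;   // assumed to fit inside lastChunk
  const leftover = lastChunk.subarray(lastChunk.byteLength - unused);
}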

saschanaz commented 3 months ago

We also got internal feedback about this API: https://bugzilla.mozilla.org/show_bug.cgi?id=1901316#c3

Could we throw something like a DecompressionError that extends TypeError, with additional fields carrying the data decompressed so far plus the unused data?
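
Something with roughly this shape, say (entirely hypothetical; neither the class nor the field names exist anywhere today):

class DecompressionError extends TypeError {
  decompressedData;  // Uint8Array: output produced before the error
  unusedData;        // Uint8Array: input bytes left over after the end of the stream
}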

ricea commented 3 months ago

I feel like there should be an enum, something like

enum ExcessDataBehaviour {
  "error",
  "discard",
  "decompress",
  "something-else",
};

where "error" is our default behaviour, "discard" calmly gets rid of trailing junk, "decompress" tries to decompress it as if it was a new stream, and "something-else" deals with exotic cases.

My idea for "something-else" is that DecompressionStream could have a second readable stream on it which only returns data if there was some after the end of compressed input.

For example, suppose your input consisted of gzip-compressed data followed by uncompressed data, and you wanted the decompressed output and the trailing data concatenated at the destination. You would do something like

const ds = new DecompressionStream('gzip', { extraData: 'use' }); // hypothetical option
response.body.pipeTo(ds.writable); // feed the whole input, trailing data and all
await ds.readable.pipeTo(destination, { preventClose: true }); // the decompressed part
ds.extraData.pipeTo(destination); // then the trailing bytes, passed through verbatim

What do you think? We could potentially implement the first three options while still bikeshedding the fourth option.

saschanaz commented 3 months ago

Hmm, throwing an exception certainly wouldn't work well with pipes.

I wonder if we could make pipes reusable, allowing the destination to be closed gracefully so that the source side can still be piped elsewhere. That could be more general than something decompression-specific and could also deal with the "exotic" case you mentioned. But then it's not clear what would be done with the partially consumed chunk.