zen-fs / zip

ZenFS Backend for zip files
https://zen-fs.github.io/zip/
MIT License

Streaming zip data #3

Open zardoy opened 2 months ago

zardoy commented 2 months ago

Issue: I have a 32 GB ZIP archive, and I don't think I should have to read the entire file and load it into RAM just to list the files within it.

AFAIR, with the modern file reader API it is possible to read a file in chunks. Would it be possible to integrate something like that here? Is there a right way to work with large ZIPs efficiently?

james-pre commented 2 months ago

This could be possible with IndexFS, which ZipFS extends. The file reader API is meant for use with the File System Access API, so I don't know how that would work with ZipFS. Right now, ZipFS loads the entire file into memory.

The zip file is laid out like this:

file 1
...
file n
archive decryption header
archive extra data record
central directory header 1
...
central directory header n
zip64 end of central directory record
zip64 end of central directory locator
end of central directory record

This means that it would be difficult to stream since you would need to start from the end of the zip data.
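
To illustrate, here is a minimal sketch (not ZipFS code) of how the end of central directory record has to be located: by scanning backwards from the end of the buffer for its signature (0x06054b50), since a variable-length comment may follow it:

function findEndOfCentralDirectory(data: Uint8Array): number {
    const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
    // The record is at least 22 bytes, and up to 65535 bytes of comment may follow it
    for (let offset = data.length - 22; offset >= 0; offset--) {
        if (view.getUint32(offset, true) === 0x06054b50) return offset;
    }
    throw new Error('End of central directory record not found');
}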

james-pre commented 2 months ago

@zardoy,

Today I overhauled the internals for processing zip files (check out v0.3.0) and found some interesting things in the zip spec. I've gained a much better understanding of how zip files work. Section 4.3.5 of the spec caught my eye since it mentions streaming.

My thoughts on how streaming can be implemented:

Since the zip "header"/"end of central directory" is located at the end of the file, some metadata will not be known until the entire file has been streamed. However, the LocalFileHeader (which comes almost immediately before the file data) contains the same metadata as the FileEntry (which occurs after all the files). This metadata may be enough to load and decompress an entire file.

FileEntry.data is what actually parses the file data, which primarily uses values from LocalFileHeader. Adapting FileEntry.data to LocalFileHeader is easy enough:

public get data(): Uint8Array {
    // The compressed data begins immediately after the header, at `_offset + size`
    const data = this.zipData.slice(this._offset + this.size);
    // Look up the decompression function for this entry's compression method
    const decompress = decompressionMethods[this.compressionMethod];
    // decompress validation check not included for readability
    return decompress(data, this.compressedSize, this.uncompressedSize, this.flag);
}

All that needs to be done is to pass the zip data buffer to LocalFileHeader (since it is not included right now), and to get the offset of this (i.e. the current local file header) within that buffer.
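
For illustration, a hypothetical data getter on LocalFileHeader might look like this. This is only a sketch: it assumes zipData and _offset are passed in, which is not the case today.

public get data(): Uint8Array {
    // The compressed file data begins immediately after this local file header
    const start = this._offset + this.size;
    const data = this.zipData.slice(start, start + this.compressedSize);
    const decompress = decompressionMethods[this.compressionMethod];
    // decompress validation check not included for readability
    return decompress(data, this.compressedSize, this.uncompressedSize, this.flag);
}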

Note that even if streaming is possible, you still wouldn't be able to use ZipFS without loading everything into memory (since the entire buffer must be passed to the FS).

This is still in the early stages, but adding streaming support is workable. I hope this has helped.

- JP

zardoy commented 2 months ago

This is still in the early stages, but adding streaming support is workable

Wow, great news, thanks! What do you think of adding support for a File option in the ZIP backend options? It has .slice() support for retrieving data in chunks...
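
For example (a sketch of the idea, not an existing backend API), the tail of a large file can be read without loading the whole thing:

// Read the last 64 KiB of a File, e.g. to locate the end of central directory
async function readTail(file: File, length = 64 * 1024): Promise<Uint8Array> {
    const blob = file.slice(Math.max(0, file.size - length));
    return new Uint8Array(await blob.arrayBuffer());
}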

james-pre commented 2 months ago

File.slice is inherited from Blob.slice. From what I can tell, it works off of the Blob which is already in memory. This would mean your zip file would already be loaded into memory.

ZipFS doesn't copy the buffers, though it does copy all of the data when parsing a struct (from the buffer to the members).

Perhaps it would be possible to access the members on the view directly, though that could get complicated since the struct decorators would need to intercept get/set calls. Feel free to look at utilium!Struct and utilium!struct.
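
To sketch the idea (this is not utilium's actual API, just an illustration of reading struct members lazily from the view instead of copying them at parse time):

class LazyLocalFileHeader {
    public constructor(protected view: DataView, protected offset: number) {}

    // Field offsets follow the zip spec's local file header layout
    public get compressionMethod(): number {
        return this.view.getUint16(this.offset + 8, true);
    }

    public get compressedSize(): number {
        return this.view.getUint32(this.offset + 18, true);
    }

    public get uncompressedSize(): number {
        return this.view.getUint32(this.offset + 22, true);
    }
}

With something like this, nothing is copied until a member is actually read.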

zardoy commented 1 month ago

File.slice is inherited from Blob.slice. From what I can tell, it works off of the Blob which is already in memory. This would mean your zip file would already be loaded into memory.

From my point of view, I can tell that loading a 1 GB zip with file.arrayBuffer() obviously takes 1 GB of memory (while calling file.slice() reads the requested range directly from the FS without fully loading the file into memory). And then calling configure with the ZIP backend makes it use 3 GB of RAM (sometimes it doesn't go down even after a reload). I'm really not sure why it goes so high, but if .slice can be used or any other optimizations can be made, I definitely need them! (Right now, because of this RAM usage, I can't use this module on iOS at all.)

james-pre commented 1 month ago

I'm currently working on releasing core 0.11 and the Emscripten backend. After that, I would be happy to address the ZIP backend. Hopefully that does not delay your project too much.

... calling file.slice() reads the requested range directly from the FS without fully loading the file into memory

This actually makes a lot of sense. I apologize if I mistakenly thought the entire blob was preloaded.

I'm really not sure why it goes so high, but if .slice can be used or any other optimizations can be made, I definitely need them!

I will see what I can do, though processing ZIP files is already convoluted, so I'm not sure what other optimizations I can make.