Limited seeking backwards when reading a ZipArchive

XLPhere commented 11 months ago

This is about the fact that ZipArchive requires the reader to implement seeking, which makes sense in many cases, but seems to have the following issue:

In the case where I'm trying to extract a file from a stream of a zip-archive which does not support seeking, i would need to store everything we have read so far, so i can support seeking backwards. For larger files this can get rather impractical when it comes to memory usage.

Is there a way to extract a file from an archive without seeking backwards or with limited seeking backwards (maximum number of bytes we can go back)?

Pr0methean commented 6 months ago

Not without buffering the whole compressed file, since the compression is independent per file but isn't random-access.

0xCCF4 commented 4 months ago

I'm currently working on a project and came over the same problem. I've a nested zip file within a compressed stream that does not support seeking. I can't store the whole compressed stream in RAM, since the files I am dealing with are too big. Then I came across the https://github.com/zip-rs/zip2/blob/27c7fa4cd408bb4cc1364cf599942883371a27fa/src/read.rs#L1594-L1610 function that accepts a non-seekable stream as input. The problem was that I am dealing with ZIP files that have a data descriptor set. As @Pr0methean mentioned safe way would be to buffer the whole compressed file since the actual size can be found in the data descriptor after the compressed file's content.

My next idea was to do just that: buffer the stream content, providing a seekable reader and search for the first zip entry. After that the buffer can be emptied and the next entry can be searched. I implemented this (without the buffering part) in #197 . See https://github.com/0xCCF4/zip2/blob/59f1327e72390813a422e1c7486818da536e9ad7/src/read.rs#L1695-L1728

Then I had the idea, maybe one could just start "proactively" decompressing the content while looking ahead for the data descriptor and if found stop streaming the content. This would have the risk of streaming garbage data when the data descriptor can not be found and reading over the file boundary; but would not require buffering the compressed file data/stream data.

Pr0methean commented 4 months ago

Try using read_zipfile_from_stream().

On Fri, Jun 21, 2024, 07:21 0xCCF4 @.***> wrote:

I'm currently working on a project and came over the same problem. I've a nested zip file within a compressed stream that does not support seeking. I can't store the whole compressed stream in RAM, since the files I am dealing with are too big. Then I came across the

https://github.com/zip-rs/zip2/blob/27c7fa4cd408bb4cc1364cf599942883371a27fa/src/read.rs#L1594-L1610 function that accepts a non-seekable stream as input. The problem was that I am dealing with ZIP files that have a data descriptor set. As @Pr0methean https://github.com/Pr0methean mentioned safe way would be to buffer the whole compressed file since the actual size can be found in the data descriptor after the compressed file's content.

My next idea was to do just that: buffer the stream content, providing a seekable reader and search for the first zip entry. After that the buffer can be emptied and the next entry can be searched. I implemented this (without the buffering part) in #197 https://github.com/zip-rs/zip2/pull/197 . See https://github.com/0xCCF4/zip2/blob/59f1327e72390813a422e1c7486818da536e9ad7/src/read.rs#L1695-L1728

Then I had the idea, maybe one could just start "proactively" decompressing the content while looking ahead for the data descriptor and if found stop streaming the content. This would have the risk of streaming garbage data when the data descriptor can not be found and reading over the file boundary; but would not require buffering the compressed file data/stream data.

— Reply to this email directly, view it on GitHub https://github.com/zip-rs/zip2/issues/162#issuecomment-2182850851, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF3NBKKULH455HN45UMJWDZIQZGNAVCNFSM6AAAAABJWBLVYCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBSHA2TAOBVGE . You are receiving this because you were mentioned.Message ID: @.***>

0xCCF4 commented 4 months ago

Yes that is what I tried but this function (also used by the unstable::ZipStreamReader) does not support data descriptors and will fail to read zip files using them (see attached test below).

read_zipfile_from_stream is relying on the zip local entry block before the compressed file contents to figure out the size of the following data. That's fine but inherently does not work if the size is not given in the header but only later in zip data descriptor for the file.

I am therefore working at a change to the unstable::ZipStreamReader to also support those zip files.

#[test]
fn test_stream() {
    let mut v = Vec::new();
    v.extend_from_slice(include_bytes!("data/data_descriptor.zip"));
    let stream = Box::new(io::Cursor::new(v)) as Box<dyn Read>;

    let reader = ZipStreamReader::new(stream);

    struct TestVisitor {}

    impl ZipStreamVisitor for TestVisitor {
        fn visit_file(&mut self, file: &mut ZipFile<'_>) -> ZipResult<()> {
            println!("File: {}", file.name());
            let mut buffer = Vec::new();
            file.read_to_end(&mut buffer)?;
            match String::from_utf8(buffer) {
                Ok(text) => println!(" > {}", text),
                Err(_) => {},
            }
            Ok(())
        }

        fn visit_additional_metadata(&mut self, metadata: &ZipStreamFileMetadata) -> ZipResult<()> {
            println!("Metdata: {:?}", metadata);
            Ok(())
        }
    }

    let mut visitor = TestVisitor {};
    reader.visit(&mut visitor).expect("error");
}

0xCCF4 commented 4 months ago

I probably need some more days until I can finish the pull request #197 that introduces this feature.

0xCCF4 commented 4 months ago

I finished the first revision, and I am looking forward for a review. :slightly_smiling_face:

zip-rs / zip2

Limited seeking backwards when reading a ZipArchive #162