zip-rs / zip2

Zip implementation in Rust

Regression: opening large zip files is slow since 2.1.4 because the entire file is scanned #231

Open ttencate opened 1 month ago

ttencate commented 1 month ago

Describe the bug

I have a 266 MB zip file, from which I only need to extract a 1 kB file. The rest of the files in the archive are irrelevant at this stage in the program.

However, opening the zip file using ZipArchive::new(file) takes about 7 seconds. It's a lot faster the second time round, because of Linux's filesystem cache.

I traced the root cause to Zip32CentralDirectoryEnd::find_and_parse, which locates the "end of central directory record" very quickly at the end of the file, but then keeps scanning backwards through the entire file to find another one.
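For comparison, here is a minimal sketch of a bounded backward search. It relies on the ZIP format's layout: the end of central directory record has a 22-byte fixed part followed by a comment of at most 65535 bytes, so the record has to start within the last 65557 bytes of the file. The name find_eocd_bounded is hypothetical, and this is not the crate's implementation. It ignores Zip64 and archives with trailing data beyond what the comment field allows, which may be cases the full scan is meant to cover.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Signature of the end of central directory (EOCD) record: "PK\x05\x06".
const EOCD_SIGNATURE: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
/// Size of the fixed part of the record, excluding the trailing comment.
const EOCD_MIN_SIZE: u64 = 22;
/// The comment length field is a u16, so a comment is at most 65535 bytes.
const MAX_COMMENT_SIZE: u64 = u16::MAX as u64;

/// Hypothetical helper: returns the offset of the last plausible EOCD
/// record, reading at most the final 22 + 65535 bytes of the stream.
pub fn find_eocd_bounded<R: Read + Seek>(reader: &mut R) -> io::Result<Option<u64>> {
    let file_len = reader.seek(SeekFrom::End(0))?;
    if file_len < EOCD_MIN_SIZE {
        return Ok(None);
    }

    // Read only the tail of the file in which the record can start.
    let window = file_len.min(EOCD_MIN_SIZE + MAX_COMMENT_SIZE);
    let window_start = file_len - window;
    reader.seek(SeekFrom::Start(window_start))?;
    let mut tail = vec![0u8; window as usize];
    reader.read_exact(&mut tail)?;

    // Scan backwards from the last position where 22 bytes still fit.
    let last_candidate = tail.len() - EOCD_MIN_SIZE as usize;
    for pos in (0..=last_candidate).rev() {
        if tail[pos..].starts_with(&EOCD_SIGNATURE) {
            // The comment length is stored in the last two bytes of the
            // fixed part; reject candidates whose comment would not fit.
            let comment_len = u16::from_le_bytes([tail[pos + 20], tail[pos + 21]]) as usize;
            if pos + EOCD_MIN_SIZE as usize + comment_len <= tail.len() {
                return Ok(Some(window_start + pos as u64));
            }
        }
    }
    Ok(None)
}
```

With a search like this, the amount of data read while locating the record is capped at roughly 64 KiB regardless of archive size, which is consistent with the unzip timings shown under "Expected behavior".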

To Reproduce

Have a large zip file:

$ ls -lh archive.zip
-rw-r--r-- 1 thomas thomas 266M Aug  8 12:20 archive.zip
$ cargo build --release
$ echo 3 | sudo tee /proc/sys/vm/drop_caches  # Flush filesystem cache (Linux only)
$ time target/release/repro
real    0m6.714s
user    0m0.560s
sys     0m1.293s

Use this as the main program:

fn main() {
    let file = std::fs::File::open("archive.zip").unwrap();
    // Opening the archive is the slow step; no entry is read afterwards.
    let _archive = zip::ZipArchive::new(file).unwrap();
}

Expected behavior

Extracting a single 1 kB file from a large archive should be possible quickly. unzip can do it:

$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time unzip -l archive.zip
Archive:  archive.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
...
---------                     -------
1228949561                     9 files

real    0m0.012s
user    0m0.005s
sys     0m0.000s
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time unzip archive.zip some_file.txt
Archive:  archive.zip
  inflating: some_file.txt           

real    0m0.012s
user    0m0.000s
sys     0m0.005s

Version

zip 2.1.6. This is also happening in 2.1.4, but not in 2.1.3. I think cb2d7abde7863a4ce01dbac5b3b48b4006e60599 or 9bf914d7d41842b381d303becf5364b5b2b8c1f2 is the cause, but I haven't dug deeper.

RisaI commented 1 month ago

Can also confirm the regression. In our case, the difference is extreme (by an order of magnitude).

electimon commented 1 month ago

Also encountering this when extracting a single tiny file from each of many small zip files (more than 9000 of them). I thought I was going crazy. In my case, llseek seems to be taking up a lot of CPU time.
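One way to check where the seeks come from without a system-call tracer is to wrap the reader and count calls. The sketch below is a diagnostic aid only; CountingReader and its fields are hypothetical names and are not part of the zip crate.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical diagnostic wrapper that counts I/O calls on the inner reader.
pub struct CountingReader<R> {
    inner: R,
    /// Number of seek calls forwarded to the inner reader.
    pub seeks: u64,
    /// Number of successful read calls.
    pub reads: u64,
    /// Total number of bytes returned by read calls.
    pub bytes_read: u64,
}

impl<R> CountingReader<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, seeks: 0, reads: 0, bytes_read: 0 }
    }
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.reads += 1;
        self.bytes_read += n as u64;
        Ok(n)
    }
}

impl<R: Seek> Seek for CountingReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        self.seeks += 1;
        self.inner.seek(pos)
    }
}
```

Since &mut R implements Read and Seek whenever R does, it should be possible to pass &mut counting to zip::ZipArchive::new, drop the archive, and then compare seeks and bytes_read between 2.1.3 and a later version.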

newinnovations commented 1 month ago

I can also confirm. Extracting a 109 KB file from a 200 MB archive:

In 2.1.3:

       extract (zip)    0.2 ms (109 KB)

In 2.1.6:

       extract (zip)  675.5 ms (109 KB)

newinnovations commented 3 weeks ago

In 2.2.0:

       extract (zip)  683.4 ms (109 KB)