Parallel extraction - Githubissues

horahh commented 1 year ago

I was looking if the library supports any multithreading, would be cool but did not see anything related, has there been any though of that? particularly thinking in something like par_iter to iterate over zip contents

NobodyXu commented 1 year ago

It definitely can be done, but there's one catch: Since the reader is shared among all zip entries, calling seek would also affect other entries reading from reader.

The only ergonomic solution I can think of is

either for zip to require read_at for reader, which is already implemented for File
or requires the reader to implement Clone by storing File to be stored inside an Arc and keep track of the curent location of cursor in the reader
or requires reader to implement TryClone: fn try_clone(&self) -> io::Result<Self> though this makes it harder to pass a trait object

horahh commented 1 year ago

also, wondering is there any reason the implementation to traverse the files in the zip file is done with a count rather than iterator?

NobodyXu commented 1 year ago

It definitely can implement the Iterator, it's probably that the maintainers do not have time to add that themselves.

cosmicexplorer commented 1 year ago

I want to note that with the merge technique from zip-rs/zip-old#401 it becomes possible to parallelize zip creation as well as extraction. I prototyped this in a separate library medusa-zip but if parallel zip creation is of general enough utility I would also love to contribute it to this library as well!

cosmicexplorer commented 1 year ago

The medusa-zip prototype library is async, but I assume we would probably want to convert the parallel zip technique into idiomatic blocking rust for this library of course.

cosmicexplorer commented 1 year ago

@NobodyXu:

It definitely can be done, but there's one catch: Since the reader is shared among all zip entries, calling seek would also affect other entries reading from reader.

Thanks for this explanation! Could we also perhaps employ parallelism by pipelining the decompression/decryption of each entry though and only take a single pass over the file?

NobodyXu commented 1 year ago

Could we also perhaps employ parallelism by pipelining the decompression/decryption of each entry though

I remember that decompression/decryption is done lazily, but you would need to read the source of ask @Plecra or @zamazan4ik for this.

only take a single pass over the file?

Parsing zip file requires reading head and tail of the file, it can be done in a streaming way which the entire file is read only once, but it's only useful when streaming a http request and unzip it.

If you have the zip stores on fs, then seeking it is quite cheap and should be ok, though iterating once is better for caching.

cosmicexplorer commented 1 year ago

I will try to make a benchmark to see if the idea works!

cosmicexplorer commented 1 year ago

I have demonstrated one approach to improve extraction performance with rayon threadpools in zip-rs/zip-old#407, but it has a few caveats and for multiple reasons I am going to try a separate branch that adds the async-executor crate in order to convert some of this blocking work into async tasks.

cosmicexplorer commented 1 year ago

It took a lot of iteration, but I produced a prototype of an async API for ZipArchive, including an async extract() method. This was able to produce a performance improvement over sync extraction without needing to assume file handles are clonable or any of the shortcuts taken in zip-rs/zip-old#408. I'm convinced this is the right way to go for the library now, but I'm going to extract some of the code I had to write to make zip-rs/zip-old#409 work into a separate crate before proposing this as a real change (see TODO section in zip-rs/zip-old#409).

a1phyr commented 7 months ago

With the help of crate sync_file, it is easy to use rayon to do parallel extraction.

With a 600 MB archive with ~3000 files, I get the following result:

Using ZipArchive::extract:

$ time ./target/release/zip_data test.zip zip
real    0m0.994s
user    0m0.543s
sys 0m0.451s

Using rayon + sync_file in a function adapted from ZipArchive::extract:

$ time ./target/release/zip_data test.zip zip_par
real    0m0.255s
user    0m0.753s
sys 0m0.773s

Which is about 4x faster on my i3-10300T (4 core - 8 threads)

The parallel extract function

```rust fn extract_zip_par( archive: zip::ZipArchive, directory: &Path, ) -> zip::result::ZipResult<()> { (0..archive.len()) .into_par_iter() .try_for_each_with(archive, |archive, i| { let mut file = archive.by_index(i)?; let filepath = file .enclosed_name() .ok_or_else(|| zip::result::ZipError::InvalidArchive("Invalid file path"))?; let outpath = directory.join(filepath); if file.name().ends_with('/') { fs::create_dir_all(&outpath)?; } else { if let Some(p) = outpath.parent() { if !p.exists() { fs::create_dir_all(p)?; } } let mut outfile = fs::File::create(&outpath)?; io::copy(&mut file, &mut outfile)?; } // Get and Set permissions #[cfg(unix)] { use std::os::unix::fs::PermissionsExt; if let Some(mode) = file.unix_mode() { fs::set_permissions(&outpath, fs::Permissions::from_mode(mode))?; } } Ok::<_, zip::result::ZipError>(()) })?; Ok(()) } ```

zip-rs / zip2

Parallel extraction #165