horahh opened this issue 1 year ago:

I was looking to see whether the library supports any multithreading; it would be cool, but I did not see anything related. Has there been any thought about that? I am particularly thinking of something like `par_iter` to iterate over the zip contents.
It definitely can be done, but there's one catch: since the reader is shared among all zip entries, calling `seek` would also affect other entries reading from `reader`.

The only ergonomic solutions I can think of are:

- `zip` requiring `read_at` for `reader`, which is already implemented for `File` (see the sketch after this list)
- `reader` implementing `Clone` by storing the `File` inside an `Arc` and keeping track of the current location of the cursor in the reader
- a `TryClone` trait with `fn try_clone(&self) -> io::Result<Self>`, though that makes it harder to pass a trait object
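To illustrate the first option, here is a minimal sketch of positioned reads. It assumes a Unix platform (where `std::os::unix::fs::FileExt` provides `read_exact_at`), and the entry offsets are made up for the example; a real implementation would take them from the central directory.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // provides read_at / read_exact_at on Unix
use std::sync::Arc;
use std::thread;

fn main() -> io::Result<()> {
    let file = Arc::new(File::open("test.zip")?);
    // Hypothetical (offset, length) spans of two entries' compressed data.
    let spans = [(0u64, 4096usize), (8192u64, 4096usize)];
    let handles: Vec<_> = spans
        .iter()
        .map(|&(offset, len)| {
            let file = Arc::clone(&file);
            thread::spawn(move || -> io::Result<Vec<u8>> {
                let mut buf = vec![0u8; len];
                // read_exact_at takes &self and an absolute offset, so it
                // never touches a shared seek position.
                file.read_exact_at(&mut buf, offset)?;
                Ok(buf)
            })
        })
        .collect();
    for handle in handles {
        let _compressed = handle.join().expect("worker panicked")?;
    }
    Ok(())
}
```

Because positioned reads never move the file cursor, the threads need no locking around the reads themselves.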
Also, I'm wondering: is there any reason the traversal of the files in the zip file is implemented with a count rather than an iterator?
It definitely can implement `Iterator`; it's probably just that the maintainers haven't had time to add that themselves.
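For anyone following along, this is roughly what the count-based traversal looks like with the crate's existing `len()`/`by_index()` API, plus my reading of why a plain `Iterator` is awkward here:

```rust
use std::fs::File;
use zip::ZipArchive;

fn list_entries(path: &str) -> zip::result::ZipResult<()> {
    let mut archive = ZipArchive::new(File::open(path)?)?;
    // by_index takes &mut self and returns a ZipFile that mutably borrows
    // the archive, so a std Iterator (all of whose items would have to
    // borrow the same archive) would need to be a lending iterator; the
    // count-based loop sidesteps that.
    for i in 0..archive.len() {
        let entry = archive.by_index(i)?;
        println!("{} ({} bytes)", entry.name(), entry.size());
    }
    Ok(())
}
```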
I want to note that with the merge technique from zip-rs/zip-old#401 it becomes possible to parallelize zip creation as well as extraction. I prototyped this in a separate library, `medusa-zip`, but if parallel zip creation is of general enough utility, I would also love to contribute it to this library!

The `medusa-zip` prototype library is async, but I assume we would want to convert the parallel zip technique into idiomatic blocking Rust for this library, of course.
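To make the merge technique concrete, here is a rough sketch under some assumptions: each part is a small archive built independently (e.g. on its own thread), and the `zip` version in use exposes `ZipWriter::merge_archive` (the API from the PR referenced above). `merge_parts` is an illustrative name, not code from either library.

```rust
use std::io::Cursor;
use zip::{ZipArchive, ZipWriter};

fn merge_parts(parts: Vec<Vec<u8>>) -> zip::result::ZipResult<Vec<u8>> {
    let mut writer = ZipWriter::new(Cursor::new(Vec::new()));
    for part in parts {
        // merge_archive splices an already-built archive's entries into
        // the output without recompressing them, so the expensive
        // compression work can happen in parallel beforehand.
        writer.merge_archive(ZipArchive::new(Cursor::new(part))?)?;
    }
    Ok(writer.finish()?.into_inner())
}
```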
@NobodyXu:

> It definitely can be done, but there's one catch: since the reader is shared among all zip entries, calling `seek` would also affect other entries reading from `reader`.

Thanks for this explanation! Could we perhaps also employ parallelism by pipelining the decompression/decryption of each entry, and only take a single pass over the file?
> Could we also perhaps employ parallelism by pipelining the decompression/decryption of each entry though

I remember that decompression/decryption is done lazily, but you would need to read the source or ask @Plecra or @zamazan4ik to confirm this.

> only take a single pass over the file?

Parsing a zip file requires reading the head and the tail of the file. It can be done in a streaming way, in which the entire file is read only once, but that is only useful when streaming an HTTP request and unzipping it on the fly. If you have the zip stored on a filesystem, then seeking is quite cheap and should be fine, though iterating over it once is better for caching.
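For the streaming case, the crate exposes `zip::read::read_zipfile_from_stream`, which walks the local file headers in a single forward pass and needs only `Read`, not `Seek`. A minimal sketch (`extract_stream` is an illustrative name; note that streaming mode never consults the central directory, so some metadata is unavailable):

```rust
use std::io::Read;

fn extract_stream<R: Read>(mut reader: R) -> zip::result::ZipResult<()> {
    // Each call parses the next local header and yields a readable entry;
    // None means the entries are exhausted.
    while let Some(mut entry) = zip::read::read_zipfile_from_stream(&mut reader)? {
        let mut contents = Vec::new();
        entry.read_to_end(&mut contents)?;
        println!("{} ({} bytes)", entry.name(), contents.len());
    }
    Ok(())
}
```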
I will try to make a benchmark to see if the idea works!
I have demonstrated one approach to improving extraction performance with rayon threadpools in zip-rs/zip-old#407, but it has a few caveats, and for multiple reasons I am going to try a separate branch that adds the `async-executor` crate in order to convert some of this blocking work into async tasks.
It took a lot of iteration, but I produced a prototype of an async API for `ZipArchive`, including an async `extract()` method. This was able to produce a performance improvement over sync extraction without needing to assume file handles are clonable or to rely on any of the shortcuts taken in zip-rs/zip-old#408. I'm convinced this is the right way to go for the library now, but I'm going to extract some of the code I had to write to make zip-rs/zip-old#409 work into a separate crate before proposing this as a real change (see the TODO section in zip-rs/zip-old#409).
With the help of the `sync_file` crate, it is easy to use `rayon` to do parallel extraction. With a 600 MB archive containing ~3000 files, I get the following results.

Using `ZipArchive::extract`:
```
$ time ./target/release/zip_data test.zip zip

real    0m0.994s
user    0m0.543s
sys     0m0.451s
```
Using `rayon` + `sync_file` in a function adapted from `ZipArchive::extract`:
```
$ time ./target/release/zip_data test.zip zip_par

real    0m0.255s
user    0m0.753s
sys     0m0.773s
```
That is about 4x faster on my i3-10300T (4 cores / 8 threads).
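For reference, a rough sketch of what such an adapted function can look like, assuming `sync_file::SyncFile` (clones share the underlying file but keep independent cursors) and a `zip` version where `ZipArchive` is `Clone`; `extract_par` is an illustrative name, not the benchmarked code itself:

```rust
use rayon::prelude::*;
use std::io::Read;
use sync_file::SyncFile;
use zip::ZipArchive;

fn extract_par(path: &str) -> zip::result::ZipResult<()> {
    // SyncFile implements Read + Seek and is cheap to clone.
    let archive = ZipArchive::new(SyncFile::open(path)?)?;
    (0..archive.len()).into_par_iter().try_for_each(|i| {
        // Cloning the archive shares the parsed central directory while
        // giving each rayon task an independent reader cursor.
        let mut archive = archive.clone();
        let mut entry = archive.by_index(i)?;
        let mut contents = Vec::new();
        entry.read_to_end(&mut contents)?;
        // A real extractor would sanitize entry.name() and write to disk.
        Ok(())
    })
}
```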