zip-rs / zip-old

Zip implementation in Rust
MIT License

performance isn't great #22

Closed ilyail3 closed 7 years ago

ilyail3 commented 7 years ago

Hi, I tried to use this project to extract a billing CSV file.

I tried a 780 MB file (compressed size); the uncompressed size was 13 GB.

When I tried to walk over the lines in the file using a buffer, I got rather bad throughput overall, so I tried to benchmark just the unzip process.

zcat {filename} > /dev/null finished in ~52 seconds. Uncompressing and copying with std::io::copy (also to /dev/null) took more than 40 minutes, until I got annoyed with the CPU fan noise and shut it down.
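For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of timing a full drain of a reader. The helper name drain_timed is hypothetical and not part of the zip crate. It accepts any std::io::Read, so a ZipFile entry could be passed in the same way as the in-memory reader used for testing, and it writes to io::sink() instead of an opened /dev/null file.

```rust
use std::io::{self, Read};
use std::time::{Duration, Instant};

/// Hypothetical helper: reads `reader` to the end, discards the bytes,
/// and reports how many bytes were read and how long it took.
fn drain_timed<R: Read>(reader: &mut R) -> io::Result<(u64, Duration)> {
    let start = Instant::now();
    // io::sink() throws away everything written to it, much like /dev/null.
    let bytes = io::copy(reader, &mut io::sink())?;
    Ok((bytes, start.elapsed()))
}
```

Dividing the byte count by the elapsed seconds gives a throughput figure that can be compared between a debug build and a release build of the same program.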

Is there any configuration/version you think will make a difference?

mvdnes commented 7 years ago

Have you compiled the program in release mode?

Off the top of my head, it could be that the CRC32 checksum causes the overhead. Could you show me your code and a way to generate such a zip so that I could verify it?

ilyail3 commented 7 years ago

I haven't built the binaries myself; I downloaded it using cargo (version 0.2, as the readme file says).

About the code: I'm out of the office right now, but it's 99% the same as the extract example (https://github.com/mvdnes/zip-rs/blob/master/examples/extract.rs). When the script wasn't performing as expected, I commented everything out and just io::copy'd into an open /dev/null file.

I can't really share the original zip file since it has financial data of our clients, so I'll work on generating an example. I don't think the file is anything special, just a text file with a high compression ratio. One thing I noticed is that the CRC in the zip file is missing or invalid. But as far as I know it's only compared at the end of decompression, so it shouldn't have made a difference to decompression speed.
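To illustrate the CRC point, here is a rough sketch of a checking reader. It is an assumption about how such a wrapper could look, not the crate's actual code, and the names ChecksumReader and crc32_update are hypothetical. The comparison happens once at end of stream, but the running checksum is still updated for every byte read, so that per-byte work is paid during decompression. It would be especially slow in an unoptimized build.

```rust
use std::io::{self, Read};

/// Bitwise CRC-32 (reflected polynomial 0xEDB88320). Slow but dependency-free.
fn crc32_update(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    crc
}

/// Hypothetical wrapper: updates the checksum on every read and
/// compares it with `expected` only when the inner reader hits EOF.
struct ChecksumReader<R> {
    inner: R,
    state: u32,
    expected: u32,
}

impl<R: Read> ChecksumReader<R> {
    fn new(inner: R, expected: u32) -> Self {
        ChecksumReader { inner, state: 0xFFFF_FFFF, expected }
    }
}

impl<R: Read> Read for ChecksumReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        if n == 0 && !buf.is_empty() {
            // End of stream: finalize the CRC and compare once.
            if (self.state ^ 0xFFFF_FFFF) != self.expected {
                return Err(io::Error::new(io::ErrorKind::InvalidData, "checksum mismatch"));
            }
            return Ok(0);
        }
        // Per-byte work happens here, on every read call.
        self.state = crc32_update(self.state, &buf[..n]);
        Ok(n)
    }
}
```

Wrapping the bytes of "123456789" with the expected value 0xCBF43926 (the standard CRC-32 check value) reads through cleanly, while any other expected value produces an error on the final read.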

ilyail3 commented 7 years ago

About the build type: unfortunately I'm new to Rust, so I'm not sure if I built it in release mode, but the command was: cargo run --bin {progname}

mvdnes commented 7 years ago

Can you try to run it with cargo run --release --bin {progname} ?

ilyail3 commented 7 years ago

Here's the code:

use std::fs;
use std::io;

fn file_reader() -> i32 {
    let args: Vec<_> = std::env::args().collect();
    if args.len() < 3 {
        println!("Usage: {} reader <filename>({})", args[0], args.len());
        return 1;
    }

    // Per the usage string, args[1] is "reader" and args[2] is the zip file.
    let fname = std::path::Path::new(&args[2]);
    let file = fs::File::open(fname).unwrap();

    // Only the first entry in the archive is read.
    let mut archive = zip::ZipArchive::new(file).unwrap();
    let mut arch_file = archive.by_index(0).unwrap();

    let ctx = zmq::Context::new();
    let mut headers = ctx.socket(zmq::PUB).unwrap();
    headers.connect("ipc:///tmp/headers").unwrap();
    let mut lines = ctx.socket(zmq::PUSH).unwrap();
    lines.bind("ipc:///tmp/lines").unwrap();

    read_lines(&mut arch_file, &mut headers, &mut lines);

    0
}

// The line processing is commented out for the benchmark: the entry is only
// decompressed and copied to /dev/null, so both sockets are unused here.
fn read_lines(file: &mut zip::read::ZipFile, _headers: &mut zmq::Socket, _lines: &mut zmq::Socket) {
    let mut outfile = fs::File::create("/dev/null").unwrap();
    io::copy(file, &mut outfile).unwrap();
}

ilyail3 commented 7 years ago

--release made a huge difference. The file was decompressed in 81.6 seconds.

mvdnes commented 7 years ago

Programs written in Rust can have a pretty big performance gain by enabling optimizations.
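If you are unsure which kind of build you are running, one rough check is the debug_assertions cfg flag, which is on by default in Cargo's dev profile and off by default with --release. The sketch below assumes those default profile settings; a project that overrides debug-assertions in its Cargo.toml would make the answer misleading. The function name build_profile is hypothetical.

```rust
/// Hypothetical helper: guesses the build profile from `debug_assertions`.
/// Assumes default Cargo profiles (dev enables the flag, release disables it).
fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}
```

Printing this value once at startup, for example with println!("build: {}", build_profile()), makes it obvious when a benchmark is accidentally run without optimizations.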

It seems to me that 81.6 seconds is good enough for a 13GB file. I am closing this issue, but please feel free to reopen it if you are not satisfied.