andrews05 opened 2 years ago
zopfli-rs provides bindings for the original Zopfli C code, and that code does not use a particularly fast CRC32 implementation, at least for the Gzip format.
Besides, as admitted in the zopfli-rs benchmarks, the performance of both crates is non-linear with respect to the input data size, which makes it harder to reason about. Experimentally, I've also found substantial differences in runtime when compressing random data vs. data with a predictable distribution, even when they have the same size. It may happen that this Rust reimplementation is slower on that particular benchmark, but faster on others. And that benchmark tests input sizes of less than 1 KiB, which are so tiny that they don't even fill the DEFLATE window.
Anyway, I think that the performance of this crate can be improved! Using crc32fast is certainly an option, and so is multi-threaded compression.
Yeah, I'm not too worried about performance really, zopfli is always going to be horrendously slow 🤣. I just came across crc32fast and thought it might be useful here.
Although, ECT has demonstrated that fast zopfli is possible. I guess you could try and figure out how it was done. zopfli#119 and zopfli#110 (including the discussions) may be good references.
When dealing with near-optimal algorithms like Zopfli, one must keep the no free lunch theorem in mind. A patch may perform better on a given benchmark, to the detriment of other possible benchmarks :wink:
However, thank you for sharing the links! ECT certainly made some changes that are worth looking into, and multi-threading sounds like relatively low-hanging fruit that could bring some significant performance improvements to the table.
The current implementation looks very inefficient, handling io::Error for every byte: https://github.com/zopfli-rs/zopfli/blob/41d8712bc16750ac5a9cd958c33fa126c31b693b/src/gzip.rs#L31-L50 so I think there's plenty of room to improve.
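For illustration, here's a rough sketch (not the crate's actual code; the function name and buffer size are made up) of computing the gzip trailer fields in buffered chunks with crc32fast, so the `io::Error` check happens once per chunk instead of once per byte:

```rust
use std::io::{self, Read};

// Hypothetical helper: stream the input in 64 KiB chunks, updating the CRC32
// and byte count as we go. The gzip trailer needs the CRC32 of the raw input
// and its length modulo 2^32.
fn gzip_trailer_fields<R: Read>(mut input: R) -> io::Result<(u32, u32)> {
    let mut hasher = crc32fast::Hasher::new();
    let mut total: u64 = 0;
    let mut buf = [0u8; 64 * 1024];
    loop {
        let n = input.read(&mut buf)?; // one error check per chunk, not per byte
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
        total += n as u64;
    }
    Ok((hasher.finalize(), total as u32)) // ISIZE field is the length mod 2^32
}
```

Note that this sketch consumes the reader, so in the real code the same bytes would still need to reach the compressor; it's only meant to show the chunked CRC computation.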
I was focusing my attention on the deflate function implementation, but that indeed looks slow! I use Zopfli to generate raw DEFLATE streams, so obviously any IterRead overhead here doesn't show in profiling graphs.
Edit: I've realized I had a much more efficient-looking hashing read wrapper type lying around from another Rust project of mine, so let's transplant it to Zopfli and see how it performs!
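A "hashing read wrapper" along these lines could look roughly like this (illustrative only; `HashingReader` and its methods are made-up names, not this crate's API):

```rust
use std::io::{self, Read};

// Read adapter that forwards reads to the inner reader while folding every
// byte it hands out into a crc32fast::Hasher, so the CRC32 and byte count
// are available once the stream has been consumed.
struct HashingReader<R> {
    inner: R,
    hasher: crc32fast::Hasher,
    bytes_read: u64,
}

impl<R: Read> HashingReader<R> {
    fn new(inner: R) -> Self {
        Self { inner, hasher: crc32fast::Hasher::new(), bytes_read: 0 }
    }

    /// Consume the wrapper and return (CRC32, total bytes read).
    fn finish(self) -> (u32, u64) {
        (self.hasher.finalize(), self.bytes_read)
    }
}

impl<R: Read> Read for HashingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.hasher.update(&buf[..n]);
        self.bytes_read += n as u64;
        Ok(n)
    }
}
```

The compressor reads through the wrapper as if it were the plain input, and the CRC32 for the gzip trailer falls out at the end with no per-byte bookkeeping in the gzip code itself.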
I've switched to crc32fast and got rid of IterRead in the latest commit. I'm not sure how much it helps performance yet, but at least the code is more elegant now :smile:
That looks great!
Nice! I measured this around 3-4% faster on one file I tested.
> When dealing with near-optimal algorithms like Zopfli, one must keep the no free lunch theorem in mind. A patch may perform better on a given benchmark, to the detriment of other possible benchmarks 😉
Ha, true! Though in my experience ECT can compress better than zopfli's default settings while still being about 100x faster (without multithreading).
On the topic of near-optimal algorithms, libdeflate has its own implementation which is also very fast, though it falls just short of the ratios that ECT/zopfli can reach.
Indeed, by the looks of it, ECT and the other projects you've linked to made quite a few smart optimizations to the hot paths that pay off in the practical scenarios I can imagine. Admittedly, this crate is more focused on correctness and being a faithful Rust port of the original Zopfli code than on performance. Anyway, I'm glad that cleanup worked well for you, and I'd like this crate to become faster too by implementing some of those ideas!
Truth be told, however, I didn't dive that deep into the matter, and getting these optimizations done right requires a lot of effort. I feel confident about implementing multi-threading at some point, but gaining further substantial speedups by accumulating known micro-optimizations is a task that goes over my head right now :innocent:
For the time being, I'd like to keep this issue open to track performance improvement ideas.
Yeah, I totally understand. Wish I knew more about compression myself 🙂
I found some time and motivation to run zopfli under perf. I noticed that squeeze::get_cost_stat was taking up a significant chunk of CPU cycles, so I decided to start by avoiding some of the unsigned-signed type mish-mash that goes on there. This simple change brought the compression time for a test file down from ~14.91 s to ~13.82 s (a 7.9% improvement) on an Intel Core i7-12700 CPU.
This is by no means a scientific benchmark or the end of the road when it comes to performance optimization, but I'm amazed that this relatively simple change had such an outsized impact :smile:
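For a rough idea of the kind of change involved (purely illustrative; these functions are made up and are not the actual `get_cost_stat` code), compare index arithmetic that mixes signed and unsigned types with a version that stays unsigned throughout:

```rust
// Before: mixed signed/unsigned indices force a cast on every iteration and
// make it harder for the compiler to reason about the loop bounds.
fn sum_costs_mixed(costs: &[f64], start: i32, end: i32) -> f64 {
    let mut total = 0.0;
    let mut i = start;
    while i < end {
        total += costs[i as usize];
        i += 1;
    }
    total
}

// After: staying in usize removes the conversions and lets the optimizer
// treat this as a plain slice traversal.
fn sum_costs_unsigned(costs: &[f64], start: usize, end: usize) -> f64 {
    costs[start..end].iter().sum()
}
```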
Nice find!
This SIMD-accelerated CRC library may be able to improve performance: https://github.com/srijs/rust-crc32fast. Not sure if this is related to the performance difference between this and zopfli-rs?