powturbo opened this issue 1 year ago
I assume memcpy is just raw memory bandwidth with no compression? Both libdeflate and igzip have the advantage (and disadvantage) of being ground-up implementations rather than forks of zlib (on the flip side, with incompatible APIs). Not to say there's no room for improvement in zlib-ng, but that's at least worth a footnote here. It looks like everybody's compression speed is a bit anemic. Naturally we'd expect compression to be slower, but I do wonder what's left on the table there without sacrificing compression ratio.
AFAIK libdeflate requires whole-buffer and does not support streaming. So if all you look at is just speed then libdeflate will always win.
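To make the whole-buffer vs. streaming distinction concrete, here is a minimal C sketch. It uses Adler-32 (per RFC 1950) as a stand-in for a compressor, since a checksum carries state across calls the same way a deflate stream does; the function names are illustrative, not any library's actual API.

```c
#include <stdint.h>
#include <stddef.h>

#define ADLER_MOD 65521  /* largest prime below 2^16, per RFC 1950 */

/* incremental update: this is what makes a streaming API possible */
static uint32_t adler32_update(uint32_t adler, const uint8_t *buf, size_t len) {
    uint32_t a = adler & 0xFFFF, b = adler >> 16;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}

/* whole-buffer style: the entire input must be resident at once */
static uint32_t whole(const uint8_t *buf, size_t len) {
    return adler32_update(1, buf, len);
}

/* streaming style: fixed-size chunks, O(chunk) working memory,
 * state carried between calls -- the result is identical */
static uint32_t streamed(const uint8_t *buf, size_t len, size_t chunk) {
    uint32_t state = 1;
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = len - off < chunk ? len - off : chunk;
        state = adler32_update(state, buf + off, n);
    }
    return state;
}
```

A one-shot API can skip all the state bookkeeping and window management, which is part of why whole-buffer implementations tend to win pure speed comparisons.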
On the higher levels zlib-ng compresses twice as fast as zlib, with a slight decrease in compression ratio, which is expected.
Somewhat interesting: at level 6 the decompression speed is higher than at level 3. Perhaps that speaks to inflate spending more time in inflate_fast rather than decoding literals? I'm just guessing; I don't have profiles to really evaluate that delta today. I've definitely wanted something that chews through literals faster in the main inflate loop for a while now.
Err well, all the implementations share that trait. I guess it really is just not having to rely on memory read bandwidth as much with those higher compression ratios.
I've added a web content benchmark now. The average web page is 84k, streaming is not relevant here.
Heh, the sorting by size rather than throughput is throwing me off a bit. Looks like we don't do too much worse (in terms of compression throughput) than libdeflate at level 9, albeit with some trade-offs in compression.
Better compression algorithms, of course, do better. But I'm not losing sleep over that; those things aren't zlib-ng's purview. It might be worth a weekend dive into the techniques libdeflate benefits from for decompression.
Why libdeflate is faster than vanilla zlib for decompression: https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/deflate_decompress.c#L32-L43
> Word accesses rather than byte accesses when copying matches
Pretty sure we do this, at least.
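For reference, a hedged sketch of what word-wise match copying looks like (illustrative, not zlib-ng's or libdeflate's actual code): 8-byte chunked copies are safe whenever the match distance is at least the word size; overlapping matches (dist < 8, e.g. RLE-style runs) still need the byte loop.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy an LZ77 match of `len` bytes from `dist` bytes back in `out`,
 * starting at offset `pos`. Uses one 8-byte load/store per iteration
 * when the source and destination cannot overlap within a word. */
static void copy_match(uint8_t *out, size_t pos, size_t dist, size_t len) {
    uint8_t *dst = out + pos;
    const uint8_t *src = dst - dist;
    if (dist >= 8) {
        while (len >= 8) {              /* word-at-a-time fast path */
            memcpy(dst, src, 8);        /* compiles to a single 8-byte move */
            dst += 8; src += 8; len -= 8;
        }
    }
    while (len--) *dst++ = *src++;      /* byte loop: tail and overlaps */
}
```

The byte loop handling dist < 8 is also what makes self-overlapping matches (the deflate trick for run-length encoding) come out right.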
> Word accesses rather than byte accesses when reading input
I would hope we do this but I'll have to look at the main loop to be sure. I'm not entirely convinced we couldn't do multiple words at a time and try to decode every possible op at once.
> Faster Huffman decoding combined with various DEFLATE-specific tricks
That merits a deeper dive to figure out what they're talking about.
> Larger bitbuffer variable that doesn't need to be refilled as often
I think we're doing a form of this now.
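The branchless 64-bit refill that libdeflate uses looks roughly like the sketch below (my reconstruction, not zlib-ng's or libdeflate's actual code; it assumes a little-endian host and at least 8 readable bytes past the input cursor). After one 8-byte load the buffer holds 56..63 valid bits, enough for several DEFLATE symbols (max 15 bits each) before the next refill.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *next;   /* next input byte */
    uint64_t hold;         /* bit buffer, LSB-first as in DEFLATE */
    unsigned bits;         /* number of valid bits in hold */
} bitreader;

/* Top the buffer up to at least 56 valid bits with a single 8-byte read.
 * Branchless: the pointer advances by however many whole bytes were
 * actually absorbed, and `bits |= 56` accounts for the partial byte. */
static void refill(bitreader *br) {
    uint64_t word;
    memcpy(&word, br->next, 8);        /* one unaligned 8-byte load (LE) */
    br->hold |= word << br->bits;      /* splice fresh bits above old ones */
    br->next += (63 - br->bits) >> 3;  /* whole bytes consumed */
    br->bits |= 56;                    /* now 56..63 valid bits */
}

/* Pop the low n bits (n < 56 between refills). */
static uint64_t take(bitreader *br, unsigned n) {
    uint64_t v = br->hold & ((1ULL << n) - 1);
    br->hold >>= n;
    br->bits -= n;
    return v;
}
```

The win is that a 32-bit `hold` needs a refill check per symbol, while a 64-bit one amortizes the check over three or four symbols.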
> On x86_64, a version of the decompression routine is compiled with BMI instructions enabled and is used automatically at runtime when supported.
100% doing this, now.
I do envy the quality of docstrings that libdeflate has.
Additionally, libdeflate uses an extra 32KB hash table for 3-byte matches. I don't think we want to consume that much more memory. https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/hc_matchfinder.h#L69-L74
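The idea, roughly (table size and multiplier here are illustrative, not libdeflate's actual constants): a separate table keyed on the next 3 bytes remembers only the most recent position, so length-3 matches can be found in O(1) without lengthening the main hash chains.

```c
#include <stdint.h>
#include <stddef.h>

#define HASH3_BITS 14                       /* 2^14 uint16_t slots = 32 KiB */
static uint16_t hash3_tab[1 << HASH3_BITS];

/* Mix the next 3 bytes with a Fibonacci multiplier into a table index. */
static uint32_t hash3(const uint8_t *p) {
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    return (v * 2654435761u) >> (32 - HASH3_BITS);
}

/* Return the previous position whose next 3 bytes hashed the same,
 * then record `pos` as the most recent one. A real implementation
 * needs a sentinel for "empty"; here position 0 doubles as one. */
static uint16_t hash3_insert(const uint8_t *window, uint16_t pos) {
    uint32_t h = hash3(window + pos);
    uint16_t prev = hash3_tab[h];
    hash3_tab[h] = pos;
    return prev;
}
```

The memory cost is fixed regardless of input, which is exactly the trade-off being weighed above.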
Which one is isa-l in the results?
igzip
Extended TurboBench benchmark: dynamic/static web content compression, including zstd and memory usage. zlib-ng's memory allocation should be revised to allocate only the minimum necessary!
@powturbo Can you provide some detail on how you ran the benchmarks and how memory was measured? AFAIK, this memory usage is not possible with just zlib-ng the library, as the allocations are static and very small. Were you using minigzip/minideflate or some other application for the benchmarks? Those might have/cause a memory leak that we are not aware of.
This is done by TurboBench. The allocate/free functions are intercepted and memory usage is monitored. Build or download the Linux TurboBench from releases and run "./turbobench -ezlib_ng,6 file". The memory and stack usage are reported in the "file.tbb" result file.
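A minimal sketch of the interception idea (function names are hypothetical, not TurboBench's actual code): wrap the allocator, stash each block's size in a small header, and track current and peak totals. With zlib/zlib-ng these wrappers could be plugged in through the zalloc/zfree fields of z_stream.

```c
#include <stdlib.h>
#include <stddef.h>

static size_t g_current, g_peak;

/* Header stores the block size; the union preserves alignment
 * of the pointer returned to the caller. */
union hdr { size_t n; max_align_t pad; };

static void *track_alloc(size_t n) {
    union hdr *h = malloc(sizeof *h + n);
    if (!h) return NULL;
    h->n = n;
    g_current += n;
    if (g_current > g_peak) g_peak = g_current;  /* record high-water mark */
    return h + 1;
}

static void track_free(void *p) {
    if (!p) return;
    union hdr *h = (union hdr *)p - 1;
    g_current -= h->n;
    free(h);
}
```

Peak usage (`g_peak`) is the number that matters for the table above, since it reflects the largest working set the library ever held at once.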
> AFAIK libdeflate requires whole-buffer and does not support streaming. So if all you look at is just speed then libdeflate will always win.
There is now a fork of libdeflate that adds streaming support (and multithreaded compression/decompression via a pgzip program): https://github.com/sisong/libdeflate/tree/stream-mt
https://github.com/ebiggers/libdeflate/issues/335
# pigz with zlib-ng 2.21
$ timeit pigz -c -p 4 fragments.tsv | wc -c
Time output:
------------
* Command: pigz -c -p 4 fragments.tsv
* Elapsed wall time: 0:25.04 = 25.04 seconds
* Elapsed CPU time:
- User: 99.90
- Sys: 1.23
* CPU usage: 403%
* Context switching:
- Voluntarily (e.g.: waiting for I/O operation): 90798
- Involuntarily (time slice expired): 218
* Maximum resident set size (RSS: memory) (kiB): 7408
* Number of times the process was swapped out of main memory: 0
* Filesystem:
- # of inputs: 0
- # of outputs: 0
* Exit status: 0
1132266009 -> compressed size
# libdeflate fork with streaming support (4 compression threads).
# RSS usage is small compared with normal libdeflate.
$ timeit ./libdeflate_streaming/pgzip/pgzip -c fragments.tsv | wc -c
Time output:
------------
* Command: ./libdeflate_streaming/pgzip/pgzip -c fragments.tsv
* Elapsed wall time: 0:18.02 = 18.02 seconds
* Elapsed CPU time:
- User: 70.86
- Sys: 0.82
* CPU usage: 397%
* Context switching:
- Voluntarily (e.g.: waiting for I/O operation): 30841
- Involuntarily (time slice expired): 169
* Maximum resident set size (RSS: memory) (kiB): 28092
* Number of times the process was swapped out of main memory: 0
* Filesystem:
- # of inputs: 0
- # of outputs: 0
* Exit status: 0
1095567700 -> compressed size
# libdeflate fork with streaming support (1 compression thread).
$ timeit ./libdeflate_streaming/pgzip/pgzip -c -p 1 fragments.tsv | wc -c
Time output:
------------
* Command: ./libdeflate_streaming/pgzip/pgzip -c -p 1 fragments.tsv
* Elapsed wall time: 1:12.28 = 72.28 seconds
* Elapsed CPU time:
- User: 71.02
- Sys: 0.82
* CPU usage: 99%
* Context switching:
- Voluntarily (e.g.: waiting for I/O operation): 31250
- Involuntarily (time slice expired): 359
* Maximum resident set size (RSS: memory) (kiB): 6172
* Number of times the process was swapped out of main memory: 0
* Filesystem:
- # of inputs: 0
- # of outputs: 0
* Exit status: 0
1095560363 -> compressed size
# original libdeflate
$ timeit ./libdeflate/gzip -c fragments.tsv | wc -c
Time output:
------------
* Command: ./libdeflate/gzip -c fragments.tsv
* Elapsed wall time: 1:15.49 = 75.49 seconds
* Elapsed CPU time:
- User: 73.95
- Sys: 1.26
* CPU usage: 99%
* Context switching:
- Voluntarily (e.g.: waiting for I/O operation): 34372
- Involuntarily (time slice expired): 370
* Maximum resident set size (RSS: memory) (kiB): 5612864
* Number of times the process was swapped out of main memory: 0
* Filesystem:
- # of inputs: 0
- # of outputs: 0
* Exit status: 0
1126453467 -> compressed size
TurboBench: Build or download the executables and test with your own data.
Benchmark1: TurboBench: Dynamic/Static web content compression benchmark