zlib-ng / zlib-ng

zlib replacement with optimizations for "next generation" systems.
zlib License

Benchmark: zlib-ng vs isa-l, zlib, libdeflate, brotli #1486

Open powturbo opened 1 year ago

powturbo commented 1 year ago

TurboBench: Build or download executables and test with your own data.

Benchmark 1: TurboBench: Dynamic/Static web content compression benchmark

Benchmark 2: turbobench silesia.tar -eigzip,0,1,2,3/zlib_ng,1,3,6,9/libdeflate,1,3,6,9,12/zlib,1,3,6,9/memcpy
Hardware: Lenovo Ideapad 5 Pro, Ryzen 6600HS
(bold = pareto) MB = 1,000,000 bytes; C = compression, D = decompression

   C Size  ratio%    C MB/s    D MB/s  Name
 64677910    30.5      7.47   1133.66  libdeflate 12
 66715898    31.5     43.04   1116.39  libdeflate 9
 67511452    31.9    119.35   1127.36  libdeflate 6
 67644075    31.9     15.55    483.82  zlib 9
 68152563    32.2     27.79    734.94  zlib_ng 9
 68228660    32.2     37.74    478.69  zlib 6
 68914854    32.5     92.33    735.24  zlib_ng 6
 70166917    33.1    185.35   1110.18  libdeflate 3
 71068342    33.5    203.57   1085.40  libdeflate 2
 72490921    34.2    138.61    694.45  zlib_ng 3
 72968832    34.4     86.50    480.09  zlib 3
 73505577    34.7    288.56   1075.72  libdeflate 1
 75138353    35.5    271.18   1080.75  igzip 3
 76571415    36.1    598.69   1047.07  igzip 2
 77260023    36.5    127.47    448.95  zlib 1
 78154519    36.9    615.11   1020.09  igzip 1
 87551010    41.3    638.49    969.43  igzip 0
100929713    47.6    329.63    651.73  zlib_ng 1
211948544   100.0  16146.00  16117.76  memcpy
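
The "(bold = pareto)" marking does not survive in plain text, so here is a minimal Python sketch that recomputes a frontier from the numbers above. It assumes "pareto" means the trade-off between compressed size and compression speed only; the post does not define it, and the original bold set may also have taken decompression speed into account. The function name `pareto_front` is made up for this illustration.

```python
# (name, compressed size in bytes, compression speed in MB/s), copied from the table.
rows = [
    ("libdeflate 12", 64677910, 7.47),
    ("libdeflate 9", 66715898, 43.04),
    ("libdeflate 6", 67511452, 119.35),
    ("zlib 9", 67644075, 15.55),
    ("zlib_ng 9", 68152563, 27.79),
    ("zlib 6", 68228660, 37.74),
    ("zlib_ng 6", 68914854, 92.33),
    ("libdeflate 3", 70166917, 185.35),
    ("libdeflate 2", 71068342, 203.57),
    ("zlib_ng 3", 72490921, 138.61),
    ("zlib 3", 72968832, 86.50),
    ("libdeflate 1", 73505577, 288.56),
    ("igzip 3", 75138353, 271.18),
    ("igzip 2", 76571415, 598.69),
    ("zlib 1", 77260023, 127.47),
    ("igzip 1", 78154519, 615.11),
    ("igzip 0", 87551010, 638.49),
    ("zlib_ng 1", 100929713, 329.63),
    ("memcpy", 211948544, 16146.00),
]


def pareto_front(rows):
    """Return names of rows not dominated on (smaller size, higher speed)."""
    front = []
    best_speed = float("-inf")
    # Walk from smallest to largest output. A row is on the frontier only if
    # it compresses faster than every row that produces smaller output.
    for name, size, speed in sorted(rows, key=lambda r: (r[1], -r[2])):
        if speed > best_speed:
            front.append(name)
            best_speed = speed
    return front


if __name__ == "__main__":
    for name in pareto_front(rows):
        print(name)
```

Under that assumption a codec level drops off the frontier as soon as another entry is both smaller and faster, which is why sorting by size and tracking the best speed seen so far is enough.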
KungFuJesus commented 1 year ago

I assume memcpy is just raw memory bandwidth with no compression? Both libdeflate and igzip have the advantage/disadvantage of not being forks of zlib but ground-up implementations (on the flip side, with incompatible APIs). Not to say there's no room for improvement for zlib-ng, but that is at least a footnote to be provided here. It looks like everybody's compression speed is a bit anemic. Naturally we'd expect compression to be slower, I guess, but I do wonder what's left on the table there without sacrificing compression ratios.

nmoinvaz commented 1 year ago

AFAIK libdeflate requires whole-buffer and does not support streaming. So if all you look at is just speed then libdeflate will always win.

On the higher levels zlib-ng compresses roughly twice as fast as zlib, at the cost of slightly larger output, which is expected.
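
To make the whole-buffer versus streaming distinction concrete, here is a small sketch using Python's standard `zlib` module as a stand-in. It is not the zlib-ng or libdeflate API; the helper names `compress_whole` and `compress_streaming` and the sample payload are invented for this example.

```python
import zlib


def compress_whole(data: bytes, level: int = 6) -> bytes:
    """One-shot compression: the entire input must be in memory up front."""
    return zlib.compress(data, level)


def compress_streaming(chunks, level: int = 6) -> bytes:
    """Streaming compression: input arrives piece by piece."""
    comp = zlib.compressobj(level)
    out = []
    for chunk in chunks:
        # Each call may return some, all, or none of the output so far.
        out.append(comp.compress(chunk))
    # flush() emits whatever is still buffered and ends the stream.
    out.append(comp.flush())
    return b"".join(out)


data = b"example payload " * 4096
chunks = [data[i:i + 1024] for i in range(0, len(data), 1024)]

whole = compress_whole(data)
streamed = compress_streaming(chunks)
print(len(data), len(whole), len(streamed))
```

A streaming interface has to be able to stop and resume at arbitrary input and output boundaries, and a whole-buffer API avoids that bookkeeping.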

KungFuJesus commented 1 year ago

Somewhat interesting is that at level 6 the decompression speed is higher than at level 3, but perhaps that speaks to having an inflate that is working more in inflate_fast rather than decompressing literals? I'm just guessing; I don't have profiles to really evaluate that delta today. I've definitely wanted something that chews through literals in the main inflate loop faster for a while now.

Err well, all the implementations share that trait. I guess it really is just not having to rely on memory read bandwidth as much with those higher compression ratios.

powturbo commented 1 year ago

I've added a web content benchmark now. The average web page is 84k, so streaming is not relevant here.

KungFuJesus commented 1 year ago

Heh, the sorting by size rather than throughput is throwing me off a bit. Looks like we don't do too much worse (in terms of compression throughput) than libdeflate at level 9, albeit with some trade-offs in compression.

Better compression algorithms, of course, do better. But I'm not losing sleep over that; those things aren't zlib-ng's purview. It might be worth a weekend dive into the techniques libdeflate is benefiting from for decompression.

nmoinvaz commented 1 year ago

Why libdeflate is faster than vanilla zlib for decompression: https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/deflate_decompress.c#L32-L43

KungFuJesus commented 1 year ago

> Why libdeflate is faster than vanilla zlib for decompression: https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/deflate_decompress.c#L32-L43

> Word accesses rather than byte accesses when copying matches

Pretty sure we do this, at least.

> Word accesses rather than byte accesses when reading input

I would hope we do this but I'll have to look at the main loop to be sure. I'm not entirely convinced we couldn't do multiple words at a time and try to decode every possible op at once.

> Faster Huffman decoding combined with various DEFLATE-specific tricks

That merits a deeper dive to figure out what they're talking about.

> Larger bitbuffer variable that doesn't need to be refilled as often

I think we're doing a form of this now.

> On x86_64, a version of the decompression routine is compiled with BMI instructions enabled and is used automatically at runtime when supported.

100% doing this, now.
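
As an illustration of the "word accesses when reading input" and "larger bitbuffer" points, here is a toy Python bit reader that refills up to a 64-bit buffer in one step instead of one byte at a time. It is a sketch of the general idea only, not code from libdeflate or zlib-ng; the class name `BitReader` and the `refills` counter are made up for this example, and it assumes reads of at most 56 bits.

```python
class BitReader:
    """LSB-first bit reader (the bit order DEFLATE uses) with word-sized refills."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0      # index of the next unread input byte
        self.bitbuf = 0   # pending bits, least significant bit first
        self.bitcnt = 0   # number of valid bits in bitbuf
        self.refills = 0  # how many times we had to go back to the input

    def refill(self) -> None:
        # Load as many whole bytes as still fit into a 64-bit buffer at once.
        nbytes = min((64 - self.bitcnt) // 8, len(self.data) - self.pos)
        if nbytes > 0:
            word = int.from_bytes(self.data[self.pos:self.pos + nbytes], "little")
            self.bitbuf |= word << self.bitcnt
            self.bitcnt += 8 * nbytes
            self.pos += nbytes
            self.refills += 1

    def read_bits(self, n: int) -> int:
        # Only touch the input when the buffer cannot satisfy the request.
        if self.bitcnt < n:
            self.refill()
        if self.bitcnt < n:
            raise EOFError("not enough input bits")
        value = self.bitbuf & ((1 << n) - 1)
        self.bitbuf >>= n
        self.bitcnt -= n
        return value


reader = BitReader(bytes([0b10110100, 0xFF]))
first = reader.read_bits(3)   # low three bits of the first byte
second = reader.read_bits(5)  # remaining five bits of the first byte
third = reader.read_bits(8)   # the whole second byte
print(first, second, third, reader.refills)
```

With a byte-at-a-time refill the same three reads would go back to the input twice, while the wide refill services all of them from one load.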

Dead2 commented 1 year ago

I do envy the quality of docstrings that libdeflate has.

nmoinvaz commented 1 year ago

Additionally, libdeflate uses an extra 32KB hash table for 3-byte matches. I don't think we want to consume that much more memory. https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/hc_matchfinder.h#L69-L74

danielhrisca commented 1 year ago

Which one is isa-l in the results?

powturbo commented 1 year ago

igzip

powturbo commented 1 year ago

Extended benchmark: TurboBench: Dynamic/Static web content compression benchmark, now including zstd and memory usage. zlib-ng memory allocation must be revised to allocate only the minimum necessary!

Dead2 commented 1 year ago

@powturbo Can you provide some detail on how you ran the benchmarks and how memory was measured? AFAIK, this memory usage is not possible with just zlib-ng the library, as the allocations are static and very small. Were you using minigzip/minideflate or some other application for the benchmarks? Those might have/cause a memory leak that we are not aware of.

powturbo commented 1 year ago

This is done by TurboBench. The allocate/free functions are intercepted and the memory usage is monitored. Build or download the Linux TurboBench from releases and type "./turbobench -ezlib_ng,6 file". The memory and stack usage is reported in the "file.tbb" result file.
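
The thread does not show how TurboBench implements the interception, so the following is only a toy Python model of the general idea: wrap the allocate and free calls and keep a running and a peak byte count. The class `AllocTracker` and the allocation sizes are hypothetical and do not reflect zlib-ng's real allocations.

```python
class AllocTracker:
    """Toy allocator wrapper that records current and peak usage in bytes."""

    def __init__(self):
        self.sizes = {}        # handle -> size of each live allocation
        self.current = 0       # bytes currently allocated
        self.peak = 0          # highest value 'current' has reached
        self._next_handle = 1

    def malloc(self, size: int) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self.sizes[handle] = size
        self.current += size
        self.peak = max(self.peak, self.current)
        return handle

    def free(self, handle: int) -> None:
        # Look up the size recorded at allocation time and give it back.
        self.current -= self.sizes.pop(handle)


tracker = AllocTracker()
window = tracker.malloc(64 * 1024)   # hypothetical sliding-window buffer
table = tracker.malloc(128 * 1024)   # hypothetical hash table
tracker.free(table)
tracker.free(window)
print(tracker.peak, tracker.current)
```

A tracker of this kind reports what the library requested through the wrapped calls, which can differ from the resident set size the operating system reports for the process.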

ghuls commented 1 week ago

> AFAIK libdeflate requires whole-buffer and does not support streaming. So if all you look at is just speed then libdeflate will always win.

There is a fork of libdeflate now which added streaming support (and multithreaded compression/decompression: pgzip program): https://github.com/sisong/libdeflate/tree/stream-mt https://github.com/ebiggers/libdeflate/issues/335

# pigz with zlib-ng 2.21
$ timeit pigz -c -p 4 fragments.tsv | wc -c

Time output:
------------

  * Command: pigz -c -p 4 fragments.tsv
  * Elapsed wall time: 0:25.04 = 25.04 seconds
  * Elapsed CPU time:
     - User: 99.90
     - Sys: 1.23
  * CPU usage: 403%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 90798
     - Involuntarily (time slice expired): 218
  * Maximum resident set size (RSS: memory) (kiB): 7408
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1132266009 -> compressed size

# libdeflate fork with streaming support (4 compression threads).
# RSS usage is small compared with normal libdeflate.
$ timeit ./libdeflate_streaming/pgzip/pgzip -c fragments.tsv | wc -c

Time output:
------------

  * Command: ./libdeflate_streaming/pgzip/pgzip -c fragments.tsv
  * Elapsed wall time: 0:18.02 = 18.02 seconds
  * Elapsed CPU time:
     - User: 70.86
     - Sys: 0.82
  * CPU usage: 397%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 30841
     - Involuntarily (time slice expired): 169
  * Maximum resident set size (RSS: memory) (kiB): 28092
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1095567700 -> compressed size

# libdeflate fork with streaming support (1 compression thread).
$ timeit ./libdeflate_streaming/pgzip/pgzip -c -p 1 fragments.tsv | wc -c

Time output:
------------

  * Command: ./libdeflate_streaming/pgzip/pgzip -c -p 1 fragments.tsv
  * Elapsed wall time: 1:12.28 = 72.28 seconds
  * Elapsed CPU time:
     - User: 71.02
     - Sys: 0.82
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 31250
     - Involuntarily (time slice expired): 359
  * Maximum resident set size (RSS: memory) (kiB): 6172
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1095560363 -> compressed size

# original libdeflate
$ timeit ./libdeflate/gzip -c fragments.tsv | wc -c

Time output:
------------

  * Command: ./libdeflate/gzip -c fragments.tsv
  * Elapsed wall time: 1:15.49 = 75.49 seconds
  * Elapsed CPU time:
     - User: 73.95
     - Sys: 1.26
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 34372
     - Involuntarily (time slice expired): 370
  * Maximum resident set size (RSS: memory) (kiB): 5612864
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1126453467 -> compressed size
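
For anyone who wants comparable wall time and peak RSS figures without the `timeit` wrapper used above, here is a rough Python stand-in built on the standard `resource` and `subprocess` modules (Unix only). The function name `run_and_measure` is invented for this sketch, and it discards the command's output instead of piping it through `wc -c`.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run cmd with stdout discarded; return (wall seconds, max child RSS)."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    wall = time.perf_counter() - start
    # ru_maxrss is the largest RSS of any child waited for so far, so run one
    # command per Python process for clean numbers. The unit is KiB on Linux
    # and bytes on macOS.
    max_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall, max_rss


# Harmless placeholder command; substitute e.g. ["pigz", "-c", "-p", "4", "fragments.tsv"].
wall, rss = run_and_measure([sys.executable, "-c", "pass"])
print(f"wall={wall:.2f}s max_rss={rss}")
```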