Benchmark: zlib-ng vs isa-l, zlib, libdeflate, brotli

powturbo commented 1 year ago

TurboBench: Build or download executables and test with your own data.

Benchmark 1: TurboBench: Dynamic/Static web content compression benchmark

Benchmark 2: turbobench silesia.tar -eigzip,0,1,2,3/zlib_ng,1,3,6,9/libdeflate,1,3,6,9,12/zlib,1,3,6,9/memcpy
Hardware: Lenovo Ideapad 5 pro - Ryzen 6600hs / (bold = pareto) MB=1.000.000

C Size	ratio%	C MB/s	D MB/s	Codec
64677910	30.5	7.47	1133.66	libdeflate 12
66715898	31.5	43.04	1116.39	libdeflate 9
67511452	31.9	119.35	1127.36	libdeflate 6
67644075	31.9	15.55	483.82	zlib 9
68152563	32.2	27.79	734.94	zlib_ng 9
68228660	32.2	37.74	478.69	zlib 6
68914854	32.5	92.33	735.24	zlib_ng 6
70166917	33.1	185.35	1110.18	libdeflate 3
71068342	33.5	203.57	1085.40	libdeflate 2
72490921	34.2	138.61	694.45	zlib_ng 3
72968832	34.4	86.50	480.09	zlib 3
73505577	34.7	288.56	1075.72	libdeflate 1
75138353	35.5	271.18	1080.75	igzip 3
76571415	36.1	598.69	1047.07	igzip 2
77260023	36.5	127.47	448.95	zlib 1
78154519	36.9	615.11	1020.09	igzip 1
87551010	41.3	638.49	969.43	igzip 0
100929713	47.6	329.63	651.73	zlib_ng 1
211948544	100.0	16146.00	16117.76	memcpy
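
The "(bold = pareto)" note refers to entries that no other entry beats. As a rough sketch, the frontier can be recomputed from the rows above, assuming "Pareto-optimal" means that no other entry is both smaller and faster at compression. TurboBench's exact criterion is not stated in this thread, so that definition and the function name `pareto_frontier` are assumptions.

```python
# Rows copied from the table above: (compressed size, C MB/s, D MB/s, name).
ROWS = [
    (64677910, 7.47, 1133.66, "libdeflate 12"),
    (66715898, 43.04, 1116.39, "libdeflate 9"),
    (67511452, 119.35, 1127.36, "libdeflate 6"),
    (67644075, 15.55, 483.82, "zlib 9"),
    (68152563, 27.79, 734.94, "zlib_ng 9"),
    (68228660, 37.74, 478.69, "zlib 6"),
    (68914854, 92.33, 735.24, "zlib_ng 6"),
    (70166917, 185.35, 1110.18, "libdeflate 3"),
    (71068342, 203.57, 1085.40, "libdeflate 2"),
    (72490921, 138.61, 694.45, "zlib_ng 3"),
    (72968832, 86.50, 480.09, "zlib 3"),
    (73505577, 288.56, 1075.72, "libdeflate 1"),
    (75138353, 271.18, 1080.75, "igzip 3"),
    (76571415, 598.69, 1047.07, "igzip 2"),
    (77260023, 127.47, 448.95, "zlib 1"),
    (78154519, 615.11, 1020.09, "igzip 1"),
    (87551010, 638.49, 969.43, "igzip 0"),
    (100929713, 329.63, 651.73, "zlib_ng 1"),
    (211948544, 16146.00, 16117.76, "memcpy"),
]


def pareto_frontier(rows):
    """Names of entries that no other entry beats on both size and C MB/s."""
    frontier = []
    best_speed = float("-inf")
    # Walk from smallest to largest output. An entry survives only if it
    # compresses faster than every entry that produced a smaller output.
    for _size, c_speed, _d_speed, name in sorted(rows):
        if c_speed > best_speed:
            frontier.append(name)
            best_speed = c_speed
    return frontier


frontier = pareto_frontier(ROWS)
print(frontier)
```

Under this definition only the size/compression-speed trade-off is considered; adding decompression speed as a third axis would admit more entries.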

KungFuJesus commented 1 year ago

I assume memcpy is just raw memory bandwidth with no compression? Both libdeflate and igzip have the advantage/disadvantage of not being forks from zlib but ground up implementations (on the flip side with incompatible APIs). Not to say there's no room for improvement for zlib-ng, but that is at least a footnote to be provided here. It looks like everybody's compression speed is a bit anemic. Naturally we'd expect compression to be slower, I guess, but I do wonder what's left on the table there without sacrificing compression ratios.

nmoinvaz commented 1 year ago

AFAIK libdeflate requires whole-buffer and does not support streaming. So if all you look at is just speed then libdeflate will always win.

On the higher levels zlib-ng compresses roughly twice as fast as zlib, with a slight loss in compression ratio (slightly larger output), which is expected.
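
To make the whole-buffer versus streaming distinction concrete, here is a minimal sketch using Python's standard `zlib` module (which wraps zlib, not libdeflate or zlib-ng). It only illustrates the two API shapes; the helper names `compress_whole` and `compress_streaming` are made up for this example.

```python
import zlib


def compress_whole(data: bytes, level: int = 6) -> bytes:
    """One-shot: the entire input must be in memory before compression starts."""
    return zlib.compress(data, level)


def compress_streaming(chunks, level: int = 6) -> bytes:
    """Streaming: input arrives piece by piece and state is kept between calls."""
    compressor = zlib.compressobj(level)
    out = []
    for chunk in chunks:
        out.append(compressor.compress(chunk))
    out.append(compressor.flush())  # emit whatever is still buffered
    return b"".join(out)


data = b"zlib-ng vs libdeflate " * 1000
chunks = [data[i:i + 4096] for i in range(0, len(data), 4096)]

whole = compress_whole(data)
streamed = compress_streaming(chunks)
print(len(data), len(whole), len(streamed))
```

A whole-buffer API can plan around the full input length up front, while a streaming API has to carry window and bit-buffer state across calls, which is part of why the two are not directly comparable on speed alone.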

KungFuJesus commented 1 year ago

Somewhat interesting: at level 6 the decompression speed is higher than at level 3, but perhaps that speaks to having an inflate that spends more of its time in inflate_fast rather than decoding literals? I'm just guessing; I don't have profiles to really evaluate that delta today. I've wanted something that chews through literals faster in the main inflate loop for a while now.

Err well, all the implementations share that trait. I guess it really is just not having to rely on memory read bandwidth as much with those higher compression ratios.
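
A quick back-of-the-envelope check of that idea, using the zlib_ng rows from the table. It assumes the D MB/s column counts decompressed output bytes per second, which the thread does not confirm; `compressed_input_rate` is a name invented for this sketch.

```python
ORIGINAL_SIZE = 211948544  # uncompressed size, taken from the memcpy row


def compressed_input_rate(c_size: int, d_mb_s: float) -> float:
    """MB/s of compressed input consumed, given output-side D MB/s."""
    return d_mb_s * c_size / ORIGINAL_SIZE


# (compressed size, D MB/s) for zlib_ng levels 3 and 6, from the table.
rate_l3 = compressed_input_rate(72490921, 694.45)
rate_l6 = compressed_input_rate(68914854, 735.24)

print(f"zlib_ng 3 consumes about {rate_l3:.1f} MB/s of compressed input")
print(f"zlib_ng 6 consumes about {rate_l6:.1f} MB/s of compressed input")
```

Under that assumption both levels read compressed input at nearly the same rate, so the higher output-side speed at level 6 would mostly come from each input byte expanding to more output.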

powturbo commented 1 year ago

I've added a web content benchmark now. The average web page is 84k, streaming is not relevant here.

KungFuJesus commented 1 year ago

Heh, the sorting by size rather than throughput is throwing me off a bit. Looks like we don't do too much worse (in terms of compression throughput) than libdeflate at level 9, albeit with some trade-offs in compression.

Better compression algorithms, of course, do better. But I'm not losing sleep over that; those things aren't zlib-ng's purview. It might be worth a weekend dive into the techniques libdeflate is benefiting from for decompression.

nmoinvaz commented 1 year ago

Why libdeflate is faster than vanilla zlib for decompression: https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/deflate_decompress.c#L32-L43

KungFuJesus commented 1 year ago

> Why libdeflate is faster than vanilla zlib for decompression: https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/deflate_decompress.c#L32-L43

> Word accesses rather than byte accesses when copying matches

Pretty sure we do this, at least.

> Word accesses rather than byte accesses when reading input

I would hope we do this but I'll have to look at the main loop to be sure. I'm not entirely convinced we couldn't do multiple words at a time and try to decode every possible op at once.

> Faster Huffman decoding combined with various DEFLATE-specific tricks

That merits a deeper dive to figure out what they're talking about.

> Larger bitbuffer variable that doesn't need to be refilled as often

I think we're doing a form of this now.

> On x86_64, a version of the decompression routine is compiled with BMI instructions enabled and is used automatically at runtime when supported.

100% doing this, now.
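
As an illustration of the "word reads" and "larger bit buffer" points, here is a toy LSB-first bit reader (the bit order DEFLATE uses) that tops up a 64-bit buffer with several bytes per refill. It is a sketch of the general technique only, not code from libdeflate or zlib-ng, and the class name `BitReader` is hypothetical.

```python
class BitReader:
    """Toy LSB-first bit reader with a wide bit buffer."""

    BUFFER_BITS = 64  # stands in for a 64-bit machine word

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0      # next unread byte in data
        self.bitbuf = 0   # pending bits, least significant bit first
        self.bitcnt = 0   # number of valid bits in bitbuf
        self.refills = 0  # how often we had to go back to the input

    def refill(self) -> None:
        # Load as many whole bytes as fit in one go instead of one at a time.
        room = (self.BUFFER_BITS - self.bitcnt) // 8
        nbytes = min(room, len(self.data) - self.pos)
        word = int.from_bytes(self.data[self.pos:self.pos + nbytes], "little")
        self.bitbuf |= word << self.bitcnt
        self.bitcnt += 8 * nbytes
        self.pos += nbytes
        self.refills += 1

    def read(self, nbits: int) -> int:
        if self.bitcnt < nbits:
            self.refill()
        if self.bitcnt < nbits:
            raise EOFError("not enough input bits")
        value = self.bitbuf & ((1 << nbits) - 1)
        self.bitbuf >>= nbits
        self.bitcnt -= nbits
        return value


reader = BitReader(bytes([0b10110100, 0xFF]))
first = reader.read(3)   # low 3 bits of the first byte
second = reader.read(5)  # remaining 5 bits of the first byte
third = reader.read(8)   # the whole second byte
print(first, second, third, reader.refills)
```

With a byte-sized buffer the same three reads would need a refill per byte; the wide buffer serves them from a single load, which is the effect the quoted comment describes.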

Dead2 commented 1 year ago

I do envy the quality of docstrings that libdeflate has.

nmoinvaz commented 1 year ago

Additionally, libdeflate uses an extra 32KB hash table for 3-byte matches. I don't think we want to consume that much more memory. https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/hc_matchfinder.h#L69-L74

danielhrisca commented 1 year ago

Which one is isa-l in the results?

powturbo commented 1 year ago

igzip

powturbo commented 1 year ago

Extended benchmark TurboBench: Dynamic/Static web content compression benchmark including zstd and memory usage. zlib-ng memory allocation must be revised to allocate only the minimum necessary!

Dead2 commented 1 year ago

@powturbo Can you provide some detail on how you ran the benchmarks and how memory was measured? AFAIK, this memory usage is not possible with just zlib-ng the library, as the allocations are static and very small. Were you using minigzip/minideflate or some other application for the benchmarks? Those might have/cause a memory leak that we are not aware of.

powturbo commented 1 year ago

This is done by TurboBench. The allocate/free functions are intercepted and the memory usage is monitored. Build or download the Linux TurboBench from releases and type "./turbobench -ezlib_ng,6 file". The memory and stack usage is reported in the "file.tbb" result file.
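
The general idea of intercepting allocate/free calls to track peak usage can be sketched as follows. This is not TurboBench's code; `TrackingAllocator` and its integer handles are hypothetical stand-ins for a wrapper around malloc/free.

```python
class TrackingAllocator:
    """Records live and peak byte counts for allocations routed through it."""

    def __init__(self):
        self.live = {}        # handle -> size of each outstanding allocation
        self.current = 0      # bytes currently allocated
        self.peak = 0         # highest value 'current' has reached
        self._next_handle = 1

    def malloc(self, size: int) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self.live[handle] = size
        self.current += size
        self.peak = max(self.peak, self.current)
        return handle

    def free(self, handle: int) -> None:
        self.current -= self.live.pop(handle)


alloc = TrackingAllocator()
a = alloc.malloc(100)
b = alloc.malloc(50)
alloc.free(a)
c = alloc.malloc(20)
print(alloc.current, alloc.peak)
```

A tracker like this reports the high-water mark of what the library requested, which can differ from process-level figures such as RSS.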

ghuls commented 1 week ago

> AFAIK libdeflate requires whole-buffer and does not support streaming. So if all you look at is just speed then libdeflate will always win.

There is a fork of libdeflate now which added streaming support (and multithreaded compression/decompression: pgzip program): https://github.com/sisong/libdeflate/tree/stream-mt https://github.com/ebiggers/libdeflate/issues/335

# pigz with zlib-ng 2.21
$ timeit pigz -c -p 4 fragments.tsv | wc -c                                                                                                                                                   

Time output:                                                                                                                                                                                                                                           
------------                                                                                                                                                                                                                                           

  * Command: pigz -c -p 4 fragments.tsv                                                                                                                                                                   
  * Elapsed wall time: 0:25.04 = 25.04 seconds                                                                                                                                                                                                         
  * Elapsed CPU time:                                                                                                                                                                                                                                  
     - User: 99.90                                                                                                                                                                                                                                     
     - Sys: 1.23                                                                                                                                                                                                                                       
  * CPU usage: 403%                                                                                                                                                                                                                                    
  * Context switching:                                                                                                                                                                                                                                 
     - Voluntarily (e.g.: waiting for I/O operation): 90798                                                                                                                                                                                            
     - Involuntarily (time slice expired): 218                                                                                                                                                                                                         
  * Maximum resident set size (RSS: memory) (kiB): 7408                                                                                                                                                                                                
  * Number of times the process was swapped out of main memory: 0                                                                                                                                                                                      
  * Filesystem:                                                                                                                                                                                                                                        
     - # of inputs: 0                                                                                                                                                                                                                                  
     - # of outputs: 0                                                                                                                                                                                                                                 
  * Exit status: 0                                                                                                                                                                                                                                     

1132266009 -> compressed size

# libdeflate fork with streaming support (4 compression threads).
# RSS usage is small compared with normal libdeflate.
$ timeit ./libdeflate_streaming/pgzip/pgzip -c fragments.tsv | wc -c

Time output:
------------

  * Command: ./libdeflate_streaming/pgzip/pgzip -c fragments.tsv
  * Elapsed wall time: 0:18.02 = 18.02 seconds
  * Elapsed CPU time:
     - User: 70.86
     - Sys: 0.82
  * CPU usage: 397%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 30841
     - Involuntarily (time slice expired): 169
  * Maximum resident set size (RSS: memory) (kiB): 28092
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1095567700 -> compressed size

# libdeflate fork with streaming support (1 compression thread).
$ timeit ./libdeflate_streaming/pgzip/pgzip -c -p 1 fragments.tsv | wc -c

Time output:
------------

  * Command: ./libdeflate_streaming/pgzip/pgzip -c -p 1 fragments.tsv
  * Elapsed wall time: 1:12.28 = 72.28 seconds
  * Elapsed CPU time:
     - User: 71.02
     - Sys: 0.82
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 31250
     - Involuntarily (time slice expired): 359
  * Maximum resident set size (RSS: memory) (kiB): 6172
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1095560363 -> compressed size

# original libdeflate
$ timeit ./libdeflate/gzip -c fragments.tsv | wc -c

Time output:
------------

  * Command: ./libdeflate/gzip -c fragments.tsv
  * Elapsed wall time: 1:15.49 = 75.49 seconds
  * Elapsed CPU time:
     - User: 73.95
     - Sys: 1.26
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 34372
     - Involuntarily (time slice expired): 370
  * Maximum resident set size (RSS: memory) (kiB): 5612864
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1126453467 -> compressed size
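
For easier comparison, the four runs above can be summarized with a few lines of arithmetic. The numbers are copied from the timeit output; the dictionary keys and the choice of pigz as the baseline are just for this sketch.

```python
# wall time (s), max RSS (KiB), compressed size (bytes), from the output above
RUNS = {
    "pigz + zlib-ng, 4 threads": (25.04, 7408, 1132266009),
    "pgzip fork, 4 threads": (18.02, 28092, 1095567700),
    "pgzip fork, 1 thread": (72.28, 6172, 1095560363),
    "original libdeflate gzip": (75.49, 5612864, 1126453467),
}

base_wall, _base_rss, base_size = RUNS["pigz + zlib-ng, 4 threads"]

summary = {}
for name, (wall_s, rss_kib, size) in RUNS.items():
    summary[name] = {
        "speedup_vs_pigz": base_wall / wall_s,   # > 1 means faster than pigz
        "size_vs_pigz": size / base_size,        # < 1 means smaller output
        "rss_mib": rss_kib / 1024,
    }

for name, row in summary.items():
    print(
        f"{name}: {row['speedup_vs_pigz']:.2f}x wall speed, "
        f"{row['size_vs_pigz']:.3f}x size, {row['rss_mib']:.1f} MiB RSS"
    )
```

On these figures the 4-thread pgzip fork finishes about 1.39x faster than pigz with roughly 3% smaller output, while the original whole-buffer libdeflate gzip peaks at about 5.35 GiB of RSS.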

Benchmark: zlib-ng vs isa-l, zlib, libdeflate, brotli #1486