zopencommunity / meta

Meta repository to tie together the various underlying zopen repositories
https://zopen.community
Apache License 2.0

Use better compression algorithm for pax files #613

Open AnthonyGiorgio opened 11 months ago

AnthonyGiorgio commented 11 months ago

The pax files containing the port releases are compressed using the compress algorithm (.Z). This is an ancient, inefficient compression algorithm that has long been supplanted by better choices. I suggest that we only use .Z for the minimal bootstrapping packages, and instead use .xz for everything else. This will significantly reduce both download time and space on disk for the package cache. It has the downside of making xz a dependency of the toolchain, but I think that's a reasonable tradeoff.

~ > ls -l dos2unix-master.20231114_165830.zos.pax.Z 
-rw-r-----   1 ANGIO    DBX       741888 Dec  6 08:18 dos2unix-master.20231114_165830.zos.pax.Z
~ > gunzip dos2unix-master.20231114_165830.zos.pax.Z 
~ > xz dos2unix-master.20231114_165830.zos.pax 
~ > ls -l dos2unix-master.20231114_165830.zos.pax.xz 
-rw-r-----   1 ANGIO    DBX       182528 Dec  6 08:18 dos2unix-master.20231114_165830.zos.pax.xz
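
To put the listing above in numbers, here is a small sketch that computes the saving from the two byte counts reported by `ls -l`. The helper name `reduction` is made up for illustration.

```python
def reduction(before: int, after: int) -> tuple[float, float]:
    """Return (percent saved, size ratio) for two file sizes in bytes."""
    saved_pct = (1 - after / before) * 100
    ratio = before / after
    return saved_pct, ratio

# Byte counts taken from the ls -l output above.
z_size = 741888    # dos2unix ... .zos.pax.Z
xz_size = 182528   # dos2unix ... .zos.pax.xz

saved_pct, ratio = reduction(z_size, xz_size)
print(f"xz is {saved_pct:.1f}% smaller ({ratio:.2f}x)")
```

For this one package the .xz file is roughly a quarter of the size of the .Z file; other packages may of course compress differently.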

IgorTodorovskiIBM commented 11 months ago

Our pax.Z files are meant to work standalone as well. That's a great reduction though. We could provide both flavours in our releases: a .xz package, which is consumed by zopen install (and by users who have xz installed), and a pax.Z package for those who have neither.
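
The dual-format idea could look roughly like this on the client side. This is only a sketch: `pick_archive` and the file names are hypothetical, and it says nothing about how zopen install is actually implemented. The only real library call is `shutil.which`, which checks whether a tool is on `PATH`.

```python
import shutil

def pick_archive(assets, have_xz=None):
    """Prefer the .pax.xz asset when xz is usable, else fall back to .pax.Z.

    assets:  file names published for one release (hypothetical layout).
    have_xz: override for the PATH lookup, mainly useful for testing.
    """
    if have_xz is None:
        have_xz = shutil.which("xz") is not None
    preferred = [".pax.xz", ".pax.Z"] if have_xz else [".pax.Z"]
    for suffix in preferred:
        for name in assets:
            if name.endswith(suffix):
                return name
    return None  # nothing this client can unpack

# Hypothetical release with both flavours published.
release = ["foo-1.0.zos.pax.Z", "foo-1.0.zos.pax.xz"]
print(pick_archive(release, have_xz=True))
print(pick_archive(release, have_xz=False))
```

Keeping .Z as the last entry in the preference list means a client with xz still works if a release only ships the older format.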

AnthonyGiorgio commented 11 months ago

Traditionally, downloadable packages have been provided in multiple flavors. Very early versions of GNU tools came in .Z and .gz. When bzip2 came out, they provided .gz and .bz2. After xz was released, I saw all three versions supported for a bit. Nowadays it seems that everything is mostly in .xz format.

v1gnesh commented 11 months ago

I vote for a zstd future. It has various knobs to control the trade-off of size vs compr/decompr speed & CPU. While imagining, it would be awesome to see h/w accelerated zstd succeed h/w accelerated zlib/DEFLATE.

IgorTodorovskiIBM commented 11 months ago

This is really odd:

[ITODORO@ZOSCAN2B ~/projects]$ pax -w  -x pax -f meta.pax meta
[ITODORO@ZOSCAN2B ~/projects]$ pax -w -z -x pax -f meta.pax.Z meta
[ITODORO@ZOSCAN2B ~/projects]$ du -k meta.pax*
     77224 meta.pax
    101184 meta.pax.Z
     71936 meta.pax.zstd

How is meta.pax.Z larger in size than meta.pax?

Both xz and zstd (using -19 as the compression level) resulted in a 7.1 MB file. xz was a few hundred bytes smaller.

AnthonyGiorgio commented 11 months ago

It's because the compression algorithm in compress isn't that great. It was fine for a PDP-11 in the 1970s, but we're well beyond that now. compress is supposed to reject files that grow in size, but there's a command line option to suppress that behavior.
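
Any lossless compressor has to make some inputs larger, and data that is already compressed or otherwise looks random is the usual culprit. The sketch below shows the effect with Python's `zlib` (DEFLATE), not with compress itself, since the standard library has no LZW implementation. Whether this is what happened to meta.pax.Z is an assumption, not something the thread confirms.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Pseudo-random bytes stand in for already-compressed content (assumption).
noise = bytes(rng.getrandbits(8) for _ in range(100_000))
# Highly repetitive text stands in for compressible content.
text = b"zopen community port release\n" * 4_000

for label, data in (("noise", noise), ("text", text)):
    packed = zlib.compress(data, 9)
    print(f"{label}: {len(data)} -> {len(packed)} bytes")
```

DEFLATE can fall back to storing blocks verbatim, so it grows the random input only slightly. LZW-style coders generally lack that fallback, so their growth on such data can be larger.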

DevonianTeuchter commented 11 months ago

With the compressed sizes being reasonably close, isn't zstd likely faster to decompress, which is where we would want it to be fast? The build might take longer at maximum compression, but if that saves a wee bit of time for end users at the expense of build times and a marginally bigger download, might that be a good thing?

v1gnesh commented 11 months ago

Yeah, we can dial the knobs for zstd, to optimize for size or for compr/decompr speed.

AnthonyGiorgio commented 11 months ago

I hadn't heard of zstd before. Would that be easy to port?

v1gnesh commented 11 months ago

It's already ported.

https://www.infoq.com/news/2022/09/amazon-gzip-zstd/

IgorTodorovskiIBM commented 11 months ago

Yep, just tried it and it seems to be a lot faster than xz on z/OS:

Decompression: zstd: 0.27s

$ time zstd -d git-2.43.0.20231127_145951.zos.pax.zst
git-2.43.0.20231127_145951.zos.pax.zst: 111585708 bytes

real    0m0.276s
user    0m0.191s
sys     0m0.064s

vs xz: (1.3s)

$ time xz -d git-2.43.0.20231127_145951.zos.pax.xz

real    0m1.370s
user    0m1.005s
sys     0m0.335s

Compression is also a lot faster (30s for xz vs. 22s for zstd at the highest compression level). However, the compression ratio was not quite as good: xz produced an 18 MB file and zstd a 20 MB file.

AnthonyGiorgio commented 11 months ago

Decompression is the common case here, as the build server is the only one creating archives.
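
Since each archive is built once and unpacked many times, the per-user cost is roughly download time plus decompression time. Here is a back-of-the-envelope sketch using the figures quoted above. It assumes the 18 MB and 20 MB sizes and the two timings all refer to the same git package, and that link speed is the only other factor. `per_user_seconds` is a made-up helper.

```python
def per_user_seconds(size_mb, decompress_s, link_mb_per_s):
    """Download time plus decompression time for one install."""
    return size_mb / link_mb_per_s + decompress_s

# Figures quoted in the thread: (size in MB, decompression wall-clock seconds).
XZ = (18, 1.370)
ZSTD = (20, 0.276)

# Link speed at which both formats cost the user the same time:
# extra download time for zstd == decompression time saved by zstd.
break_even = (ZSTD[0] - XZ[0]) / (XZ[1] - ZSTD[1])
print(f"break-even link speed: {break_even:.2f} MB/s")

for speed in (1, 10):  # MB/s, illustrative values only
    xz_t = per_user_seconds(*XZ, speed)
    zstd_t = per_user_seconds(*ZSTD, speed)
    print(f"{speed} MB/s: xz {xz_t:.2f}s, zstd {zstd_t:.2f}s")
```

Under these assumptions zstd comes out ahead on links faster than about 1.8 MB/s, and xz on slower ones.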

v1gnesh commented 11 months ago

I would like us to pioneer zstd into the Z ecosystem :muscle: EDIT: But the TS7700 gang beat us to it in the back-end division - https://www.ibm.com/support/pages/system/files/inline-files/TS7770_R_5.0.1_Performance_Version_1.3.pdf