zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License

Support new "bitinformation" codec in numcodecs #298

Closed rabernat closed 2 years ago

rabernat commented 2 years ago

There is an exciting new paper by @milankl in Nature Computational Science entitled "Compressing atmospheric data into its real information content".

Abstract

Hundreds of petabytes are produced annually at weather and climate forecast centers worldwide. Compression is essential to reduce storage and to facilitate data sharing. Current techniques do not distinguish the real from the false information in data, leaving the level of meaningful precision unassessed. Here we define the bitwise real information content from information theory for the Copernicus Atmospheric Monitoring Service (CAMS). Most variables contain fewer than 7 bits of real information per value and are highly compressible due to spatio-temporal correlation. Rounding bits without real information to zero facilitates lossless compression algorithms and encodes the uncertainty within the data itself. All CAMS data are 17× compressed relative to 64-bit floats, while preserving 99% of real information. Combined with four-dimensional compression, factors beyond 60× are achieved. A data compression Turing test is proposed to optimize compressibility while minimizing information loss for the end use of weather and climate forecast data.

More relevant description of the method

Based on the bitwise real information content, we suggest a strategy for the data compression of climate variables. First, we diagnose the real information for each bit position. Afterwards, we round bits with no significant real information to zero, before applying lossless data compression. This allows us to minimize information loss but maximize the efficiency of the compression algorithms.

Bits with no or only little real information (but high entropy) are discarded via binary round-to-nearest as defined in the IEEE-754 standard. This rounding mode is bias-free and therefore will ensure global conservation of the quantities that are important in climate model data. Rounding removes the incompressible false information and therefore increases compressibility. Although rounding is irreversible for the bits with false information, the bits with real information remain unchanged and are bitwise reproducible after decompression. Both the real information analysis and the rounding mode are deterministic, also satisfying reproducibility.

Lossless compression algorithms can be applied efficiently to rounded floating-point arrays (the round + lossless method). Many general-purpose lossless compression algorithms are available and are based on dictionaries and other statistical techniques to remove redundancies. Most algorithms operate on bitstreams and exploit the correlation of data in a single dimension only, so we describe such methods as one-dimensional (1D) compression. Here, we use the Zstandard algorithm for lossless compression, which has emerged as a widely available default in recent years.

... In an operational setting we recommend the following workflow. First, for each variable, the bitwise real information content is analyzed from a representative subset of the data. For example, a single time step can be representative of subsequent time steps if the statistics of the data distribution are not expected to change. From the bitwise real information, the number of mantissa bits to preserve 99% of information is determined (the ‘keepbits’). Second, during the simulation, the arrays that will be archived are rounded to the number of keepbits (which are held fixed) and compressed. The first step should be done offline—once in advance of a data-producing simulation. Only the second step has to be performed online, meaning every time a chunk of data is archived.
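
In numcodecs terms, that two-step workflow might look roughly like the sketch below. This is only a sketch: analyze_bitinformation is a hypothetical stand-in for the BitInformation.jl analysis, and BitRound is a codec we don't have yet; only numcodecs.Zstd is real.

import numcodecs

# offline, once per variable: find the keepbits from a representative chunk
# (analyze_bitinformation is a hypothetical stand-in for BitInformation.jl)
keepbits = analyze_bitinformation(representative_chunk, information=0.99)

# online, for every archived chunk: round, then compress losslessly
bitround = BitRound(keepbits=keepbits)  # hypothetical codec, not yet in numcodecs
zstd = numcodecs.Zstd(level=3)
compressed = zstd.encode(bitround.encode(chunk))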

Prior Conversation

The compressor is currently implemented in Julia. I emailed Milan to ask about possible paths for supporting this compression codec in numcodecs. I am now moving the discussion here so other devs can weigh in.

I posed the question in the following way:

We would like to investigate implementing this compressor in numcodecs so we can use it with Zarr data. In order to do that, we would need to either

  • call Julia directly from numcodecs (certainly possible but would limit usage)
  • reimplement the algorithm in python (not appealing)
  • export a C-compatible binary library from Julia (possible but perhaps difficult)

Milan replied:

Thanks so much for your interest! I'd be more than happy to develop BitInformation.jl further towards a reference implementation for the analysis of the bitwise information content. But you are quite right that maintaining similar code in several languages might be tricky and unnecessary additional effort. Before suggesting either of these options, let me highlight that there are technically two parts to our "compression method". Quotation marks because the idea of the real bitwise information is that it should also be possible to combine it with other existing compression methods, as we do with Zfp in our paper.

Part 1: the bitwise information analysis. This just informs how many bits should be retained for a given array, and does not have to be repeated, say, for subsequent time steps.

Part 2: the removal of false information. We simply round these bits to zero with round-to-nearest; while this doesn't technically compress, it can be combined with any lossless algorithm, and I see you have plenty in numcodecs already.

So I think we’d need two things in numcodecs, for

part 1 the bitinformation function from BitInformation.jl If there’s a way to pull that one directly from Julia into numcodecs that’d be great because then I can keep updating / maintaining the BitInformation.jl and you’d get automatically the newest version.

part 2 a binary round to nearest. I see that np.round rounds in decimal, but we’d need a binary rounding mode to guarantee that the trailing bits are zero. This is technically 3 lines of code and so that can be done maybe directly within C/Cython.

So I’d suggest to reimplement part 2 (the rounding) as it’s only a small piece of code, but we should check how to call Julia efficiently for part 1.

Since we already support ZFP in numcodecs, I would say that this strategy makes a lot of sense.
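
For part 1, one possible bridge is the pyjulia package. The following is an untested sketch: it assumes a local Julia install with BitInformation.jl added, and the exact bitinformation call signature should be checked against the Julia docs.

import numpy as np
from julia import Main  # pyjulia; assumes a working Julia + PyCall setup

Main.eval("using BitInformation")
chunk = np.random.rand(256, 256).astype(np.float32)
# hypothetical call shape: check BitInformation.jl for the actual signature
info_per_bit = Main.bitinformation(chunk)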

I'm copying Fig. 1 from the paper into this issue, since it provides a great visualization of the method.

[Fig. 1 from the paper]

Next Steps

I would note that there is also a final step in this method, which is to apply a lossy (ZFP) or lossless (Zstd) compression codec to the data.

As a concrete step forward, we could imagine first implementing the "binary rounding" as a standalone codec in numcodecs.

One technical question I have for @milankl is the following: the round function has a signature like this (translated to Python):

def round(x: float, setmask: int, shift: int, shavemask: int) -> float:
    ...

where x is the array to be rounded. Can you confirm that setmask, shift, and shavemask are constants for each variable, which are determined based on the mutual information analysis from "part 1"? If this is the case, we can simply treat these as scalar parameters for the bitwise rounding codec.

This also suggests an even simpler path to supporting this compressor in Python: implement the binary rounding directly in numpy, operating on an integer view of the float array.

I'm very excited about this opportunity and really eager to try it out, e.g. on the LLC4320 ocean simulations.

milankl commented 2 years ago

Can you confirm that setmask, shift, and shavemask are constants for each variable, which are determined based on the mutual information analysis from "part 1"?

Yes, for a given variable, say temperature, if you want to keep 5 significant/mantissa bits, then the setmask, shift and shavemask are constant. E.g. the shavemask is just 1 for all bits that should be kept and 0 elsewhere. In practice I do

https://github.com/milankl/BitInformation.jl/blob/b335571f5d56c12df579c39ccca1bc29b681f61e/src/round.jl#L48-L59

So for a given array and nsb (the number of significant bits that should be kept), the setmask, shift and shavemask are defined once and then passed, together with every element in the array, to the actual rounding function.
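
In numpy, those Float32 constants could be derived roughly like this (an untested sketch assuming 0 < nsb < 23; the names mirror the Julia ones above):

import numpy as np

def float32_masks(nsb):
    # constants for Float32 (23 explicit mantissa bits)
    shift = 23 - nsb                                       # trailing bits to discard
    setmask = np.uint32((1 << (shift - 1)) - 1)            # half quantum minus 1
    shavemask = np.uint32((0xFFFFFFFF >> shift) << shift)  # 1 for kept bits, 0 else
    return setmask, shift, shavemask

def bitround(x, nsb):
    # in-place round-to-nearest (tie to even) on the raw bit pattern
    setmask, shift, shavemask = float32_masks(nsb)
    ui = x.view(np.uint32)
    ui += setmask + ((ui >> shift) & np.uint32(1))  # the carry decides round up/down
    ui &= shavemask                                 # zero the discarded trailing bits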

milankl commented 2 years ago

(note that technically it would also be possible to choose a different precision for different parts of the array. E.g. one precision around the poles and another precision elsewhere. I didn't bother to allow for such an option, but as long as you run a lossless compressor over it, it will just avoid the explicit storage of all those trailing bits that are 0. I don't think that there's any other compression method that would allow such variable precision compression...)

rabernat commented 2 years ago

Ok, so nsb is the only true parameter for the rounding codec. Everything else is derived from it. Makes sense.

This discussion feels similar to my experience trying to use ZFP as a standalone compressor. ZFP requires you to choose a mode - fixed rate, fixed precision, or fixed accuracy - and a corresponding rate, precision or accuracy parameter. (There is also a "reversible" (lossless) mode.) I never knew how to choose these parameters for a given array, and that is why I have never really used ZFP much so far. It seems that your nsb parameter would also be relevant to this question.
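
For reference, those three lossy modes map onto the numcodecs ZFPY wrapper roughly as follows (a sketch; it assumes the zfpy bindings are installed, and the parameter values are arbitrary):

import zfpy                      # LLNL's ZFP Python bindings
from numcodecs.zfpy import ZFPY  # numcodecs wrapper

fixed_rate = ZFPY(mode=zfpy.mode_fixed_rate, rate=16)                 # bits per value
fixed_precision = ZFPY(mode=zfpy.mode_fixed_precision, precision=12)  # bit planes kept
fixed_accuracy = ZFPY(mode=zfpy.mode_fixed_accuracy, tolerance=1e-3)  # absolute error bound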

When you use ZFP on top of nsb-rounding, do you use the lossless mode?

rabernat commented 2 years ago

When you use ZFP on top of nsb-rounding, do you use the lossless mode?

NM, should have read the paper more carefully! 😉

We use Zfp in its precision mode, which offers discrete levels to manually adjust the retained precision. Owing to the rather logarithmic distribution of CAMS data (Supplementary Fig. 1), a log-preprocessing of the data is applied to prevent sign changes (including a flushing to zero) within the compression [57,58]. The error introduced by Zfp is approximately normally distributed and therefore usually yields higher maximum errors compared to round-to-nearest in float arithmetic, although median errors are comparable. To find an equivalent error level between the two methods, we therefore choose the precision level of Zfp to yield median absolute and decimal errors that are at least as small as those from rounding. The manual choice of the precision level is hence tied to the analysis of the bitwise real information content and automated.

milankl commented 2 years ago

Ok, so nsb is the only true parameter for the rounding codec. Everything else is derived from it. Makes sense.

Exactly, and nsb is then obtained from the bitwise real information content.

When you use ZFP on top of nsb-rounding, do you use the lossless mode?

As you quote from the paper. But also note that I've long been struggling to automate this translation. In that sense, I've usually been more of a fan of round+lossless than Zfp, but that's because I understand rounding for floats and the errors are very easy to control (in an absolute and relative sense). As I don't fully understand the block transform in Zfp, I've often ended up with maximum errors higher than I wanted to tolerate.

That's why I found round+lossless a much safer bet, even for weirdly distributed data. As Zfp requires, to some degree, multidimensional correlation and smoothness, it was often not clear to me what happens with data that's a bit odd (strong gradients, etc.).

Also note that as long as you are dealing with 2D data, Zfp isn't necessarily better than round+lossless with Zstandard (Fig. 5 in the paper). For 3D, 4D Zfp can exploit the multidimensional correlation well, but for 2D that advantage is limited. While 3D or 4D compression sounds good, it also means that data is compressed in bigger chunks and all dimensions have to be decompressed even if only a slice of the data is requested. I'm not saying that Zfp isn't good at what it's designed for, but round+lossless might be easier to work with for many purposes.

jakirkham commented 2 years ago

Wonder if it makes sense to have 2 codecs users could combine? Namely bitgrooming ( https://github.com/zarr-developers/numcodecs/issues/1 ) + ZFP

rabernat commented 2 years ago

Yes, that was my interpretation of how we would implement this. Milan's bitwise rounding is similar to, but distinct from, bitgrooming. The BitInformation.jl docs give a nice overview of the different floating-point rounding methods.

We might as well implement all of these at once, as they are all similar and fairly straightforward to code; see e.g. the sketch below.
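
An untested numpy sketch of those variants for Float32 (flat float32 arrays assumed; round-to-nearest itself is omitted here since it appears elsewhere in this thread):

import numpy as np

def _keepmask(keepbits):
    # Float32: 1 for the sign, exponent and kept mantissa bits; 0 for discarded bits
    shift = 23 - keepbits
    return np.uint32((0xFFFFFFFF >> shift) << shift), shift

def shave(a, keepbits):
    # zero the discarded bits (rounds toward zero)
    keepmask, _ = _keepmask(keepbits)
    b = a.view(np.uint32)
    b &= keepmask

def set_one(a, keepbits):
    # set the discarded bits to one (rounds away from zero)
    keepmask, _ = _keepmask(keepbits)
    b = a.view(np.uint32)
    b |= ~keepmask

def halfshave(a, keepbits):
    # shave, then set the highest discarded bit (re-centres the error)
    keepmask, shift = _keepmask(keepbits)
    b = a.view(np.uint32)
    b &= keepmask
    b |= np.uint32(1 << (shift - 1))

def groom(a, keepbits):
    # bitgrooming: shave even-indexed values, set odd-indexed ones
    keepmask, _ = _keepmask(keepbits)
    b = a.view(np.uint32)
    b[0::2] &= keepmask
    b[1::2] |= ~keepmask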

milankl commented 2 years ago

Wonder if it makes sense to have 2 codecs users could combine? Namely bitgrooming ( #1 ) + ZFP

Note that this only works with ZFP lossless, which I never found to yield better compression factors than other lossless compressors.

Also, I'm not a fan of bitgrooming. Sure, it's easier to understand than IEEE's round-to-nearest tie to even, but in the end both are only a few lines of code, and if well implemented they are both faster than 5 GB/s, which is far beyond the throughput of any compressor:

julia> using BitInformation, BenchmarkTools
julia> A = rand(Float32,10000000);    # a 40MB array
julia> @btime groom(A,5);
  6.411 ms (2 allocations: 38.15 MiB)
julia> @btime round(A,5);
  5.327 ms (2 allocations: 38.15 MiB)

That's 40 MB / 0.006 s > 6 GB/s, including the memory allocation for the rounded array. So why not opt for the standardised (for good reasons) IEEE rounding if any overhead is negligible anyway?

jakirkham commented 2 years ago

No objection to having a rounding filter as well

rabernat commented 2 years ago

So now we just need to find someone willing to translate the Julia rounding code to Python / Cython! 🙃

rsignell-usgs commented 2 years ago

@milankl is this the best example of a complete use case? https://github.com/esowc/Elefridge.jl/blob/master/nb/bitinformation.ipynb I'd like to try re-compressing an existing zarr dataset using this approach.

milankl commented 2 years ago

I wrote a tutorial in a_tutorial.ipynb here https://codeocean.com/capsule/0545810/tree/v1

rsignell-usgs commented 2 years ago

@milankl , that is awesome -- a perfect tutorial for what I want to do.
I had never heard of codeocean before this morning, but was able to sign up, copy your capsule and run your code in Jupyter with no problem! Now on to my own data!

milankl commented 2 years ago

I just overhauled IEEE rounding in BitInformation.jl and released v0.3. Arbitrary float types are now supported (incl. Float16 and BFloat16) and the whole thing is even faster than before. It's written in a much more Julian way, using a lot of multiple dispatch, but that way I really only need to implement the algorithm once, in a fairly abstract way, and Julia compiles it automatically for any float format. I reckon that for Python/Cython the abstractions unfortunately need to be unpacked to still get code with good performance. In any case, I'm happy to provide guidance for an implementation in numcodecs!

halehawk commented 2 years ago

I tried numcodecs' Quantize before; I thought it was similar to bitgrooming. But Quantize compresses the data to at most 16 bits, so it is not ideal to bundle with ZFP. Now I am thinking of getting the number of bits from Milan's bitinformation analysis and then sending it to ZFP's precision mode for compression. But this only works for spatially correlated data; for non-spatially-correlated data we cannot use bitinformation, so we would need a correlation test first.

milankl commented 2 years ago

No objection to having a rounding filter as well

@jakirkham I looked a bit more at efficient implementations of round-to-nearest, bitshave, halfshave, bitset and bitgrooming. In the BitInformation.jl main branch, all except bitgrooming reach about 16 GB/s on my MacBook, at which point they are probably memory-bound rather than compute-bound. Bitgrooming is slower at ~10 GB/s, presumably because of the alternating instructions, which can't be pipelined as efficiently. I've tried walking through the array in steps of two, but that slows things down by 2x, presumably because 2x larger chunks of data have to be read in simultaneously. So I really don't see any reason (tell me if I'm wrong) why anyone would want bitshave/halfshave/bitset or bitgrooming if IEEE's round-to-nearest (tie to even) is also available. Timings:

julia> using BitInformation, BenchmarkTools
julia> A = rand(Float32,10_000_000);
julia> @btime round!($A,5);
  2.295 ms (0 allocations: 0 bytes)

julia> @btime shave!($A,5);
  2.165 ms (0 allocations: 0 bytes)

julia> @btime halfshave!($A,5);
  2.149 ms (0 allocations: 0 bytes)

julia> @btime set_one!($A,5);
  2.182 ms (0 allocations: 0 bytes)

julia> @btime groom!($A,5);
  3.791 ms (0 allocations: 0 bytes)

@czender you may find this interesting

czender commented 2 years ago

Thanks for pointing me to this, @milankl. The BitGroom (BG) algorithm is in some sense frozen, i.e., the GMD article defines it with the alternating shave/set algorithm. However, the default NCO lossy compression/quantization algorithm has changed over time. After discussing the rounding with @rkouznetsov and you, we adopted, in July 2020, the bitround algorithm subsequently documented here, though the number of keep bits remained unchanged. Could you clarify whether this bitround is distinct from IEEE round-to-nearest (tie to even)? In version 5.0.3 (October), NCO changed its default lossy algorithm to Granular BitGroom (GBG), which rounds bits more aggressively than BG, following the methods of Delaunay et al. GBG evaluates each value separately to determine the number of keep bits (hence it is "granular"). On average, GBG eliminates about 3 more bits than BG. GBG also uses bitround for rounding. The way to access all of these quantization algorithms is described here. I see you have found this thread.

milankl commented 2 years ago

Yes, round here is IEEE's round-to-nearest tie-to-even, as implemented here, which looks exactly like what @rkouznetsov was implementing (thanks for the credits!).

[screenshot of the rounding code from @rkouznetsov's paper]

So it's exactly what's done in hardware, but with a variable number of mantissa bits. Do I understand correctly that GBG just serves to translate better between binary and decimal representations? As in, if I live in the binary world and only think in bits, then it's just bitgrooming?

Related: how can GBG use round(-to-nearest) if it has bitgroom in its name? Maybe it's just an issue of phrasing ;) but grooming (= shave/set-to-one alternately) and rounding (= round-to-nearest, tie to even) are mutually exclusive, no?

czender commented 2 years ago

As you suspect, it's just a matter of phrasing: I picked the name GBG for a proposal submitted before I knew what the exact algorithm would be, except that it would be granular. In fact GBG rounds and does not shave/set. GBG determines the number of keep bits based on two inputs: the NSD (number of significant digits) requested, and the numeric value to be quantized. The latter requires the log of the mantissa, as Delaunay et al. discuss. So yes, GBG does translate better between binary and decimal, but it's not bitgrooming.
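
For a rough sense of scale, the non-granular baseline is simple: preserving nsd decimal digits needs about log2(10) ≈ 3.32 mantissa bits per digit. A minimal sketch (the per-value adjustment based on the mantissa is what GBG adds on top):

import math

def keepbits_for_nsd(nsd):
    # non-granular upper bound: log2(10) mantissa bits per decimal digit
    return math.ceil(nsd * math.log2(10))  # e.g. nsd=3 -> 10 bits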

rkouznetsov commented 2 years ago

Hi everyone, thanks for the interesting discussion! I really have to read up on that information content analysis.

If I got it right, GBG is designed to keep the fewest bits while preserving a given number of decimal digits for every value, regardless of the relative error introduced. Bit rounding minimizes the sup of the relative error introduced for a given number of keep bits.

If one is really concerned about decimals, GBG is optimal. But I'd bet that rounding should be the method of choice unless the data are really going to be printed in decimal form.

Please correct me if I am wrong.

rsignell-usgs commented 2 years ago

@rkouznetsov I think I found your paper with the Python rounding algorithms by googling some of the text from the @milankl screen grab above!

It's great you already implemented these in Python and shared the code in your paper!

import numpy as np

def TrimPrecision(a, keepbits):
    # round a float32 array in place, keeping `keepbits` explicit mantissa bits
    # (view taken as uint32, rather than int32 as in the paper, so the mask
    #  fits without signed-integer overflow in recent numpy)
    assert a.dtype == np.float32
    b = a.view(dtype=np.uint32)
    maskbits = 23 - keepbits                                # trailing bits to discard
    mask = np.uint32((0xFFFFFFFF >> maskbits) << maskbits)
    half_quantum1 = np.uint32((1 << (maskbits - 1)) - 1)
    b += ((b >> maskbits) & 1) + half_quantum1              # round to nearest, ties to even
    b &= mask                                               # zero the discarded bits
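
A quick usage sketch (values are arbitrary):

a = np.linspace(0.0, 1.0, 8, dtype=np.float32)
TrimPrecision(a, keepbits=7)  # rounds `a` in place through the integer view
# the trailing mantissa bits are now zero, so a lossless codec (e.g. Zstd)
# can compress the rounded array much further
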
czender commented 2 years ago

@rkouznetsov GBG uses a smarter algorithm than BG to determine the minimum number of keep bits to guarantee preserving an integral number of decimal digits of precision. The result is rounded using your bitround code. GBG was made the default lossy compression in NCO in 5.0.3 (October).

rkouznetsov commented 2 years ago

Thank you, @czender! Is there any difference between GBG and Decimal Rounding? @rsignell-usgs, good luck with the code! Lossy compression is a powerful tool, but it is irreversible, so one can easily shoot oneself in the foot. In the paper I tried to point out that the needed number of keepbits depends on the intended application of the data. @milankl et al. approach this problem from the other side, trying to distinguish between informative and non-informative bits by considering the input data only.

czender commented 2 years ago

@rkouznetsov We have to be careful with the terminology because there are so many similar names for lossy algorithms. I am not sure exactly what "DecimalRounding" refers to. GBG uses a continuous (no lookup table), bit-oriented version of the approach called DigitRound by Delaunay et al. to determine the keep bits. The results are then IEEE-rounded instead of (as Delaunay et al. do, AFAICT) being quantized to zero. The code is option baa=4 in nco_ppc.c, which you are familiar with.

rkouznetsov commented 2 years ago

@milankl, Thank you for the paper! It is a very interesting concept.

Could you please confirm that cloud quantities in CAMS-GLOB indeed need only one (implicit) bit of mantissa? Such trimming in meteo data would have severe implications for our simulations with SILAM, and I believe these quantities should have more bits. It would be great if I could reproduce your result locally. Did you use the CAMS data from MARS (via GRIB directly, grib2netcdf conversion, or something else) for your evaluation, or was it raw double-precision data?

As a side comment (do not take it too seriously): using the size of double precision as a reference makes the results twice as impressive. I should have done that in my paper as well. :) Honestly, I could not imagine people storing large amounts of double-precision data on disk/tape ...

rabernat commented 2 years ago

I have turned @rkouznetsov's 8 lines of code into a new codec in #299 and copied some of @milankl's rounding tests. It seems to be kind of working, but there are some things I don't understand. I'm new to bit manipulation, so I would really appreciate some help. I have some specific questions over in that PR.

milankl commented 2 years ago

Could you please confirm that cloud quantities in CAMS-GLOB indeed need only one (implicit) bit of mantissa? Such trimming in meteo data would have severe implications for our simulations with SILAM, and I believe these quantities should have more bits. It would be great if I could reproduce your result locally. Did you use the CAMS data from MARS (via GRIB directly, grib2netcdf conversion, or something else) for your evaluation, or was it raw double-precision data?

The data we used was obtained by circumventing the linear quantisation compression that's otherwise applied to GRIB on MARS; you can find a description (and DOI) in the data availability statement. Hence, the output data was only compressed by converting from double precision to single precision. Calculating the bitwise information from previously linear-quantised data (the default for GRIB on MARS) comes with issues, described as "artificial information" in the discussion of our paper.

You are absolutely right that we haven't fully understood yet what the bitwise information means for the cloud/precipitation quantities (crwc, cloud rain water content, etc.), and I'd be more than happy to be involved in any critical discussion on this. If you plan to investigate this further, I'd be keen to know what you find out! Cloud cover cc has only 255 different values as output by the IFS model, even without any compression applied; I believe crwc, cswc, ciwc and clwc are similar. I don't understand yet whether we are facing a limitation of the method here, or whether this is more a question of interpretation.

As a side comment (do not take it too seriously): using the size of double precision as a reference makes the results twice as impressive. I should have done that in my paper as well. :) Honestly, I could not imagine people storing large amounts of double-precision data on disk/tape ...

Haha, yes, we argued about that. I know that compression factors relative to 64 bits sound ridiculously high, as it's not a typical storage format. But even if the conversion from 64 bits (if that's what the model uses for calculations) to 32 bits sounds trivial, it is a compression in itself. So, we just found it more consistent to provide compression factors relative to no compression, and not relative to some "default" compression that's less clearly defined across applications.

observingClouds commented 2 years ago

Thanks @milankl for your great work, and @rabernat for starting the discussion here and adding the PR. I'm really interested in this topic and used your BitRounding class to apply it to our simulation output (ICON). The compression results with BitRounding as a filter plus zstd with bit shuffling are really great. There is huge interest in this PR and I'm happy to help where I can. I also realised that the information content for properties like surface rain amount has to be interpreted with care. Nevertheless, that should not hold up this PR, right?

milankl commented 2 years ago

Yes, I believe the first step is to get the bitrounding + some lossless codecs supported. That's the only required step to do the actual compression, assuming you already know how many bits are of interest.

Yesterday I released BitInformation.jl v0.4, which is supposed to be the reference implementation for analysing the bitwise real information content. In the long term, projects like NCO or numcodecs may implement their own version or wrap around the Julia one, but I don't think that's a decision to make now, nor one for me to make. However, as you point out, there are datasets for which the interpretation of this bitwise information is not trivial. While in the long run it would be amazing to have this pipeline automated, I believe for now these should be two rather independent steps. Hence, I encourage everyone to raise an issue in BitInformation.jl if they apply it to their own data. This should help.