zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

Default compression settings #8

Open mrocklin opened 8 years ago

mrocklin commented 8 years ago

I have noticed performance increases in other projects when I choose default compression settings based on dtype.

Optimal compression settings depend strongly on bit patterns, and the dtype is often a strong hint about those patterns. For example, integers often benefit more from compression than floats, and datetimes are often nearly sorted and so benefit more from the shuffle filter.

It might improve performance to change the compression defaults in defaults.py to come from a function that takes the dtype as an input.

alimanfoo commented 8 years ago

Thanks @mrocklin, nice thought. Do you think you know enough to be able to propose a concrete implementation of that function? Or would it need some discussion and/or input from others?

mrocklin commented 8 years ago

These were the defaults that I was using in castra. They came from some ad-hoc benchmarking on the NYCTaxi dataset. Mostly I found that, for floating point values, intense compression was of marginal value.

import numpy as np
import bloscpack

def blosc_args(dt):
    # Integers and datetimes compress well: higher level plus shuffle.
    if np.issubdtype(dt, np.integer):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    if np.issubdtype(dt, np.datetime64):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    # Floats gained little from heavy compression: light level, no shuffle.
    if np.issubdtype(dt, np.floating):
        return bloscpack.BloscArgs(dt.itemsize, clevel=1, shuffle=False)
    return None
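
For example, calling it with a couple of dtypes (assuming the function above and bloscpack installed):

>>> import numpy as np
>>> blosc_args(np.dtype('i8'))   # integers: clevel=3 with shuffle
>>> blosc_args(np.dtype('f8'))   # floats: clevel=1, no shuffle
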
alimanfoo commented 8 years ago

Do you apply this for all compressors or just blosclz?

mrocklin commented 8 years ago

Castra only used blosclz, I think.

mrocklin commented 8 years ago

I wouldn't take too much from that project. The general lesson learned was that compression was way more useful on ints/datetimes than on the floating point data that I was looking at at the time.

alimanfoo commented 8 years ago

Thanks, it's well worth having this knowledge captured somewhere, even if only in documentation. I have other snippets like this: for (integer) genotype data, zlib level 1 gets you good compression at good speed, while increasing the compression level above that adds very little and slows things down a lot. This may not generalise to other datasets, of course.
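
As an illustration, a minimal sketch of that setting using the current zarr/numcodecs API (an assumption on my part; the thread predates these exact class names, and the genotype array here is synthetic):

import numpy as np
import zarr
from numcodecs import Zlib

# Synthetic genotype-like data: small integers, one byte each.
genotypes = np.random.randint(0, 4, size=(10000, 1000), dtype='i1')

# zlib at level 1: good compression at good speed for this kind of data.
z = zarr.array(genotypes, chunks=(1000, 1000), compressor=Zlib(level=1))
print(z.info)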

alimanfoo commented 8 years ago

I don't feel I have enough experience across a range of datasets to be confident about defining good compression defaults based on dtype alone at the moment. In my limited experience a lot also depends on the correlation structure in the data, so patterns from one dataset may not generalise to another.

I propose to close this issue for now but reopen in future if some clear recommendations emerge supported by experiences with a variety of data.

If anyone finds this in the meantime, feel free to add thoughts on what the default compression settings should be. The defaults are currently fixed: the blosclz compressor with compression level 5 and the byte shuffle filter.
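
Expressed with numcodecs names (an assumption for illustration; the fixed default described above would correspond to something like this):

from numcodecs import Blosc

# The current fixed default: blosclz backend, level 5, byte shuffle.
default_compressor = Blosc(cname='blosclz', clevel=5, shuffle=Blosc.SHUFFLE)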

mrocklin commented 8 years ago

I wonder if @falted could jump in here with a few sentences about how he would set compression defaults knowing only the dtype.

alimanfoo commented 8 years ago

A pretty obvious rule would be to use the bitshuffle filter instead of byte shuffle for single-byte dtypes.
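
A sketch of that rule (using numcodecs names as an assumption; byte shuffle is a no-op at itemsize 1, whereas bit shuffle can still group common bit planes together):

import numpy as np
from numcodecs import Blosc

def default_shuffle(dt):
    # For 1-byte dtypes, shuffling whole bytes does nothing; prefer bit shuffle.
    if dt.itemsize == 1:
        return Blosc.BITSHUFFLE
    return Blosc.SHUFFLE

compressor = Blosc(cname='blosclz', clevel=5, shuffle=default_shuffle(np.dtype('i1')))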

FrancescAlted commented 8 years ago

Revisiting these issues I stumbled upon this (BTW, @falted is not my nickname on GitHub). I agree that making too many assumptions about compression parameters based on dtype alone is risky. In fact, even though shuffle is active by default in Blosc, it is not unusual to find datasets that compress better without it.

Also, I'm +1 on Alistair's suggestion to activate the bitshuffle filter for single-byte dtypes (though I am not sure whether this would be beneficial for string data).