Open mrocklin opened 8 years ago
Thanks @mrocklin, nice thought. Do you think you know enough to be able to propose a concrete implementation of that function? Or would it need some discussion and/or input from others?
These were the defaults that I was using in castra. They came from some ad-hoc benchmarking on the NYCTaxi dataset. Mostly I found that, for floating point values, intense compression was of marginal value.
import numpy as np
import bloscpack

def blosc_args(dt):
    if np.issubdtype(dt, int):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    if np.issubdtype(dt, np.datetime64):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    if np.issubdtype(dt, float):
        return bloscpack.BloscArgs(dt.itemsize, clevel=1, shuffle=False)
    return None
Do you apply this for all compressors or just blosclz?
Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Web: http://purl.org/net/aliman Email: alimanfoo@googlemail.com alimanfoo@gmail.com Tel: +44 (0)1865 287721
Castra only used blosclz I think
I wouldn't take too much from that project. The general lesson learned was that compression was way more useful on ints/datetimes than on the floating point data that I was looking at at the time.
Thanks, it's well worth having this knowledge captured somewhere at least, even if only in documentation. I have other snippets of lore like this: for (integer) genotype data, zlib level 1 gets you good compression with good speed; increasing the compression level above that adds very little while slowing things down a lot. This may not generalise to other datasets, of course.
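The zlib observation above is easy to check for yourself with the standard library. Here's a rough sketch using synthetic genotype-like bytes (a made-up distribution for illustration, not real genotype data, so the exact ratios won't match any particular dataset):

```python
import random
import zlib

# Synthetic stand-in for integer genotype data: one byte per call,
# skewed towards 0/1 values. Purely illustrative.
random.seed(42)
calls = bytes(random.choices([0, 1, 2], weights=[70, 25, 5], k=1_000_000))

# Compare compressed size at a few zlib levels.
sizes = {level: len(zlib.compress(calls, level)) for level in (1, 6, 9)}
for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({len(calls) / size:.1f}x)")
```

On data like this, level 1 already achieves most of the achievable ratio, and higher levels shave off relatively little while costing more CPU, which is consistent with the observation above.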
I don't feel I have enough experience with a range of datasets to be confident about defining good compression defaults based on dtype alone at the moment. In my limited experience, a lot also depends on the correlation structure in the data, so patterns from one dataset may not generalise to another.
I propose to close this issue for now but reopen in future if some clear recommendations emerge supported by experiences with a variety of data.
If anyone finds this in the meantime, feel free to add thoughts on what the default compression settings should be. The defaults are currently fixed as the blosclz compressor with compression level 5 and the byte shuffle filter.
I wonder if @falted could jump in here with a few sentences about how he would set compression defaults knowing only the dtype.
A pretty obvious rule would be to use the bitshuffle filter instead of byte shuffle for single byte dtypes.
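The reasoning behind that rule can be sketched in a few lines: byte shuffle rearranges bytes across elements and is effectively a no-op for 1-byte dtypes, so bit shuffle is the natural fallback there. The helper name `default_shuffle` is hypothetical (not zarr's API); the constants mirror the values used by python-blosc (NOSHUFFLE=0, SHUFFLE=1, BITSHUFFLE=2):

```python
# Shuffle filter constants as used by python-blosc.
NOSHUFFLE, SHUFFLE, BITSHUFFLE = 0, 1, 2

def default_shuffle(itemsize):
    """Pick a shuffle filter from the dtype's item size alone.

    Byte shuffle cannot help a 1-byte dtype (there is nothing to
    transpose), so fall back to bit shuffle in that case.
    """
    return BITSHUFFLE if itemsize == 1 else SHUFFLE
```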
Revisiting these issues I stumbled upon this (BTW @falted is not my nickname on GitHub). I agree that making too many assumptions about compression parameters based on dtype is risky. In fact, even though in Blosc the shuffle filter is active by default, it is not unusual to find datasets that work better without it.
Also, I'm +1 on Alistair's suggestion to activate the bitshuffle filter for single-byte dtypes (but still, I am not sure whether this would be beneficial for string data).
I have noticed performance increases in other projects when I choose default compression settings based on dtype.
Optimal compression settings depend strongly on bit patterns. Data types often strongly indicate bit pattern characteristics. For example integers often benefit more from compression than floats. Datetimes are often nearly sorted and so benefit more from shuffle.
It might improve performance to change the compression defaults in defaults.py to come from a function that takes the dtype as an input.
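A minimal sketch of that proposal, using the castra heuristics quoted earlier (clevel 3 with shuffle for ints and datetimes, clevel 1 without shuffle for floats, and the current fixed defaults otherwise). The function name `default_compression` and the returned dict are hypothetical illustrations, not zarr's actual API:

```python
import numpy as np

def default_compression(dtype):
    """Suggest compression settings from a dtype alone (sketch only)."""
    dtype = np.dtype(dtype)
    if dtype.kind in "iuM":
        # Ints, unsigned ints, datetimes: compress harder, with shuffle.
        return {"clevel": 3, "shuffle": True}
    if dtype.kind == "f":
        # Floats compressed poorly in the castra benchmarks; stay cheap.
        return {"clevel": 1, "shuffle": False}
    # Everything else: the current fixed defaults (clevel 5, byte shuffle).
    return {"clevel": 5, "shuffle": True}

print(default_compression("i8"))
print(default_compression("f8"))
```

As the discussion above notes, any such mapping is a guess about correlation structure, so it would only ever be a default that users can override.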