zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

Consider support for storing chunks in z-order #40

Open · shoyer opened this issue 8 years ago

shoyer commented 8 years ago

Apparently, this can result in significant IO savings when indexing multi-dimensional arrays.

See http://bl.ocks.org/jaredwinick/5073432 (note that the graphic is interactive)
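
For context, a z-order (Morton) index is built by interleaving the bits of the coordinates, so points that are close in N dimensions tend to stay close in the 1-D ordering. A minimal sketch of the 2-D case (illustrative only, not zarr code):

```python
def interleave_bits(y, x, nbits=16):
    """Compute the z-order (Morton) index of a 2-D coordinate by
    interleaving the bits of y and x."""
    z = 0
    for i in range(nbits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Nearby points in 2-D stay nearby in the 1-D z-order index:
# (0, 0) -> 0, (0, 1) -> 1, (1, 0) -> 2, (1, 1) -> 3, (0, 2) -> 4, ...
```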

alimanfoo commented 8 years ago

That's a neat trick. However, unless I'm missing something (very possible), I think that simply chunking an array will produce a very similar effect. I.e., if you have an array with shape (100, 100) and you set the chunk shape to (10, 10), then a request to retrieve data for the range [5:15, 5:15] will only require accessing 4 chunks.
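
For illustration, the set of chunks a slice touches follows directly from integer division of the slice bounds by the chunk shape (a quick sketch, not zarr code):

```python
import itertools

def chunks_touched(selection, chunk_shape):
    """Return the chunk indices overlapped by a tuple of slices."""
    ranges = [range(sl.start // c, (sl.stop - 1) // c + 1)
              for sl, c in zip(selection, chunk_shape)]
    return list(itertools.product(*ranges))

# For a (100, 100) array with (10, 10) chunks, the slice [5:15, 5:15]
# touches only 4 of the 100 chunks:
print(chunks_touched((slice(5, 15), slice(5, 15)), (10, 10)))
# [(0, 0), (0, 1), (1, 0), (1, 1)]
```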


alimanfoo commented 8 years ago

Sorry, I accidentally sent the last comment incomplete. Basically I think that chunking gives you data locality when taking multidimensional slices of an array, depending on the chosen chunk shape. So it's not obvious to me how z-order could be used with chunked arrays to improve I/O.

Cc @benjeffery


shoyer commented 8 years ago

This would be for storing data within chunks, assuming support for partially reading chunks (possible with many backends). Given the fixed overhead for accessing each individual chunk, there is a minimum chunk size below which it doesn't make sense to chunk any further (depending on details, perhaps somewhere in the range of 1e3 to 1e6 elements). This could yield significant benefits in that case.
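
To make "partially reading chunks" concrete: for an uncompressed chunk of a fixed-size dtype, a backend that supports range reads can seek straight to the needed bytes. A minimal sketch under those assumptions (not existing zarr functionality):

```python
import numpy as np

def read_partial(f, chunk_shape, dtype, row):
    """Read a single row of a C-order, uncompressed 2-D chunk without
    loading the rest of the chunk. `f` is an open binary file."""
    itemsize = np.dtype(dtype).itemsize
    ncols = chunk_shape[1]
    f.seek(row * ncols * itemsize)           # byte offset of the row
    return np.frombuffer(f.read(ncols * itemsize), dtype=dtype)
```

With compression in the way, the analogous operation is what blosc_getitem provides, as discussed below.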

alimanfoo commented 8 years ago

Ah OK. So what's the main use case? Taking a multidimensional slice where the requested region is entirely within a single chunk?

Just so I'm completely with you, could you expand a bit more on what you mean by the second sentence ("Given the overhead for accessing each due...")?


shoyer commented 8 years ago

> Ah OK. So what's the main use case? Taking a multidimensional slice where the requested region is entirely within a single chunk?

A more common use case might be any slicing operation that is poorly aligned with the chunking, even if it involves multiple chunks. For example, suppose I have an array with chunks of shape (100, 100, 100), and I want to index out a single point along the first two dimensions, e.g., x[i, j, :]. Instead of needing to read each full chunk along the first two dimensions, I could read much less on average.
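
To make the alignment issue concrete: with the usual C-order layout inside a chunk, locality depends heavily on which axis you slice, whereas a z-order layout would trade those extremes for intermediate locality on every axis. A sketch of the offsets involved (illustrative only):

```python
# Flat element offsets inside one C-order chunk of shape (100, 100, 100):
s0, s1, s2 = 100 * 100, 100, 1   # C-order strides, in elements

# x[i, j, :] -> one contiguous run of 100 elements:
print([0 * s0 + 0 * s1 + k * s2 for k in range(3)])   # [0, 1, 2]

# x[:, j, k] -> 100 reads spaced 10_000 elements apart, spanning
# nearly the whole 1e6-element chunk:
print([i * s0 + 0 * s1 + 0 * s2 for i in range(3)])   # [0, 10000, 20000]
```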

I don't have a direct use case for this right now (since I'm not actually using zarr yet), but I can see lots of examples where this sort of thing might be useful.

> Just so I'm completely with you, could you expand a bit more on what you mean by the second sentence ("Given the overhead for accessing each due...")?

Oops, I had a typo (fixed now). What I was referring to is that there is some fixed overhead associated with storing and manipulating each chunk in every storage and task scheduling system, which is why we don't chunk arrays into single values.
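
As a back-of-envelope check on that range (assumed numbers, purely illustrative):

```python
latency = 10e-3          # s of fixed overhead per chunk request (assumed)
bandwidth = 100e6        # bytes/s once a read is under way (assumed)
itemsize = 8             # bytes per element

# Chunk size where fixed overhead equals transfer time:
break_even = latency * bandwidth / itemsize
print(break_even)        # 125000.0 elements -- within the 1e3-1e6 range
```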

alimanfoo commented 8 years ago

Thanks Stephan. Just to add that this relates to how Zarr might make use of the blosc_getitem function to extract partial contents of a chunk. I know bcolz uses blosc_getitem, but things are simpler there because it only deals with 1-D arrays; so far I haven't had the brain space to figure out how to use it for multidimensional arrays.

shoyer commented 8 years ago

blosc_getitem would indeed be a good fit here. We would need some method for converting a tuple of range selections into start and nitems arguments, and then loading the appropriate items into the output array.
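
One possible shape for that conversion, assuming a C-order chunk: decompose the selection into contiguous runs along the last axis, each of which maps to one (start, nitems) pair. A sketch (selection_to_runs and flat_offset are hypothetical helpers, not a proposed API):

```python
import itertools

def flat_offset(idx, shape):
    """C-order flat index of a multi-index."""
    off = 0
    for i, n in zip(idx, shape):
        off = off * n + i
    return off

def selection_to_runs(selection, chunk_shape):
    """Decompose a tuple of slices over a C-order chunk into contiguous
    (start, nitems) pairs, one per blosc_getitem call."""
    *outer, last = selection
    nitems = last.stop - last.start
    for idx in itertools.product(*(range(s.start, s.stop) for s in outer)):
        yield flat_offset(idx + (last.start,), chunk_shape), nitems

# A [2:4, 5:8] selection on a (10, 10) chunk becomes two runs of 3 items:
print(list(selection_to_runs((slice(2, 4), slice(5, 8)), (10, 10))))
# [(25, 3), (35, 3)]
```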

alimanfoo commented 7 years ago

I don't have the bandwidth to explore this myself, but I am very happy to discuss further if there is interest and someone else has time. FWIW I think a natural starting point would be to add support for a codec operation analogous to blosc_getitem, i.e., decoding items from some sub-range of a chunk. This would need to be generalised and added as a method on the Codec class. The Codec abstract base class could provide a default implementation that falls back to decoding and then slicing the whole chunk, for codecs that cannot provide an efficient implementation. How this would work if there were filters (i.e., multiple codecs) on an array, I am uncertain. Once this infrastructure was in place, I guess a Zorder codec class could be implemented as a pre-compression filter, providing the z-order transformation and mapping getitem calls down to the next level? Or maybe it's not that simple; I can't see the whole architecture immediately.
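
A sketch of what that default fallback could look like (the Codec class and decode_partial method here are hypothetical, not zarr's actual interface):

```python
import numpy as np

class Codec:
    """Illustrative base class; zarr's real codec interface differs."""

    def decode(self, buf):
        raise NotImplementedError

    def decode_partial(self, buf, start, nitems, dtype):
        # Default: decode the whole chunk, then slice out the requested
        # items. Codecs that can do better (e.g. via blosc_getitem)
        # would override this.
        full = np.frombuffer(self.decode(buf), dtype=dtype)
        return full[start:start + nitems]
```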

Btw, in the next release of Zarr all the code for compression and filter codecs will be moved out into a new numcodecs package and obtained via a dependency, so this issue may live more naturally there as a codec issue.

One other note: I think that although z-order might speed up access for small regions of an array, it would incur a cost for reading larger regions, because all read and write operations would need to pass through the z-order transformation.