constantinpape opened 5 years ago
I hope I represented the discussions we had on this correctly; if not, please feel free to jump in. Also, several libraries that deal with type encoding were mentioned:
In this context, we should also keep in mind the NetCDF data model.
One point I want to chime in on:
- Is the chunk size (shape in python lingo) fixed or variable? Variable size would require header.
I don't see that variable-sized chunks ⟹ chunk-headers, if that's what the last sentence means.
I was imagining variably-sized chunks still conforming to a hyper-rectangular grid where [the chunk-size along a given dimension] is only dependent on [the chunk's index along that axis].
A notable benefit: the .zarray chunks field stays compact.
Suppose we have a 3-D dataset where the chunk sizes are (1,10,100,10,1) along dimension 1, (200,20,2,22) along dimension 2, and (3,30,300,30,3) along dimension 3. The .zarray would look like:
{
…,
"chunks": [
[1,10,100,10,1],
[200,20,2,22],
[3,30,300,30,3]
],
…
}
There are 100 (5×4×5) chunks but we only require 14 (5+4+5) integers in .zarray to express this scheme.
With n dimensions, where k_i is the number of chunks along dimension i, .zarray's chunks field would be an array of n int-arrays, where the i-th int-array has k_i elements. That would contain (k_1 + k_2 + … + k_n) integers in total, which is much smaller than the number of chunks (k_1 · k_2 · … · k_n), and would always be quite manageable.
In the worst case, the number of integers in .zarray is comparable to the number of chunk files you've also written to disk; the former would never be the problem.
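To make the indexing concrete, here is a minimal Python sketch (hypothetical helper name, using numpy; it assumes the per-dimension chunks layout proposed above) of how a global element index maps to a chunk index and an in-chunk offset. Only the cumulative sums of the per-dimension chunk sizes are needed, so lookup stays cheap.

```python
import numpy as np

# Per-dimension chunk sizes from the example .zarray above.
chunks = [
    [1, 10, 100, 10, 1],
    [200, 20, 2, 22],
    [3, 30, 300, 30, 3],
]

def locate(element_index, chunks):
    """Hypothetical helper: map a global element index to
    (chunk index, offset within that chunk) per dimension."""
    chunk_idx, in_chunk = [], []
    for idx, sizes in zip(element_index, chunks):
        edges = np.cumsum(sizes)                    # end offset of each chunk
        c = int(np.searchsorted(edges, idx, side="right"))
        start = 0 if c == 0 else int(edges[c - 1])  # start offset of chunk c
        chunk_idx.append(c)
        in_chunk.append(idx - start)
    return tuple(chunk_idx), tuple(in_chunk)

# Element (5, 210, 5) lives in chunk (1, 1, 1) at offset (4, 10, 2).
print(locate((5, 210, 5), chunks))
```

Slicing works the same way: intersect the requested range with the per-dimension chunk edges to find the chunks involved.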
I was part of discussions last year where we brainstormed this kind of "heterogeneous grid" layout, but I can't find it on the Zarr repo; it may have been just with @tomwhite / @laserson or some CZI folks (@mckinsel?).
The impetus was that, in a Spark/Hadoop context, we might load a (1-D or 2-D, say) dataset, filter out some elements (1-D) or rows/cols (2-D), and want to save the resulting dataset to disk, but would incur an expensive "shuffle" stage to get back to evenly-spaced chunks.
Requiring perfectly-evenly-shaped chunks will add significant overhead in such settings.
[Supporting a "heterogeneous grid" like I'm describing] should be orthogonal to [supporting other, more flexible variable-chunk-sizing schemes].
Thinking about it more, though, I don't see how you can really do any variable-chunk-sizing that doesn't conform to a "grid" like I'm describing… indexing/slicing quickly become undefined, unless I'm missing something?
Interested in others' thoughts! Sorry if some of this is covered elsewhere already. (xref: https://github.com/zarr-developers/zarr-specs/issues/40)
One point I want to chime in on:
- Is the chunk size (shape in python lingo) fixed or variable? Variable size would require header.
I don't see that variable-sized chunks ⟹ chunk-headers, if that's what the last sentence means.
Yes, you're correct. I changed the sentence to "Variable chunk size would require header or information about chunks in the metadata."
Also, the heterogeneous-grid approach is very interesting. It could also be very useful for the prepend / append use case we discussed.
Thanks @constantinpape for raising this, sorry for coming late to the party.
Just wanted to xref this issue: https://github.com/zarr-developers/zarr/issues/245 - @jakirkham raised a requirement for the heterogeneous grid (a.k.a., non-uniform chunking), the use case being to store dask arrays without having to rechunk to a uniform grid.
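For illustration, a minimal dask sketch (assuming a recent dask.array; nothing zarr-specific) of how non-uniform chunks arise naturally from slicing, which is roughly the situation described in zarr-developers/zarr#245:

```python
import dask.array as da

# A uniformly chunked 1-D array: 20 elements in chunks of 5.
x = da.ones(20, chunks=5)
print(x.chunks)   # ((5, 5, 5, 5),)

# Slicing yields a perfectly valid dask array whose chunks are no longer uniform.
y = x[3:17]
print(y.chunks)   # ((2, 5, 5, 2),)
```

With today's spec, persisting y to zarr means rechunking back to a uniform grid first; with a heterogeneous grid, the tuple-of-tuples in y.chunks could be written to the array metadata as-is.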
I'd also suggest we aim to break this issue up into a number of separate issues, each of which can come to a decision point. I don't have any concrete suggestions for how to do that right now, but will give it some thought.
Following up on today's call and #3: we need to define a specification for how chunks are represented in memory before they go through (compression) filters and storage.
Minimum requirement: a chunk can store nd-tensors of primitive datatypes. There was also consensus to support big- and little-endian data (and C/F layout where appropriate).
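As a concrete (non-normative) illustration of that minimum requirement, a numpy sketch of encoding and decoding one chunk with an explicit endianness and layout; the dtype and shape here are arbitrary:

```python
import numpy as np

shape = (2, 3)

# A chunk holding a primitive datatype, here big-endian int32 in C layout.
chunk = np.arange(6, dtype=">i4").reshape(shape)

# Encoding: the raw in-memory bytes, before any (compression) filters run.
raw = chunk.tobytes(order="C")

# Decoding: dtype, shape and layout come from the array metadata, so the
# bytes can be reinterpreted directly.
decoded = np.frombuffer(raw, dtype=">i4").reshape(shape)
assert np.array_equal(chunk, decoded)
```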
On top of that, we discussed the following questions:
Regarding 2: Use case 1 is storing edge chunks that are not fully covered; @axtimwalde pointed out that this allows direct mapping to memory without copying data in the n5-imglib implementation. Use case 2 is appending / prepending to datasets; this could be used to implement prepending without modifying existing chunks. Note that one of @alimanfoo's motivations to NOT implement variable chunk sizes was to always have valid chunks when appending to a dataset.
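A small worked example (plain Python, made-up numbers) of why the append/prepend case benefits from variable chunk sizes:

```python
# Fixed chunk size 10 over 105 elements: the last (edge) chunk covers only
# 5 elements, so it is either padded or needs its true extent recorded.
fixed = [10] * 10 + [5]

# Prepending 3 elements under fixed-size chunking shifts every chunk
# boundary, so all existing chunks would have to be rewritten.  With
# variable chunk sizes, a new leading chunk absorbs the prepend and the
# existing chunks stay untouched:
variable = [3] + [10] * 10 + [5]

assert sum(fixed) == 105 and sum(variable) == 108
```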
Regarding 3: The n5 use cases we discussed ranged from simple examples, like storing the unique values of the spatial block corresponding to a chunk, to more complicated ones, like the n5-label-multiset. This could also be useful to define non-primitive datatypes, e.g. strings encoded via offsets and values; see also 4.
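For the strings-encoded-via-offsets-and-values idea, a small numpy sketch of one possible encoding (similar in spirit to Arrow's variable-length layout; this is an assumption, not an existing n5 or zarr API):

```python
import numpy as np

strings = ["chunk", "zarr", "n5"]

# Two primitive arrays: all string bytes concatenated, plus the end offset
# of each string.
values = np.frombuffer("".join(strings).encode("utf-8"), dtype=np.uint8)
offsets = np.cumsum([len(s.encode("utf-8")) for s in strings]).astype(np.uint64)

def get(i):
    """Decode string i back out of the two primitive arrays."""
    start = 0 if i == 0 else int(offsets[i - 1])
    return bytes(values[start:int(offsets[i])]).decode("utf-8")

assert [get(i) for i in range(len(strings))] == strings
```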
Regarding 4: Several additional datatypes that could be supported came up during the discussion:
More generally, there is the question of how we could provide a mechanism for extensions to the spec that define new datatypes. In the current zarr implementation, numpy arrays of objects can be stored via a special filter, see #6. In the current n5 implementation, non-primitive datatypes can be encoded into a varlength chunk and then need to be decoded again with a separate library (i.e. not part of n5-core).
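For reference, this is roughly how the object-array filter mentioned above looks in the current (v2) zarr implementation; details may differ in newer versions:

```python
import zarr
import numcodecs

# Object arrays of variable-length strings are stored through a filter
# (VLenUTF8) rather than as a first-class datatype in the spec.
z = zarr.array(["a", "bb", "ccc"], dtype=object, object_codec=numcodecs.VLenUTF8())
print(z[:])   # ['a' 'bb' 'ccc']
```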