constantinpape opened 5 years ago
I hope I represented the discussions we had on this correctly; if not, please feel free to jump in. Also, several libraries that deal with type encoding were mentioned:
In this context, we should also keep in mind the NetCDF data model.
One point I want to chime in on:
- Is the chunk size (shape in python lingo) fixed or variable? Variable size would require header.
I don't see that variable-sized chunks ⟹ chunk-headers, if that's what the last sentence means.
I was imagining variably-sized chunks still conforming to a hyper-rectangular grid where [the chunk-size along a given dimension] is only dependent on [the chunk's index along that axis].
A notable benefit: the .zarray chunks field stays compact.
Suppose we have a 3-D dataset where the chunk sizes are (1,10,100,10,1) along dimension 1, (200,20,2,22) along dimension 2, and (3,30,300,30,3) along dimension 3. The .zarray would look like:
{
…,
"chunks": [
[1,10,100,10,1],
[200,20,2,22],
[3,30,300,30,3]
],
…
}
There are 100 (5×4×5) chunks but we only require 14 (5+4+5) integers in .zarray to express this scheme.
With n dimensions, where k_i is the number of chunks along dimension i, .zarray's chunks field would be an array of n int-arrays, where the i-th int-array has k_i elements. That would contain (k_1 + k_2 + … + k_n) integers in total, which is much smaller than the number of chunks (k_1 · k_2 · … · k_n), and would always be quite manageable.
In the worst case, the number of integers in .zarray is comparable to the number of chunk files you've also written to disk; the former would never be the problem.
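To make the indexing concrete, here is a minimal Python sketch (hypothetical helper name, using numpy; it assumes the per-dimension chunks layout proposed above) of how a global element index maps to a chunk index and an in-chunk offset. Only the cumulative sums of the per-dimension chunk sizes are needed, so lookup stays cheap.

```python
import numpy as np

# Per-dimension chunk sizes from the example .zarray above.
chunks = [
    [1, 10, 100, 10, 1],
    [200, 20, 2, 22],
    [3, 30, 300, 30, 3],
]

def locate(element_index, chunks):
    """Hypothetical helper: map a global element index to
    (chunk index, offset within that chunk) per dimension."""
    chunk_idx, in_chunk = [], []
    for idx, sizes in zip(element_index, chunks):
        edges = np.cumsum(sizes)                    # end offset of each chunk
        c = int(np.searchsorted(edges, idx, side="right"))
        start = 0 if c == 0 else int(edges[c - 1])  # start offset of chunk c
        chunk_idx.append(c)
        in_chunk.append(idx - start)
    return tuple(chunk_idx), tuple(in_chunk)

# Element (5, 210, 5) lives in chunk (1, 1, 1) at offset (4, 10, 2).
print(locate((5, 210, 5), chunks))
```

Slicing works the same way: intersect the requested range with the per-dimension chunk edges to find the chunks involved.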
I was part of discussions last year where we brainstormed this kind of "heterogeneous grid" layout, but I can't find it on the Zarr repo; it may have been just with @tomwhite / @laserson or some CZI folks (@mckinsel?).
The impetus was that, in a Spark/Hadoop context, we might load a (1-D or 2-D, say) dataset, filter out some elements (1-D) or rows/cols (2-D), and want to save the resulting dataset to disk, but would incur an expensive "shuffle" stage to get back to evenly-spaced chunks.
Requiring perfectly-evenly-shaped chunks will add significant overhead in such settings.
[Supporting a "heterogeneous grid" like I'm describing] should be orthogonal to [supporting other, more flexible variable-chunk-sizing schemes].
Thinking about it more, though, I don't see how you can really do any variable-chunk-sizing that doesn't conform to a "grid" like I'm describing… indexing/slicing quickly become undefined, unless I'm missing something?
Interested in others' thoughts! Sorry if some of this is covered elsewhere already. (xref: https://github.com/zarr-developers/zarr-specs/issues/40)
One point I want to chime in on:
- Is the chunk size (shape in python lingo) fixed or variable? Variable size would require header.
I don't see that variable-sized chunks ⟹ chunk-headers, if that's what the last sentence means.
Yes, you're correct. I changed the sentence to "Variable chunk size would require header or information about chunks in the metadata."
Also, the heterogeneous-grid approach is very interesting. It could also be very useful for the prepend / append use case we discussed.
Thanks @constantinpape for raising this, sorry for coming late to the party.
Just wanted to xref this issue: https://github.com/zarr-developers/zarr/issues/245 - @jakirkham raised a requirement for the heterogeneous grid (a.k.a., non-uniform chunking), the use case being to store dask arrays without having to rechunk to a uniform grid.
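For illustration, a minimal dask sketch (assuming a recent dask.array; nothing zarr-specific) of how non-uniform chunks arise naturally from slicing, which is roughly the situation described in zarr-developers/zarr#245:

```python
import dask.array as da

# A uniformly chunked 1-D array: 20 elements in chunks of 5.
x = da.ones(20, chunks=5)
print(x.chunks)   # ((5, 5, 5, 5),)

# Slicing yields a perfectly valid dask array whose chunks are no longer uniform.
y = x[3:17]
print(y.chunks)   # ((2, 5, 5, 2),)
```

With today's spec, persisting y to zarr means rechunking back to a uniform grid first; with a heterogeneous grid, the tuple-of-tuples in y.chunks could be written to the array metadata as-is.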
I'd also suggest we aim to break this issue up into a number of separate issues, each of which can come to a decision point. I don't have any concrete suggestions for how to do that right now, but will give it some thought.
Following up on today's call and #3: we need to define a specification for how chunks are represented in memory before they go through (compression) filters and storage.
Minimum requirement: a chunk can store nd-tensors of primitive datatypes. There was also consensus to support big- and little-endian data (and C/F layout where appropriate).
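As a concrete (non-normative) illustration of that minimum requirement, a numpy sketch of encoding and decoding one chunk with an explicit endianness and layout; the dtype and shape here are arbitrary:

```python
import numpy as np

shape = (2, 3)

# A chunk holding a primitive datatype, here big-endian int32 in C layout.
chunk = np.arange(6, dtype=">i4").reshape(shape)

# Encoding: the raw in-memory bytes, before any (compression) filters run.
raw = chunk.tobytes(order="C")

# Decoding: dtype, shape and layout come from the array metadata, so the
# bytes can be reinterpreted directly.
decoded = np.frombuffer(raw, dtype=">i4").reshape(shape)
assert np.array_equal(chunk, decoded)
```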
On top of that, we discussed the following questions:
Regarding 2: Use case 1 is storing edge chunks that are not fully covered; @axtimwalde pointed out that this allows direct mapping to memory without copying data in the n5-imglib implementation. Use case 2 is appending / prepending to datasets; this could be used to implement prepending without modifying existing chunks. Note that one of @alimanfoo's motivations to NOT implement variable chunk sizes was to always have valid chunks when appending to a dataset.
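A small worked example (plain Python, made-up numbers) of why the append/prepend case benefits from variable chunk sizes:

```python
# Fixed chunk size 10 over 105 elements: the last (edge) chunk covers only
# 5 elements, so it is either padded or needs its true extent recorded.
fixed = [10] * 10 + [5]

# Prepending 3 elements under fixed-size chunking shifts every chunk
# boundary, so all existing chunks would have to be rewritten.  With
# variable chunk sizes, a new leading chunk absorbs the prepend and the
# existing chunks stay untouched:
variable = [3] + [10] * 10 + [5]

assert sum(fixed) == 105 and sum(variable) == 108
```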
Regarding 3: The n5 use cases we discussed ranged from simple examples, like storing the unique values of the spatial block corresponding to a chunk, to more complicated ones, like the n5-label-multiset. This could also be useful to define non-primitive datatypes, e.g. strings encoded via offsets and values; see also 4.
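For the strings-encoded-via-offsets-and-values idea, a small numpy sketch of one possible encoding (similar in spirit to Arrow's variable-length layout; this is an assumption, not an existing n5 or zarr API):

```python
import numpy as np

strings = ["chunk", "zarr", "n5"]

# Two primitive arrays: all string bytes concatenated, plus the end offset
# of each string.
values = np.frombuffer("".join(strings).encode("utf-8"), dtype=np.uint8)
offsets = np.cumsum([len(s.encode("utf-8")) for s in strings]).astype(np.uint64)

def get(i):
    """Decode string i back out of the two primitive arrays."""
    start = 0 if i == 0 else int(offsets[i - 1])
    return bytes(values[start:int(offsets[i])]).decode("utf-8")

assert [get(i) for i in range(len(strings))] == strings
```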
Regarding 4: Several additional datatypes that could be supported came up during the discussion:
More generally, there is the question of how we could provide a mechanism for extensions to the spec that define new datatypes. In the current zarr implementation, numpy arrays of objects can be stored via a special filter, see #6. In the current n5 implementation, non-primitive datatypes can be encoded into a varlength chunk and then need to be decoded again with a separate library (i.e. not part of n5-core).
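For reference, this is roughly how the object-array filter mentioned above looks in the current (v2) zarr implementation; details may differ in newer versions:

```python
import zarr
import numcodecs

# Object arrays of variable-length strings are stored through a filter
# (VLenUTF8) rather than as a first-class datatype in the spec.
z = zarr.array(["a", "bb", "ccc"], dtype=object, object_codec=numcodecs.VLenUTF8())
print(z[:])   # ['a' 'bb' 'ccc']
```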