zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Self contained arrays #274

Open DennisHeimbigner opened 11 months ago

DennisHeimbigner commented 11 months ago

I get the impression that there is a hidden assumption in the spec that all the information about an array must be defined with the array. Two examples:

  1. named_dimensions are specific to a single array and have no meaning outside that array's direct metadata.
  2. data type extensions must be repeated with each array that uses them.

There are probably others that I have not yet spotted. In any case, if this is a hidden assumption, then it should be made explicit.

DennisHeimbigner commented 11 months ago

The discussion in issue https://github.com/zarr-developers/zarr-specs/issues/273 seems to validate my assumption that each array is completely independent of all other arrays and that its metadata is completely self-contained, including named dimensions and all extensions. I recall a discussion about this in a Zarr meeting a long time ago. The reason for this assumption is to support parallel processing at the array level, so that a processor need not access any other metadata and can operate on the array completely independently.

One consequence should be that groups serve only as namespaces and that the group's zarr.info should not need to exist. The one flaw in this is group-level attributes. It is unclear what a group-level attribute is for. If it affects the processing of an array in any way, then it violates the self-contained nature of arrays. If it is there to document the file, then it should be sufficient to have attributes only in the top-level group. Further, these attributes should have no consequence for processing an array.
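To make the self-contained property concrete, here is a minimal Python sketch, not taken from any implementation. It assumes a v3-style array metadata document with a regular chunk grid and the default chunk key encoding. The shape, chunk shape, and attribute values are made up for illustration, and `chunk_index_for` and `chunk_key` are hypothetical helper names. The point is only that a worker holding this one document can locate a chunk without reading any group metadata.

```python
import json

# Illustrative array metadata; all values here are assumptions for the example.
array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [100, 200],
    "data_type": "float32",
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [10, 20]}},
    "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
    "fill_value": 0.0,
    "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}],
    "dimension_names": ["y", "x"],
    # User metadata only; nothing below reads it to locate or decode chunks.
    "attributes": {"note": "arbitrary user metadata"},
}


def chunk_index_for(meta, element_index):
    """Map an element index to the index of the chunk that contains it."""
    chunk_shape = meta["chunk_grid"]["configuration"]["chunk_shape"]
    return tuple(i // c for i, c in zip(element_index, chunk_shape))


def chunk_key(meta, chunk_index):
    """Build the chunk key relative to the array, using the default encoding."""
    sep = meta["chunk_key_encoding"]["configuration"]["separator"]
    return sep.join(["c", *map(str, chunk_index)])


# Round-trip through JSON to mimic reading the document from a store.
doc = json.loads(json.dumps(array_metadata))
idx = chunk_index_for(doc, (35, 150))
key = chunk_key(doc, idx)
print(idx, key)
```

Nothing in this sketch consults a parent group, which is the independence the comment above describes.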

LDeakin commented 11 months ago

The spec says that array/group attributes are intended for storage of arbitrary user metadata. Attributes are not intended to change how arrays are encoded. If they did, that would not be supported by other implementations.

zoj613 commented 4 months ago

I have a question: how do implementations of the Zarr spec represent the underlying array data? Are chunks just normal in-memory array objects (e.g., a NumPy array in Python)? If so, what if a chunk is too big to fit in memory? Or is array manipulation only ever done by updating the metadata file? I'm interested in implementing the v3 spec in a functional language that doesn't have an implementation yet. I tried reading the spec, but it does not seem to give any example of how array chunks are represented in practice.
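As a rough illustration of the question, the Python sketch below shows one possible in-memory representation. It is not how any particular implementation works. It assumes chunks are stored as uncompressed little-endian float32 bytes in row-major order, that a plain dict stands in for the store, and that a decoded chunk is a flat tuple. `encode_chunk`, `decode_chunk`, and `get_element` are hypothetical names. Only one chunk is decoded at a time, so memory use in this sketch is bounded by the chunk size and not by the array size.

```python
import struct


def encode_chunk(values):
    """Pack a flat sequence of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_chunk(raw, chunk_shape):
    """Unpack stored bytes into a flat tuple of floats (row-major order)."""
    count = 1
    for extent in chunk_shape:
        count *= extent
    return struct.unpack(f"<{count}f", raw)


def get_element(flat, chunk_shape, index):
    """Read one element of a decoded chunk by its index within the chunk."""
    offset = 0
    for i, extent in zip(index, chunk_shape):
        offset = offset * extent + i  # row-major (C order) linearisation
    return flat[offset]


# A dict stands in for the key-value store; the key names one chunk.
chunk_shape = (2, 3)
store = {"c/0/0": encode_chunk([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])}

chunk = decode_chunk(store["c/0/0"], chunk_shape)
value = get_element(chunk, chunk_shape, (1, 2))
print(value)
```

In a functional language the decoded chunk could equally be an immutable array or a lazily read byte buffer. The flat tuple here is only a convenient stand-in.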