Dimension names as core array metadata

alimanfoo commented 4 years ago

Several domains make use of named dimensions, i.e., for a given array with N dimensions, each of those N dimensions is given a human-readable name.

Given the broad utility of this, should we include it in the core array metadata in the v3 protocol? E.g., add a dimensions property to the array metadata document, whose value is a list of strings:

{
    "shape": [10000, 1000],
    "dimensions": ["space", "time"],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [1000, 100]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": "NaN",
    "extensions": [],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

One question this raises is how to handle the case where no names are provided, or only some dimensions are named but not others. I.e., dimension names should probably be optional.
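To make the "no names provided" case concrete, here is a minimal Python sketch of how a reader might treat the property as optional. It is only an illustration under assumptions: the helper name `read_dimension_names` is hypothetical, the metadata document is trimmed to two keys, and representing an unnamed dimension as `None` is one possible choice, not something this proposal defines.

```python
import json

# Hypothetical, trimmed array metadata document using the proposed property.
METADATA = '{"shape": [10000, 1000], "dimensions": ["space", "time"]}'


def read_dimension_names(metadata_json):
    """Return one entry per dimension: a name, or None if unnamed."""
    meta = json.loads(metadata_json)
    ndim = len(meta["shape"])
    names = meta.get("dimensions")
    if names is None:
        # Property omitted entirely: treat every dimension as unnamed.
        return [None] * ndim
    if len(names) != ndim:
        raise ValueError(
            "expected %d dimension names, got %d" % (ndim, len(names))
        )
    return list(names)
```

With this reading, an array that omits the property simply behaves as it does today, with purely positional axes.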

The alternative is that we leave this to the community to define a usage convention to store dimension names in the user attributes, e.g., similar to what xarray currently does using the "_ARRAY_DIMENSIONS" attribute name.

meggart commented 4 years ago

I would very much appreciate having an "official" way to define dimension names. Currently I mimic the xarray conventions in my Julia code, but this feels a bit risky: these conventions are not properly versioned, so a future change in how they are handled could lead to unexpected bugs. So I don't mind whether this is in the core protocol or in some extension, as long as there is a clean way to find out programmatically which convention the dimension names follow.

rabernat commented 4 years ago

I agree with this proposal.

It seems like we definitely want to synchronize this with whatever @DennisHeimbigner, @WardF, and the rest of the Unidata crew decide to do about dimension names.

DennisHeimbigner commented 4 years ago

This crosses a problem discussed in the meeting today. There is a strong feeling that the v3 spec should support asynchronous read and write to the degree possible. This is driven by cloud storage models. One consequence is that it should be possible for a process to directly create and write a variable without having to synchronize with any other process. However, it is unclear how this applies to shared dimensions. Should asynchronous creation of a named dimension by a process be allowed?

alimanfoo commented 4 years ago

I would suggest that, if we support dimension names in the v3 spec, then they are simply string labels for the dimensions of an array. Nothing else is implied. I.e., if two arrays happen to use the same name for a particular dimension, then at the level of the v3 protocol, that does not imply anything. It could mean that the two arrays have a "shared dimension" in the netCDF sense, it could just be coincidence, at least as far as a vanilla implementation of the v3 protocol is concerned.

A library that supports the full netCDF data model might then choose to treat these dimension names as names for shared dimensions, that would be fine and up to the netCDF layer implementation to manage.

Hope that makes sense.

DennisHeimbigner commented 4 years ago

However, the dimension name and size must be stored in the metadata independent of any variable. So adding a dimension may interfere with asynchronicity.

alimanfoo commented 4 years ago

However, the dimension name and size must be stored in the metadata independent of any variable. So adding a dimension may interfere with asynchronicity.

I may need some help from @rabernat here, there's a few different "dimensions" to this problem (sorry for the very bad pun :-)

Note that in this proposal I am simply proposing a metadata property for giving names to the dimensions (axes) of an array. Perhaps the property should be called dimension_names to make that clear. In any case, there is no implication that these dimensions are shared with any other arrays.

E.g., with this feature I could create an array with shape (10, 5) and name the dimensions ("foo", "bar"). In the zarr protocol, it would be totally fine to create another array with shape (100, 5) and name the dimensions ("foo", "qux"). I.e., creating each of these arrays is an independent operation, and the names are just labels for the axes of the arrays, not necessarily shared.
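A small Python sketch of the example above, treating the metadata as plain dicts. The key name `dimension_names` follows the suggestion in this comment, and the helpers `axis_of` and `extent_of` are hypothetical names used only for illustration. The point is that the same label can resolve to different extents in different arrays without any conflict.

```python
# Two independently created arrays; the labels are just per-array axis names.
array_a = {"shape": [10, 5], "dimension_names": ["foo", "bar"]}
array_b = {"shape": [100, 5], "dimension_names": ["foo", "qux"]}


def axis_of(meta, name):
    """Return the axis index carrying the given label in this one array."""
    return meta["dimension_names"].index(name)


def extent_of(meta, name):
    """Return the extent of the labelled axis, read from this array's shape."""
    return meta["shape"][axis_of(meta, name)]
```

Here `extent_of(array_a, "foo")` and `extent_of(array_b, "foo")` differ, and nothing at this level treats that as an error.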

I.e., a vanilla zarr implementation would just offer the ability to provide names for the dimensions (axes) of an array, and might show those names when providing a visual representation of the array, but that would be it.

Now, a higher-level library implementing the netCDF data model might choose to interpret these as names for shared dimensions, under certain circumstances. I.e., if two arrays within the same group both have the name "foo" for one of their dimensions, then assume they are referring to a shared dimension.

This is similar to what xarray does currently. The main difference is that xarray uses an attribute called _ARRAY_DIMENSIONS, whereas this proposal offers a standard metadata property called dimensions (or dimension_names) which might be used for that purpose. There is a slight difference though, in that xarray knows that the _ARRAY_DIMENSIONS attribute is always supposed to indicate names for shared dimensions. I.e., there is stronger semantics for _ARRAY_DIMENSIONS than for the proposed dimensions array metadata property.

Perhaps it would be easier to avoid potential confusion, and for zarr to not try to cross into the netCDF space, and rather allow that to be dealt with via a set of usage conventions that properly deal with the netCDF semantics, such as the xarray approach or the nczarr approach.

alimanfoo commented 4 years ago

However, the dimension name and size must be stored in the metadata independent of any variable.

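Also noting that IIUC this is not necessarily true, e.g., the xarray approach (http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification) does not separately store dimension names and sizes. This is different from the nczarr proposal (https://drive.google.com/file/d/1UUGcQMpWqKllMdRFCu97CoL7fB_GWXvg/view). Note that I have no opinion on which of these two approaches is best, just noting the difference.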

rabernat commented 4 years ago

Note that in this proposal I am simply proposing a metadata property for giving names to the dimensions (axes) of an array. Perhaps the property should be called dimension_names to make that clear. In any case, there is no implication that these dimensions are shared with any other arrays.

:+1: This is how I have been thinking of it. Rather than calling the axes 0, 1, 2, we can call them time, lat, lon. Additional extensions or applications could decide to interpret this in different ways, such as in the netCDF data model.

However, the dimension name and size must be stored in the metadata independent of any variable.

I don't see why. The dimension size is determined by the shape of the array.

DennisHeimbigner commented 4 years ago

I am glad we have these kinds of discussions; I am to some degree captive of the historical development of netcdf and its assumptions.

Does this interpretation seem reasonable WRT the xarray model?

1. The definition of a named dimension is distributed (an important word) to all of the variables which use it. There is no single centralized definition as in netcdf.
2. The costs for the xarray approach are:
   a. inconsistency between the distributed named dimension definitions is possible;
   b. the cost of storing the named dimension info in multiple variables.

The cost for (2b) seems very small and so is not a big issue. The (2a) case is no different than any other hidden data used in, say, netcdf. Presumably the inconsistencies can only occur if the dataset is modified outside of the library.
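For illustration, the inconsistency in (2a) could be detected by a layer above zarr with a scan like the following Python sketch. It assumes each array's metadata is available as a dict with `shape` and `dimensions` keys, and the function name `find_inconsistent_dimensions` is hypothetical; neither xarray nor netCDF is claimed to work this way.

```python
def find_inconsistent_dimensions(arrays):
    """Map each dimension name used with more than one size to its sizes.

    `arrays` maps an array name to a metadata dict holding "shape" and
    "dimensions" lists of equal length (an assumed layout).
    """
    sizes = {}
    for meta in arrays.values():
        for dim, size in zip(meta["dimensions"], meta["shape"]):
            sizes.setdefault(dim, set()).add(size)
    return {dim: sorted(found) for dim, found in sizes.items() if len(found) > 1}


# A group whose "lon" array was modified outside of the library.
group = {
    "temp": {"shape": [5, 4], "dimensions": ["lat", "lon"]},
    "lat": {"shape": [5], "dimensions": ["lat"]},
    "lon": {"shape": [3], "dimensions": ["lon"]},
}
```

Note that this check needs to read every array in the group, which is the kind of search discussed in the following paragraphs.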

Since in netcdf, dimensions are scoped by groups, one would need to use the fully qualified names (FQNs) for named dimensions, e.g., /g1/g2/dim1.

It would seem that some kind of search is needed to guarantee dimension name uniqueness. It potentially requires looking at all variables within the group part of the FQN of the new dimension to ensure that the name is unique. Does xarray do a similar search when a client defines a new dimension?

In any case, the distributed approach is attractive because it potentially allows asynchronous definition of dimensions if certain constraints can be met so that search can be avoided or minimized.

Comments?

joshmoore commented 4 years ago

Thinking out loud somewhat, I wonder if restricting dimension_names to [a-zA-Z0-9_] for the moment wouldn't be prudent. That would allow nice Python referencing and would allow a potential future extension to pathed (/) or dotted (.) nomenclature for looking up named dimensions in the future?
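As a rough sketch of what such a restriction would look like in practice, the check below uses Python's standard `re` module. The function name `is_valid_dimension_name` is hypothetical, and requiring at least one character is an assumption on my part, since the character class alone says nothing about empty names.

```python
import re

# Character class from the suggestion above; "+" (non-empty) is an assumption.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9_]+")


def is_valid_dimension_name(name):
    """Return True if the name uses only ASCII letters, digits and "_"."""
    return NAME_PATTERN.fullmatch(name) is not None
```

Names containing "/" or "." are rejected, which keeps those characters free for a later pathed or dotted lookup scheme.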

Carreau commented 4 years ago

Update RFC to say this is something we'd like input on.

jbms commented 2 years ago

I would also like to see built-in support for dimension names, and would also suggest that, for simplicity, the zarr specification itself make no assumptions about "shared dimensions" between multiple arrays.

Aside from possible constraints on the allowed characters, I think that empty labels should be allowed (and indicate an unnamed dimension), and non-empty labels must be distinct. Not specifying the dimension names at all would be equivalent to specifying all empty strings as the dimension labels.
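A minimal Python sketch of these three rules, for concreteness. The function name `normalize_dimension_names` is hypothetical and the error handling is just one possible choice.

```python
def normalize_dimension_names(shape, names=None):
    """Apply the rules above: "" means unnamed, non-empty labels are distinct,
    and omitting the names is the same as giving all empty strings."""
    if names is None:
        return [""] * len(shape)
    if len(names) != len(shape):
        raise ValueError("need exactly one label per dimension")
    non_empty = [name for name in names if name != ""]
    if len(set(non_empty)) != len(non_empty):
        raise ValueError("non-empty dimension labels must be distinct")
    return list(names)
```

Under these rules `["x", "", ""]` is acceptable (two unnamed dimensions), while `["x", "x"]` is not.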

d-v-b commented 2 years ago

What's the advantage of allowing empty labels?

jbms commented 2 years ago

Given that dimension names would be optional, it seems natural to me to allow that optionality on a per-dimension basis. E.g. maybe you are computing some sort of multiplication or partial reduction between two zarr arrays A and B, where A has labels and B does not. If the result has some dimensions corresponding to dimensions of A and some dimensions corresponding to dimensions of B, we would like to preserve the dimension labels from A without having to invent fake labels for B.

However, I don't feel too strongly about allowing empty labels.
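To illustrate the label propagation I have in mind, here is a small Python sketch that only tracks labels, not data. It assumes the empty string marks an unnamed dimension, and the helper names `outer_labels` and `reduce_labels` are hypothetical.

```python
def outer_labels(a_names, b_names):
    """Labels of an outer product: A's dimensions followed by B's."""
    return list(a_names) + list(b_names)


def reduce_labels(names, axes):
    """Labels left after reducing (e.g. summing) over the given axes."""
    return [name for i, name in enumerate(names) if i not in axes]


a_names = ["x", "y"]   # A is labelled
b_names = ["", ""]     # B has no labels
```

The result of `outer_labels(a_names, b_names)` keeps "x" and "y" from A and leaves B's dimensions unnamed, with no need to invent placeholder labels.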

DennisHeimbigner commented 2 years ago

I assume that this would operate like _ARRAY_DIMENSIONS in that the size of the named dimension is determined from the corresponding position in the "shape" key. This of course can lead to inconsistency in the size of a named dimension. Not surprisingly, I prefer the netcdf approach where the name and size are declared separately from any variable so that inconsistency is not possible.

DennisHeimbigner commented 2 years ago

Another point. Unless you require all dimension names to be "global", you will need to use fully qualified names (FQNs) for dimension names. So one might have something like this:

"dimensions": ["/dim1", "/grp1/grp2/dim2"]

DennisHeimbigner commented 2 years ago

WRT anonymous dimensions. One approach is to merge the shape and dimension keys and make dimension names be JSON strings and anonymous dimensions be integers. This avoids empty labels.

jbms commented 2 years ago

If we allow anonymous dimensions, then I would say they indeed have to be specified by their index rather than name, but of course named dimensions could also be specified by index.

And in many contexts, e.g. for display to a user, I agree that it would be very natural to display just the index in place of the name for anonymous dimensions.

Although the dimension names could be quoted to avoid ambiguity, it might also be good to disallow dimension names that consist only of digits 0-9.

However, I'm unclear exactly what you are proposing as far as having dimension names be either strings or integers. Would that just be a concern of a specific implementation, rather than the zarr spec itself?

Also as far as referencing dimensions by path, as far as I can tell nothing in the current spec requires referencing dimensions; I suppose you are thinking from the context of an extension like ome-zarr or a version of netcdf built on top of zarr.

While I agree that the netcdf data model makes a lot of sense in many cases, I'm not sure how well the unique dimension names constraint / consistent size for every named dimension constraint fits with all intended uses of zarr v3. I guess users could always work around that issue by putting each zarr array in a separate zarr repository, but users might wish to get other data organizational advantages of having multiple arrays in a single zarr repository without constraining themselves to the netcdf data model.

DennisHeimbigner commented 2 years ago

Although the dimension names could be quoted to avoid ambiguity, it might also be good to disallow dimension names that consist only of digits 0-9.

That is the reason I made the string vs number distinction. And the fact that netcdf allows dimension names that are all digits.

DennisHeimbigner commented 2 years ago

Also as far as referencing dimensions by path, as far as I can tell nothing in the current spec requires referencing dimensions; I suppose you are thinking from the context of an extension like ome-zarr or a version of netcdf built on top of zarr.

I do not understand this comment. I was referring to a case where we have a variable v1 defined in a group /g1 (i.e., just below the root group), something like this:

"shape": [1, 17], "dimensions": ["dim1", "dim17"]

Suppose we have another variable v2 in group /g2:

"shape": [17], "dimensions": ["dim17"]

How do we know that the two dim17's refer to the same dimension? I would prefer that "dim17" be replaced with "/g1/dim17" so that it is clear that the same dimension is being used.

Of course, this assumes one wants the shared dimension name semantics to matter, but that, of course, is the whole point of named dimensions.
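A short Python sketch of the disambiguation I mean. The rule that an unqualified name is resolved against the group containing the variable is an assumption made for this illustration, and the helper name `qualify` is hypothetical.

```python
def qualify(group, name):
    """Turn a dimension name into a fully qualified name (FQN).

    Names starting with "/" are taken as already qualified; anything else
    is resolved relative to the group that contains the variable.
    """
    if name.startswith("/"):
        return name
    return group.rstrip("/") + "/" + name
```

With simple names, v1 in /g1 resolves "dim17" to /g1/dim17 while v2 in /g2 resolves it to /g2/dim17, so the two are different dimensions. If v2 instead declares "/g1/dim17", both variables clearly refer to the same one.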

jbms commented 2 years ago

It seems like just using a unique dimension name might be more natural than specifying a dimension by reference to another array, but I am not sure.

Certainly netcdf shared dimension semantics are applicable in some applications, but I think there are other applications where dimension names are useful but the constraint that all dimensions with a given name should have the same extent is not useful. For example:

- A multiscale dataset, where you have arrays storing the data at multiple scales. Here the dimension names could indicate the correspondence between the dimensions of the arrays at different scales, but the extents will of course be different.
- A large collection of images, with dimensions x, y, c, and a convolutional neural network model with input dimensions x, y, c. All of the images may have different x, y dimensions but you want to apply the neural network model to them, and be sure you aren't accidentally transposing x and y.
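For the second use case, a sketch of how labels alone (with no constraint on extents) guard against an accidental transpose. The helper name `permutation_to` is hypothetical, and the code assumes every dimension is named and the names are distinct.

```python
def permutation_to(names, wanted):
    """Return the axis order that rearranges `names` into `wanted`.

    Raises ValueError if the two label sets differ, so a mismatch between
    an image and the model input is caught instead of silently transposed.
    """
    if sorted(names) != sorted(wanted):
        raise ValueError("labels %r do not match %r" % (names, wanted))
    return [names.index(label) for label in wanted]


model_input = ["x", "y", "c"]
image_dims = ["y", "x", "c"]   # stored with x and y swapped
```

The resulting axis order can be passed to whatever transpose operation the array library provides; the extents of x and y never enter the check.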

DennisHeimbigner commented 2 years ago

It seems like just using a unique dimension name might be more natural than specifying a dimension by reference to another array, but I am not sure.

In a sense I agree which is why netcdf declares dimensions separately from variables. But it appears that this community would rather declare the dimensions as part of the variable declaration.

DennisHeimbigner commented 2 years ago

Your examples still prove my point. You are assuming that the dimensions with the same name are semantically the same. The issue is being able to use the same simple name (e.g. "x") in multiple places with different extents. But you still need to disambiguate those multiple declarations and using the fqn is IMHO the best way to do that.

DennisHeimbigner commented 2 years ago

I think that coordinate variables are important in this discussion.

Suppose we have the following:

dimensions:
    lat = 5, lon = 4;
variables:
    float temp(lat, lon);
    float lat(lat);
    float lon(lon);

The temp variable represents the temperature at a given latitude and longitude.

The longitude values are, say, -1deg. thru 2deg. and the latitude values are, say, -0.5deg. thru 1.5deg. However the lat dimension runs from 0 thru 4 and lon runs from 0 thru 3. The so-called coordinate variables map the raw indices to the actual lat and lon values of the coordinates. So we have:

lat = -0.5, 0.0, 0.5, 1.0, 1.5 ;
lon = -1.0, 0.0, 1.0, 2.0 ;

This concept of coordinate variables is extremely useful but it relies on the use of shared names to indicate shared semantics.
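The mapping can be sketched in a few lines of Python using the values above. The dict layout and the helper name `coordinates_of` are assumptions made for illustration and are not how the netcdf library stores anything.

```python
# Coordinate variables: one value per index along the dimension of the same name.
coords = {
    "lat": [-0.5, 0.0, 0.5, 1.0, 1.5],
    "lon": [-1.0, 0.0, 1.0, 2.0],
}

# Dimension names of each variable, as in the declarations above.
dims = {"temp": ("lat", "lon"), "lat": ("lat",), "lon": ("lon",)}


def coordinates_of(variable, index):
    """Map a raw index tuple of `variable` to actual coordinate values."""
    return {dim: coords[dim][i] for dim, i in zip(dims[variable], index)}
```

The lookup `coords[dim]` only works because the dimension name used by temp is the same name as the coordinate variable, which is the shared-name semantics I am referring to.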

jbms commented 2 years ago

I agree that shared names to indicate "shared semantics" in some sense is the point of named dimensions, but I think exactly what those "shared semantics" are depends on the application.

If zarr were to use the netcdf data model, where shared name means shared domain, then how do you propose to deal with the use case of a single zarr repository where the root group contains a collection of arrays named sample0, sample1, ..., sampleN? Each of these samples is a 3-d xyc image, but they don't all have the same x and y dimensions. How would we assign dimension names in this case?

DennisHeimbigner commented 2 years ago

In netcdf, you put the various dimensions in different groups (possibly with the relevant variables).

jstriebel commented 1 year ago

Crosslinking https://github.com/zarr-developers/zarr-specs/pull/149#discussion_r927300522

jstriebel commented 1 year ago

Resolved via #162.

zarr-developers / zarr-specs

Dimension names as core array metadata #73