zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Support for non-zero origin #122

Open jbms opened 2 years ago

jbms commented 2 years ago

I would love to have support in zarr (e.g. spec version 3) for a non-zero origin, i.e. an origin field in the metadata file that specifies a non-zero starting position for each dimension.

In the volume electron microscopy domain, for example, it is natural to have many different arrays corresponding to sub-regions of a common coordinate space, and of course this occurs in many other application domains.

I am proposing that if you specify an origin of [1, 2, 3], then to read the value at the origin you would use the syntax:

arr[1, 2, 3]

and arr[0, 0, 0] would be out of bounds.

This is not something that can be reasonably supported as an extension via a custom attribute: if you attempt to load the array using a zarr implementation that does not support a non-zero origin, the indexing would behave differently, which would lead to confusion.

This was also mentioned in this OME discussion: https://github.com/ome/ngff/pull/57#issuecomment-938533654

OME proposes to add support for specifying transformations, but those operate in continuous coordinate spaces and for compatibility with zarr would not affect integer indexing.

The implementation is fairly simple; the main issue for zarr-python is that Python and NumPy allow negative indices to refer to positions relative to the end of a dimension, and for compatibility that behavior must be retained in zarr.

The solution I would propose is that for dimensions with an origin of 0, negative indices retain the current behavior of referring to a position relative to the end. For dimensions with a non-zero origin, negative indices receive no special treatment. For example, if the origin is:

[0, 10, -100]

and the shape is [10, 20, 30],

then arr[-1, 10, -90] refers to position 9, 10, -90 and arr[0, -1, 0] is out-of-bounds.
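
To make the proposed rule concrete, here is a minimal sketch (the normalize_index helper is hypothetical and not part of zarr-python; it maps a user-facing index to a zero-based storage index):

# Hypothetical sketch of the proposed indexing rule; not part of zarr-python.
def normalize_index(index, origin, shape):
    storage = []
    for i, o, n in zip(index, origin, shape):
        if o == 0 and i < 0:
            i += n  # zero-origin dimensions keep Python's end-relative convention
        if not (o <= i < o + n):
            raise IndexError(f"index {i} out of bounds for interval [{o}, {o + n})")
        storage.append(i - o)
    return tuple(storage)

# With origin [0, 10, -100] and shape [10, 20, 30]:
normalize_index((-1, 10, -90), (0, 10, -100), (10, 20, 30))  # -> (9, 0, 10), i.e. positions (9, 10, -90)
normalize_index((0, -1, 0), (0, 10, -100), (10, 20, 30))     # raises IndexError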

jakirkham commented 2 years ago

FWIW this idea has come up before ( https://github.com/zarr-developers/zarr-specs/issues/9 )

joshmoore commented 2 years ago

Quick summary of (my impression of) the discussion during yesterday's call:

jbms commented 2 years ago

Note that this is a feature supported by the neuroglancer precomputed format and has proved to be quite useful there.

OME-Zarr seems to be focused specifically on microscopy imagery-related use cases and additionally seems to be oriented towards continuous-space transformations.

I think it would be quite valuable to have discrete indexing support at the zarr level, so that indexing into the array with normal integer indices respects the offset.

As noted in #9 this also makes it very easy to extend data in the negative direction, or shift the origin after it is initially written.

It is true that you can apply translations at a higher level --- e.g. tensorstore index transforms (https://google.github.io/tensorstore/index_space.html#index-transform) provide a JSON representation for a wide class of indexing operations (essentially anything that is supported by numpy indexing, plus translation and transposition). And indeed a tensorstore "spec" can provide a JSON-serializable representation of the location of a zarr array with an index transform applied. However, in many cases software is designed to load a zarr array from a given path, and specifying just a path is much simpler than specifying both a path and an index transform. In conjunction with #129 this would enable a wide class of transformations of the coordinate space of the array solely by manipulating the array metadata, without introducing significant implementation complexity.
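
For reference, a tensorstore spec along those lines might look roughly like the following (the path is a placeholder; the transform presents a zero-origin 20x30 zarr array with bounds [10, 30) and [-100, -70)):

{
  "driver": "zarr",
  "kvstore": {"driver": "file", "path": "/data/example/"},
  "transform": {
    "input_inclusive_min": [10, -100],
    "input_exclusive_max": [30, -70],
    "output": [
      {"input_dimension": 0, "offset": -10},
      {"input_dimension": 1, "offset": 100}
    ]
  }
}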

"Coordinate arrays" as supported by netcdf and xarray, where for a given dimension "x" of size N you have an additional array named "x" of shape [N] that specifies a "label" for each position along dimension "x", are of course very useful if your array corresponds to samples over an irregular grid. But non-zero origins could actually be particularly valuable in conjunction with coordinate arrays if you have an array that corresponds to just an interval of a larger array, as otherwise you would need to make a copy of a portion of the coordinate array, and there would be no way to directly indicate the correspondence.

fcollman commented 2 years ago

Hi, I just wanted to weigh in on the user feedback side of things... I work at the Allen Institute, where I manage several large-scale EM datasets and associated segmentations. All of our segmentations end up using outputs of alignment pipelines that don't place all the data at 0,0 (and sometimes shift it into negative voxels), so forcing us to store data starting at 0,0 would incur significant costs or require tracking additional metadata to shift the reconstructed volumes relative to the naive output of the previous stage. This added complexity and metadata would create adoption friction and reduce the usability of the library for us, so I'd strongly advocate for Jeremy's suggestion.

stuarteberg commented 2 years ago

I'd like to voice my support as well, for the same reasons @fcollman stated above. I work for a different institution, but we work with very similar data and have similar needs.

jakirkham commented 2 years ago

Basically this is a translation transform. Wonder if there are other transform needs that would make sense to capture (so as not to overfit). For example affine transforms (could come up in relation to units independent from the indices). Other coordinate mappings (different globe representations for example)?

Not to say we should try to solve all of these (or in this issue), but leaving the door open to solving them seems useful (should they become important).

fcollman commented 2 years ago

To get more concrete about a use case: say I have a 2D tile alignment pipeline that lets me define some high-order polynomial spline transform on a series of tiles and sections, and now I want to materialize it into a zarr array for faster viewing, access, and subsequent processing. If the interface lets me simply iterate over my coordinate system and write data in chunks into it, you've simplified my life. If you ask me to define a zero-based block, write data into it, and then define a transformation that puts the data back into the original coordinate frame, I think you've complicated my life. When you say transformations I worry that you are talking about the latter and not the former, if that makes sense. It's easy and lightweight to just tack some affine transform metadata on top of a file, but if the UI doesn't make reading/writing in that transformed space easy then it's added complexity... given how spatial chunking works, it feels difficult to support things other than translations.

rabernat commented 2 years ago

I would like to encourage everyone on this thread to give another look at xarray for this application. Xarray is very close to completing a major refactor to enable much more flexible, pluggable indexes: https://github.com/pydata/xarray/blob/main/design_notes/flexible_indexes_notes.md Nearly every use case described here would be supported by this new protocol, via a custom indexer.

The nice thing about Zarr's current indexing is that it is strictly logical. I would be wary about bringing any domain-specific notions of coordinate systems into Zarr itself.

Edit: note that Xarray has first-class support for reading/writing Zarr.

constantinpape commented 2 years ago

Basically this is a translation transform. Wonder if there are other transform needs that would make sense to capture (so as not to overfit). For example affine transforms (could come up in relation to units independent from the indices). Other coordinate mappings (different globe representations for example)?

Not to say we should try to solve all of these (or in this issue), but leaving the door open to solving them seems useful (should they become important).

Related to this: ome-ngff has recently added support for simple transformations (scale and translation), and the work on adding affines and more complex transformations is ongoing: https://github.com/ome/ngff/issues/94.

Personally I find introducing transformations in the zarr spec confusing, especially affines etc. that map to a continuous rather than a discrete space.

I personally also would rather use a translation to specify a non-zero offset and not do it in the zarr spec, since, at least to my understanding, this offset is always relative to other data, e.g. when placing a tile in a larger dataset: it might still be helpful to look at the tile in its own (zero-based) coordinate system instead of enforcing the origin at the zarr level.

That being said, I would not object to adding the non-zero origin to the spec in principle, it's just something I personally would not use for the examples brought up here.

d-v-b commented 2 years ago

This offset is a transform, so it should probably be specified as part of a broader family of transforms, e.g. the ongoing OME-NGFF effort.

As for changing the semantics of array.__getitem__ ... this is what xarray and other libraries that wrap zarr are for, no?

constantinpape commented 2 years ago

As for changing the semantics of array.__getitem__ ... this is what xarray and other libraries that wrap zarr are for, no?

I agree with @d-v-b on this point, changing the semantics of __getitem__ depending on whether there is a non-zero offset does not sound like a good idea in zarr-python (or any other implementation) itself. This could lead to many issues if users are not aware they have an array with non-zero offset.

william-silversmith commented 2 years ago

I ran into this problem while developing CloudVolume for reading/writing the Precomputed format. Non-zero offset arrays can sometimes produce subtle shifts due to truncation effects when downsampling. Nonetheless, it does make it easy to figure out the correspondence between what you've written and the files that are materialized.

Since CloudVolume prioritizes the Precomputed format, I came at this problem from the opposite angle: should I support negative indices? Early on, in 2017, I considered adding a flag to CloudVolume that would change the interpretation of slices to support them, but no one has ever asked me for this support. For smaller datasets I think negative indices are more important, since the data on one side of the image is more spatially localized with respect to the data on the other side and thus has some kind of useful relationship.

https://github.com/seung-lab/cloud-volume/issues/13

Just reporting my experience developing a very similar tool.

jbms commented 2 years ago

@jakirkham You raise a good point that we should consider more generally what types of coordinate transformations make sense to support. I hope we can agree on the following criteria:

Certainly continuous coordinate transformations are very useful, but I would say those should be left to other standards like OME-Zarr. And non one-to-one or non-onto transformations, like striding along a dimension, or indexing using a separate index array, are also very useful, but I would say those should really be left to a separate "transformed view" on top of the base array, since e.g. with non-onto transformations you would end up with the strange situation where elements may take up space inside chunks but cannot be addressed.

These constraints basically limit us to:

I don't have a strong opinion about reflection support --- I don't have a lot of use cases in mind, but it also isn't hard to support. It is also something that could be more easily added as an extension later, because it is really just changing the internal representation, not the public "interface". In contrast, allowing non-zero origins is a more fundamental change.

@rabernat I would agree that libraries like xarray and tensorstore that allow for virtual views of existing arrays are extremely useful. And of course offsets can also be implemented by custom code in an application. But xarray is a library, not a data format. So you would need some out-of-band mechanism (such as a custom attribute) to communicate the offset. This proposal would essentially just amount to standardizing that attribute.

@constantinpape I would absolutely agree that zarr should not include any continuous coordinate transforms, like affine transforms --- that should be left to other standards like OME-Zarr. But on the other hand I would say that the continuous coordinate transforms of OME-Zarr are not well-suited for representing integer offsets that would be respected by integer indexing operations. For example, given an OME-Zarr transformation of [{"type": "scale", "scale": [2.3, 4.1, 5.2]}, {"type": "translation", "translation": [227.7, 122.99, 41.6]}], we would have to infer that we should apply an integer offset of [99, 30, 8] using floating point division, which has inherent precision problems.
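
To make the precision problem concrete, using the numbers above (this is only an illustration of the inference a reader would have to perform):

scale = [2.3, 4.1, 5.2]
translation = [227.7, 122.99, 41.6]

# Recovering integer offsets from a continuous transform means floating-point
# division plus rounding; nothing in the metadata guarantees the quotient is an
# integer. Here the middle value is not even an exact multiple of its scale, so
# a reader has to guess whether the intended offset is 30 or the translation is
# genuinely fractional.
offsets = [t / s for s, t in zip(scale, translation)]   # ~[99.0, 29.9976, 8.0]
rounded = [round(o) for o in offsets]                   # [99, 30, 8]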

The other point I think is relevant is that in a sense the origin in a zarr array is already somewhat arbitrary, since you can leave some chunks unwritten. As long as you want a non-negative origin in all dimensions, you can instead set shape to be the upper bound, and just write data to the region you want to use. (In fact I imagine this may be a common workaround to the present lack of support for a non-zero origin.) You just don't have any standard way to actually indicate what the real origin is. In fact we could imagine a simpler version of zarr that also does not specify shape --- all dimensions are simply unbounded. From this view, allowing both the lower and upper bounds to be recorded for each dimension seems quite natural.
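
For example, hypothetical array metadata that records the lower bound explicitly might look like the following (an origin field does not exist in any current zarr spec; it is shown purely for illustration, with the other required fields omitted):

{
  "shape": [10, 20, 30],
  "origin": [0, 10, -100]
}

Together, these would define the index interval [origin[d], origin[d] + shape[d]) for each dimension d.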

And to be clear, when you use zarr-python to read a sub-region from a zarr array with (as proposed) a non-zero origin, you would still get back a regular zero-origin numpy array that you can operate over. And presumably you could still use xarray or tensorstore or some other "virtual view" library to translate the origin back to 0 if that is what you want for a particular purpose. You would simply have the option to express a more natural origin in the array metadata.

@constantinpape As for changing the behavior of __getitem__, that would indeed be the whole point. I agree that it could be a point of potential confusion for users that are unaware of the support for non-zero origins. But I don't think the learning curve is too steep --- once users are aware of the possibility, it is a natural and simple concept. And I think an explicit non-zero origin is less confusing than the workaround where the data is only stored starting at some origin, but it is not ever explicitly indicated.

constantinpape commented 2 years ago

Thanks for clarifying @jbms. I agree with most of your points, just a few comments:

d-v-b commented 2 years ago

a few points in response to @jbms:

As for changing the behavior of __getitem__, that would indeed be the whole point. I agree that it could be a point of potential confusion for users that are unaware of the support for non-zero origins. But I don't think the learning curve is too steep --- once users are aware of the possibility, it is a natural and simple concept. And I think an explicit non-zero origin is less confusing than the workaround where the data is only stored starting at some origin, but it is not ever explicitly indicated.

The proposed change in the semantics of __getitem__ would make zarr-python deviate from the norm set by python built-in sequences, numpy arrays, cupy arrays, dask arrays, etc. I think consistency with the rest of the ecosystem is pretty valuable, and I don't think the convenience afforded by offset-aware indexing is worth breaking that consistency.

An alternative approach would be to create a new method (like xarray.DataArray.sel()) or a property on zarr arrays whose __getitem__ supports offset-aware indexing, e.g.

# indexing in translated coordinate space
array.o[-100]

...but if you want this, you should just use xarray (see xarray.DataArray.sel), unless there's a deeply compelling reason not to.
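
For completeness, here is roughly what label-based selection looks like with xarray today (the array contents are made up):

import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(30),
    dims=["z"],
    coords={"z": np.arange(-100, -70)},  # labels play the role of the offset
)
data.sel(z=-90)               # value at label -90
data.sel(z=slice(-100, -95))  # label-based slicing (inclusive of both ends)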

But xarray is a library, not a data format. So you would need some out-of-band mechanism (such as a custom attribute) to communicate the offset. This proposal would essentially just amount to standardizing that attribute.

The xarray / netcdf data model solves this problem by treating coordinates as data, and saving those coordinates. There's nothing stopping the connectomics community from simply adopting the same data model.

I totally agree with standardizing the offset attribute -- everyone who would benefit from it should get together and agree on something (or just use the xarray data model). I just don't think that standard belongs in the core zarr spec, nor do I think zarr-python should change because of the needs of a specific community (of which I am a member!).

jbms commented 2 years ago

@d-v-b It is true that changing the semantics of __getitem__ would make it deviate from the norm set by Python sequences and numpy arrays (which was then copied by some other libraries). In particular it would deviate in that for dimensions with negative origins, a negative index would have to refer to an actual negative position, as opposed to counting from end as is the Python convention.

If I understand correctly, xarray does not in fact currently support translation like this, perhaps because xarray also assumes a zero origin. But it sounds like maybe in the future support would be added to make it easier to implement translation?

As far as the xarray/netcdf data model, I agree that it makes a lot of sense for representing samples over an irregular grid. But representing an integer offset by an array containing consecutive integers does not seem very practical to me. Sure, if the number of dimensions is greater than 1 the total amount of space wasted on these coordinate arrays is small. But there is a larger representational issue: a given program may be designed to work only on normal arrays defined over regular grids, not irregular grids.

For example:

  1. You may have a convolutional model that you want to apply to an image. The model is trained to operate with a given fixed input sampling. If you then apply it to an irregular grid, the results won't be valid. So you end up having to load these extra coordinate arrays, just to verify that they in fact contain consecutive integers.

  2. Consider the case where we actually care about a non-zero origin, where we may have a collection of arrays that we are operating on, each with different lower/upper bounds, and their bounds may overlap. If we have a simple integer offset we can easily determine how they correspond/overlap (see the sketch after this list). If we have coordinate arrays, we either have to search through the coordinate array to determine the overlap (and the coordinate array representation itself does not guarantee that they even share any of the same coordinate values at all, even if their domains intersect), or we have to again first verify that the coordinate arrays in fact just contain consecutive integers and then manually perform offset-based indexing.

  3. Consider the case where we actually do have an irregular grid and are representing it using coordinate arrays according to the netcdf/xarray data model. But suppose we again have several arrays defined over sub-regions of this irregular grid. If we have offset support, we can simply have a single coordinate array for each full dimension, and then for each sub-region array specify the origin. Otherwise, we would have to actually assign separate dimension names for each sub-region array, duplicate all of the coordinate arrays, and then when processing the data first verify that all of the coordinate arrays are in fact consistent.
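
To illustrate point 2 above: with explicit integer origins, computing the overlap is a one-liner per dimension (overlap here is a hypothetical helper, not an existing API):

def overlap(origin_a, shape_a, origin_b, shape_b):
    # Per-dimension intersection of the two index intervals; None if disjoint.
    lo = [max(a, b) for a, b in zip(origin_a, origin_b)]
    hi = [min(a + n, b + m) for a, n, b, m in zip(origin_a, shape_a, origin_b, shape_b)]
    if any(l >= h for l, h in zip(lo, hi)):
        return None
    return lo, [h - l for l, h in zip(lo, hi)]

overlap((0, 10), (100, 100), (50, -20), (100, 100))  # -> ([50, 10], [50, 70])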

I do agree that zarr should not cater to niche interests or functionality specific to just one domain like connectomics. But I think non-zero origins are hardly a niche feature:

The issue with just "standardizing" a custom attribute but otherwise using the zarr format is that it will inevitably lead to confusion: someone will talk about location (x, y, z) in the array, and depending on whether they opened it using zarr-python or tensorstore that will mean a different thing. That is why I did not add support for any such attribute to the tensorstore zarr driver previously. Even if this attribute is somehow standardized, unless we also write the data in such a way that it is incompatible with zarr-python, this potential for confusion exists. (Of course we also have the problem that if they opened it using zarr.jl they will have reversed the coordinates, but I hope we can avoid that problem for zarr v3 as well.)

rabernat commented 2 years ago

But xarray is a library, not a data format. So you would need some out-of-band mechanism (such as a custom attribute) to communicate the offset.

I would say that Xarray is a data model, and that model can be serialized in many different data formats (NetCDF, HDF5, Zarr, etc). It is fundamentally based on the NetCDF data model and leverages heavily the CF conventions. This means a great many custom attributes with clearly specified meanings. For example, the full complexity of the CF coordinate systems is "supported" by Zarr, but completely via standard groups and arrays plus attributes. The metadata are decoded by Xarray (and then re-encoded on write).

I really think that the way to go here architecturally is to leverage the attributes. Then it becomes a social problem: we have to agree on a spec for this feature. In my recent experience, spec development, particularly with multiple stakeholders from different domains, can be very slow moving compared to just implementing the new feature.

I think Zarr needs a whole library of "micro metadata specs" which are composable, of which non-zero origin could be one.

jbms commented 2 years ago

I have not actually ever used NetCDF but I have read about it a fair bit. My understanding though is that it does not actually support non-zero origins as is proposed here, except via coordinate arrays which in my view are an awkward fit (as already discussed). In fact, it seems to me that non-zero origins may be particularly applicable in conjunction with the NetCDF data model, since otherwise you would have to assign separate dimension names to different integer sub-intervals of the same logical dimension, which I think would be undesirable.

I would agree that any purely "additive" functionality may just as well be an extension that is optionally supported by each implementation.

The problem with non-zero origin support is that while conceptually it is extremely simple, it is not purely additive --- it changes the meaning of indexing, which is a core part of zarr. If a user opens the array with zarr-python and zarr-python doesn't support this non-zero origin extension, everything appears to work except they will be using a different coordinate system which will create confusion. This is also a feature where the implementation complexity is extremely low, so it does not impose any significant burden on implementations.

With zarr v3 there is the "must_understand" option for extensions which could be set to true and then it would fail to open with zarr-python. That would avoid the confusion but then we lose interoperability with zarr-python and perhaps most other implementations, which in large part defeats the point of using the zarr format as opposed to some other custom format.
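
A rough sketch of how that could look in v3 array metadata (the extension name and exact syntax are hypothetical; the v3 extension mechanism is still in draft):

{
  "extensions": [
    {
      "extension": "https://example.org/zarr/non-zero-origin",
      "configuration": {"origin": [0, 10, -100]},
      "must_understand": true
    }
  ]
}

An implementation that does not recognize the extension would then refuse to open the array rather than silently using zero-based indexing.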

I know that many of us are very accustomed to working with zero-origin arrays --- certainly I was up until a few years ago. I would humbly suggest that if you take a little time to consider them you might find they are actually a pretty natural fit for a lot of use cases.

DennisHeimbigner commented 2 years ago

WRT netcdf, this does affect coordinate variables since I think that the coordinate variable has to be checked to ensure that it has the same offsets as the variable to which it is attached.

DennisHeimbigner commented 2 years ago

Let me make sure I understand the transform in this proposal:

  1. it does not have any impact on the actual storage format
  2. it only applies at read/write time, such that given a reference to array(x, y, z) and an offset of (a, b, c), the reference is converted to array(x-a, y-b, z-c) for purposes of reading/writing storage. Correct?

jbms commented 2 years ago

WRT netcdf, this does affect coordinate variables since I think that the coordinate variable has to be checked to ensure that it has the same offsets as the variable to which it is attached.

An alternative would be to instead check that the [inclusive_min, exclusive_max) interval of the coordinate variable contains the [inclusive_min, exclusive_max) interval of the corresponding dimension of the variable to which it is attached, rather than requiring that the intervals are exactly equal. That way, multiple variables defined over different rectangular sub-regions of a coordinate space can share the same dimension names.
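
A sketch of that containment check, with each interval expressed as an (inclusive_min, exclusive_max) pair (coord_covers is a hypothetical helper):

def coord_covers(coord_bounds, var_bounds):
    # True if the coordinate variable's interval contains the variable's
    # interval along the attached dimension.
    (c_lo, c_hi), (v_lo, v_hi) = coord_bounds, var_bounds
    return c_lo <= v_lo and v_hi <= c_hi

coord_covers((-100, 200), (10, 30))   # True
coord_covers((0, 200), (-100, -70))   # False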