zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/

Multiscale convention #125

Open joshmoore opened 2 years ago

joshmoore commented 2 years ago

As a part of the CZI EOSS4 grant, B-Open will be working on the development of a cross-community convention for the multiscale representation. (See original use case and proposal). This work targets interoperability between the bioimaging and geospatial use cases and especially between Zarr and Xarray, where https://github.com/pydata/xarray/issues/4118 proposes an extension to the Xarray library which will enable data structures like multiscales.

This issue serves as an overarching reference for the work. Tasks include:

Related use cases:

jhamman commented 2 years ago

👀 I'm very excited to see this moving forward!

cc @TomNicholas (DataTree) and @freeman-lab / @katamartin (ndpyramid/maps)

joshmoore commented 2 years ago

Status update after a handful of weeks talking with @aurghs, @TomNicholas, @alexamici, and more recently @malmans2 about this:

jbms commented 2 years ago

@joshmoore Can you help me understand a bit more the status of multiscale proposals and support?

I think it might help if we can distinguish between Python APIs and data formats.

I took a quick look at xarray-datatree and I didn't see anything specifically related to multiscale support. Additionally, my understanding is that on top of the base zarr v2 data model, xarray adds only 2 things:

Both of these features seem to be basically completely orthogonal to multiscale.

The xarray-multiscale library seems to be a purely in-memory thing with no specific data format.

As far as the actual multiscale representation on disk in terms of zarr attributes, it sounds like we are talking about the format proposed here: https://github.com/zarr-developers/zarr-specs/issues/50

There was a lot of discussion on that issue so I'm not clear on whether there is an actual final proposed format. But do I understand correctly that the current proposal does not specify downsample factors or offsets for each level? If so I think it is critical that we rectify that as otherwise we must assume e.g. 2x downsampling and zero offset in all dimensions at each level, which obviously is extremely limiting.

I would propose that we rectify it as follows:


Add to each element of the "datasets" array the following properties:

Note: Rational numbers allow non-integer downsample factors to be represented without any loss of precision, but in most cases both the downsample_factors and downsample_offsets will be integers.

At a given downsample level j, for a given dimension i, a given integer coordinate x into dimension i of the array at datasets[j].path corresponds to the interval [datasets[j].downsample_factors[i] * x + datasets[j].downsample_offsets[i], datasets[j].downsample_factors[i] * (x+1) + datasets[j].downsample_offsets[i]) of dimension i of the "base" coordinate space.


Certainly there are other ways to specify this information, but it is critical that we decide on some way to specify it, and I think what I have proposed here is a reasonable and natural choice. Potentially for simplicity the rational number support could be skipped in a first version, and instead integers could be required.
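The interval rule above can be sketched in a few lines of Python. This is only an illustration, not part of any agreed format: the function name `base_interval` is hypothetical, and encoding a rational number as a string such as "3/2" is an assumption made here because it can be parsed exactly.

```python
from fractions import Fraction


def base_interval(x, factor, offset):
    """Return the half-open interval [lo, hi) of base coordinates that
    integer index x covers along one dimension of a downsampled level.

    factor and offset may be ints or strings like "3/2" (an assumed
    encoding for rationals, not one defined in this thread).
    """
    factor = Fraction(factor)
    offset = Fraction(offset)
    lo = factor * x + offset
    hi = factor * (x + 1) + offset
    return lo, hi


# Integer case: 2x downsampling with zero offset.
print(base_interval(3, 2, 0))
# Rational case: exact arithmetic, no floating point involved.
print(base_interval(2, "3/2", "1/2"))
```

Because `Fraction` arithmetic is exact, the relative scale and offset between any two levels can be derived without rounding error.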

d-v-b commented 2 years ago

@jbms can you explain why explicitly enumerating downscaling factors is preferred over the more explicit approach where each dataset declares its scale and offset?

jbms commented 2 years ago

I thought this proposal is being discussed in the context of zarr rather than OME and was not aware of a proposal to specify the offset and scale other than the OME coordinate transforms.

For applications where you intend to do integer indexing rather than interpolation-based continuous indexing, it is important to be able to represent the relative scales and offsets between levels exactly. Normally when dealing with physical units you would use floating point (and it is often reasonable for that purpose since physical units are surely approximate anyway) which means you must rely on inexact floating point arithmetic to determine relative scales and offsets. I suppose in principle if you represented the offsets and scales using an exact representation like rational numbers it would solve the issue. But in general in the integer indexing case it is the downsample factors rather than the units that may be more relevant so storing the units rather than the downsample factors seems less direct, not more direct.

d-v-b commented 2 years ago

"I thought this proposal is being discussed in the context of zarr rather than OME and was not aware of a proposal to specify the offset and scale other than the OME coordinate transforms."

Yes, this is what I had in mind.

But if, as you note, the idea of this issue is to have a multiscale zarr model that works for xarray, OME coordinate transforms are out of scope and probably redundant, since xarray solves the coordinate specification problem by treating coordinates as data. And if this multiscale zarr model entails storing explicit coordinates, it's not clear whether there's any need for special metadata describing downscaling factors.

jbms commented 2 years ago

My own interest is in a multiscale convention/format generally, not specifically related to xarray, and in particular not tied to the use of coordinate arrays, as I think for arrays defined on regular grids, coordinate arrays are a rather indirect and inconvenient representation for {offset + stride * i : 0 <= i < n}.

rabernat commented 2 years ago

Just chiming in to note that:

I'm also not saying that anyone should have to use xarray either. But if Zarr can describe these indexes in its metadata, Xarray should be able to parse them out and turn them into useable index objects (as it does today with dimension coordinates).

We have discussed before (#122) where such index metadata conventions should live: in Zarr user attributes or in a special zarr extension? My question is whether the indexing question is separable from the multiscale convention? Or must these be addressed together?

jbms commented 2 years ago

Thanks for your explanation @rabernat . I think the distinction between zarr user attributes or zarr core attributes is not too important --- it seems quite reasonable to use zarr user attributes, but I would still like to have a standard so that tools can interoperate.

To me it doesn't make sense to define a "multiscale array" as a concept without specifying what the scales actually are. Otherwise you are just saying --- here are some arrays that represent the same data at different scales, but good luck in figuring out how they correspond. I don't see how a tool would make any use of that. So I don't think we can address multiscale without addressing these indexing issues.

But on the other hand perhaps indexing can be addressed before addressing multiscale.

Per the suggestion by @d-v-b that the metadata live in the per-scale array rather than the multiscale metadata attribute, we could simply move downsample_factors and downsample_offsets as I proposed to be array attributes, perhaps renamed given their more general use. Then xarray could read those attributes and produce an index object (after the refactor) or just materialize an actual coordinate array.
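As a rough sketch of the "materialize an actual coordinate array" option, assuming the two attributes are stored per array under the names proposed above and that each coordinate is taken as the lower edge of its cell (both are assumptions for illustration; `materialize_coords` is a hypothetical helper, not an xarray or zarr function):

```python
from fractions import Fraction


def materialize_coords(attrs, shape):
    """Build one explicit coordinate list per dimension from
    downsample_factors / downsample_offsets array attributes.

    Each list is {offset + factor * i : 0 <= i < n}, computed exactly.
    """
    coords = []
    for n, factor, offset in zip(
        shape, attrs["downsample_factors"], attrs["downsample_offsets"]
    ):
        factor, offset = Fraction(factor), Fraction(offset)
        coords.append([offset + factor * i for i in range(n)])
    return coords


# Hypothetical attributes for a 3 x 2 array.
attrs = {"downsample_factors": [2, 4], "downsample_offsets": [0, 1]}
print(materialize_coords(attrs, (3, 2)))
```

A lazy index object would carry the same two numbers per dimension and compute coordinates on demand instead of storing them.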

d-v-b commented 2 years ago

"So I don't think we can address multiscale without addressing these indexing issues."

Agreed. Downsampling data necessarily generates a new coordinate grid for that data; consumers need to know the downsampled coordinate grid in order to meaningfully relate a downsampled image back to the original image. A specification that merely encodes "here are some arrays that all have the same dimension names" isn't of much use without encoding the coordinate grid for each image.

joshmoore commented 2 years ago

jbms commented 14 hours ago: "I think it might help if we can distinguish between Python APIs and data formats."

This is a good point, and things are certainly still intermingled. Much of this issue is about interoperability at the Python level and about supporting xarray's upcoming hierarchical functionality on the Zarr side (what lessons need to be learned, etc.), rather than completely specifying multiscale metadata as we're doing with OME-NGFF.

jbms commented 4 hours ago: "I thought this proposal is being discussed in the context of zarr rather than OME and was not aware of a proposal to specify the offset and scale other than the OME coordinate transforms."

This issue is definitely for the Zarr side and independent of OME-NGFF, but I don't think we're at the stage of building the entire spec yet. One outcome that I think we could shoot for is deciding whether, and if so where, that work will be taken on.

jbms commented 2 hours ago: "My own interest is in a multiscale convention/format generally, not specifically related to xarray, and in particular not tied to the use of coordinate arrays"

Assuming we develop an extension/convention here, one thing that occurs to me is how to balance it against the metadata in the OME-NGFF spec. Is there overlap? Conversion? ... Confusion?

jbms commented 2 hours ago: "To me it doesn't make sense to define a "multiscale array" as a concept without specifying what the scales actually are."

It would be interesting to hear what others have to say on that front. @christophenoel? At least with the xarray api, conceivably there are some operations that would still be useful even without the metadata.

jbms commented 2 years ago

Thanks for your comments @joshmoore and @d-v-b.

Thinking about this a bit more, I think it might be best if OME can just be made to also work for the discrete indexing case (i.e. without needing to use floating point arithmetic) --- then we could just have a single multiscale spec.

The discrete indexing case would just be a special case of the general multiscale array/view where it so happens that no floating point transforms are required.

Here is an idea:

An OME multiscale has a single associated coordinate space, which typically matches the (integer-indexed) coordinate space of the base array.

Coordinate spaces are basically the same as in the current OME spec, except that their units can have arbitrary coefficients, not just powers of 10. That allows us to have coordinate spaces where we can still do useful integer indexing.

For example:

/my_multiscale (group)
attributes:
  {"ome_multiscale": {
    "datasets": ["s0", "s1", "s2"]
  },
   "ome_space": [
      {"name": "x", "unit": "4 nm"},
      {"name": "y", "unit": "5 nm"}
   ]}

/my_multiscale/s0 (array)
attributes:
  {"ome_space": "/my_multiscale"}

/my_multiscale/s1 (array)
attributes:
  {"ome_transformations": {
     "my_transform": {
        "input_space": "/my_multiscale",
        "output_space": ".",
        "type": "scale",
        "scales": [2, 2]
     }
    }
  }

/my_multiscale/s2 (array)
attributes:
  {"ome_transformations": {
    "my_transform": {
      "input_space": "/my_multiscale",
      "output_space": ".",
      "type": "scale",
      "scales": [4, 4]}
    }
  }

For each dataset listed in "datasets", it is required to either have the same coordinate space as the multiscale (as is the case for s0) or there must be a transformation defined between its coordinate space (the "self" coordinate space is indicated by the relative path ".") and the multiscale's coordinate space.

I think this representation also addresses the concern by @d-v-b that each array stand on its own --- if you open /my_multiscale/s1 on its own, you will see that there is a coordinate transform to the "/my_multiscale" coordinate space and can view it under that space if you wish.
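The requirement on each listed dataset can be checked mechanically. The following is a minimal sketch over plain dictionaries that mirror the example attributes; `check_multiscale` is a hypothetical helper, and it deliberately tests only that a transformation links the "." space with the multiscale's space, without interpreting the transform's direction or values.

```python
def check_multiscale(group_path, group_attrs, array_attrs):
    """Return the names of datasets that neither share the multiscale's
    coordinate space nor define a transformation linking "." to it."""
    problems = []
    for name in group_attrs["ome_multiscale"]["datasets"]:
        attrs = array_attrs.get(name, {})
        # Case 1: the array declares the multiscale's space directly.
        if attrs.get("ome_space") == group_path:
            continue
        # Case 2: some transformation connects "." and the multiscale space.
        transforms = attrs.get("ome_transformations", {}).values()
        linked = any(
            {t.get("input_space"), t.get("output_space")} == {group_path, "."}
            for t in transforms
        )
        if not linked:
            problems.append(name)
    return problems


def scale_transform(scales):
    # Builds the transformation entry used in the example above.
    return {"my_transform": {"input_space": "/my_multiscale",
                             "output_space": ".",
                             "type": "scale",
                             "scales": scales}}


group_attrs = {"ome_multiscale": {"datasets": ["s0", "s1", "s2"]}}
array_attrs = {
    "s0": {"ome_space": "/my_multiscale"},
    "s1": {"ome_transformations": scale_transform([2, 2])},
    "s2": {"ome_transformations": scale_transform([4, 4])},
}
print(check_multiscale("/my_multiscale", group_attrs, array_attrs))
```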

joshmoore commented 2 years ago

jbms commented 2 hours ago: "Thinking about this a bit more, I think it might be best if OME can just be made to also work for the discrete indexing case (i.e. without needing to use floating point arithmetic) --- then we could just have a single multiscale spec."

A benefit of unifying the two in one spec would be the possibility of extracting it wholesale in the future for wider adoption.

However, I note that we've now diverted this issue away from its initial purpose of capturing the ongoing integration work with xarray. Can we continue your suggestion in https://github.com/zarr-developers/zarr-specs/issues/125#issuecomment-1064485521, @jbms, along with @bogovicj's work in https://github.com/ome/ngff/issues/101 ?

christophenoel commented 2 years ago

For information only: in the latest GeoZarr version we decided to rely on the 'historical' zoom level conventions: the de facto standard of level 0 as 256x256 pixels covering the entire world (with the default pseudo Plate Carrée non-projection), and the scale doubling at each level, as per https://wiki.openstreetmap.org/wiki/Zoom_levels (see also the ArcGIS note about "A brief history of zoom levels": https://developers.arcgis.com/documentation/mapping-apis-and-services/reference/zoom-levels-and-scale/ ).

The group name indicates the zoom level. These conventions are simple, well supported by viewers, and very similar to the overviews mechanism in Cloud Optimized GeoTIFF (COG).
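The zoom level convention described above reduces to simple arithmetic. As a small sketch, assuming a square 256-pixel grid at level 0 and a doubling of pixel count per axis at each level (the function names are hypothetical and not taken from the GeoZarr spec):

```python
def zoom_level_size(zoom, tile_size=256):
    """Pixel dimensions of the whole-world grid at a given zoom level,
    assuming level 0 is tile_size x tile_size and each level doubles it."""
    side = tile_size * 2 ** zoom
    return side, side


def downsample_factor(zoom, max_zoom):
    """Per-axis downsample factor of a level relative to the finest level."""
    return 2 ** (max_zoom - zoom)


for z in range(4):
    print(z, zoom_level_size(z), downsample_factor(z, max_zoom=3))
```

Under this convention the scale of each level is implied by the group name alone, so no per-level factors or offsets need to be stored.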

I will very soon publish a demonstration video and an OpenLayers extension that supports GeoZarr, including multiscaling.

GeoZarr multiscales: https://github.com/christophenoel/geozarr-spec/blob/main/geozarr-spec.md#multiscales

bogovicj commented 2 years ago

@joshmoore @jbms ,

I'll include something on discrete indexing in the multiscale section next time I edit the ome-zarr spec. Thanks for the mention!

christophenoel commented 2 years ago

Hi. Below are the documentation resources I mentioned: