Multiscale convention - Githubissues

joshmoore commented 2 years ago

As a part of the CZI EOSS4 grant, B-Open will be working on the development of a cross-community convention for the multiscale representation. (See original use case and proposal). This work targets interoperability between the bioimaging and geospatial use cases and especially between Zarr and Xarray, where https://github.com/pydata/xarray/issues/4118 proposes an extension to the Xarray library which will enable data structures like multiscales.

This issue serves as an overarching reference for the work. Tasks include:

Contributing to the hierarchical datasets implementation (“datatree”)
Investigating the possibility of sharing metadata between scales
Demonstrating usage in downstream libraries (e.g. napari, aicsimageio - see set_level)
Submitting any conventions that are developed to a central registry (still to be defined)
Working with the community for approval of the developed convention via an open process (still to be defined)

Related use cases:

https://ngff.openmicroscopy.org/, bioimaging use-case which developed from zarr-developers/zarr-specs#50
https://github.com/christophenoel/geozarr-spec/blob/main/geozarr-spec.md, GeoZarr extension that was raised as https://github.com/zarr-developers/zarr-specs/issues/124

jhamman commented 2 years ago

👀 I'm very excited to see this moving forward!

cc @TomNicholas (DataTree) and @freeman-lab / @katamartin (ndpyramid/maps)

joshmoore commented 2 years ago

Status update after a handful of weeks talking with @aurghs, @TomNicholas, @alexamici, and more recently @malmans2 about this:

Tom's datatree implementation has moved to xarray-contrib (link) and the current plan assumes that consumers will import it directly in the short-term.
datatree seems to cover most known use cases for multiscales like @thewtex's spatial-image-multiscale (see link). Further confirmation of that is very welcome! (I haven't yet looked at @d-v-b's xarray-multiscale.) Integration may be simplified by other xarray libraries like xarray-dataclasses, though they may need slight updates to be datatree-aware.
The multiscales layout begun in #50 requires two small changes in order to not require a new specific backend. Adoption of these changes will be further discussed under https://github.com/ome/ngff/issues/48:
1. Dimension names need to be properly specified. (At the moment, that means via _ARRAY_DIMENSIONS but an upcoming issue will try to broaden that.)
2. The individual zarrays for each scale need to be moved to their own zgroup. This is already (mostly) allowed by the specification as @thewtex has shown but needs to be properly encoded.
@aurghs has prototyped a temporary backend that permits loading existing v0.4 multiscale datasets, if anyone is interested in experimenting with it. Of particular interest would be operations that consumers might like to apply to all scales at the same time.
As @jbms pointed out on today's community call, there's no relationship between these similarly named dimensions in each of these subgroups, but @dennisheimbigner suggested this might be an interesting upstream (i.e. NetCDF) extension.
At the Zarr level, there are currently no plans for core changes or extensions, but the overall convention could be documented centrally for wider re-use.
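
One way to see why the second change above helps is to model a group as a set of named arrays with dimension names. This is only a sketch, assuming that a group is opened as a single xarray Dataset, in which a dimension name can have only one length. The array names, shapes, and the helper `conflicting_dims` are hypothetical and are not part of any proposal in this thread.

```python
def conflicting_dims(arrays):
    """Return dimension names that appear with more than one length.

    `arrays` maps an array name to (shape, dimension names), where the
    dimension names stand in for the _ARRAY_DIMENSIONS attribute.
    """
    sizes = {}
    conflicts = set()
    for shape, dims in arrays.values():
        for dim, length in zip(dims, shape):
            # Remember the first length seen for each dimension name.
            if sizes.setdefault(dim, length) != length:
                conflicts.add(dim)
    return sorted(conflicts)


# All scales as arrays in one group: "y" and "x" each have two lengths.
flat_group = {
    "s0": ((512, 512), ("y", "x")),
    "s1": ((256, 256), ("y", "x")),
}

# One group per scale: each group is internally consistent.
group_s0 = {"image": ((512, 512), ("y", "x"))}
group_s1 = {"image": ((256, 256), ("y", "x"))}

print(conflicting_dims(flat_group))  # ['x', 'y']
print(conflicting_dims(group_s0))    # []
print(conflicting_dims(group_s1))    # []
```

With one group per scale, each group can be read on its own and the groups can then be arranged in a tree.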

jbms commented 2 years ago

@joshmoore Can you help me understand a bit more the status of multiscale proposals and support?

I think it might help if we can distinguish between Python APIs and data formats.

I took a quick look at xarray-datatree and I didn't see anything specifically related to multiscale support. Additionally, my understanding is that on top of the base zarr v2 data model, xarray adds only 2 things:

The _ARRAY_DIMENSIONS attribute to support named dimensions
The convention that an array with the same name as a dimension is a coordinate array.

Both of these features seem to be basically completely orthogonal to multiscale.
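
For readers less familiar with those two conventions, here is a simplified sketch using a plain dictionary in place of a Zarr v2 group. The array names and the helper `split_variables` are made up for illustration, and the real xarray logic handles more cases than this.

```python
# Stand-in for a Zarr v2 group: array name -> user attributes.
group = {
    "temperature": {"_ARRAY_DIMENSIONS": ["time", "x"]},
    "time": {"_ARRAY_DIMENSIONS": ["time"]},
    "x": {"_ARRAY_DIMENSIONS": ["x"]},
}


def split_variables(group):
    """Separate coordinate arrays from data arrays.

    An array whose name equals one of the dimension names used in the
    group is treated as a coordinate array; everything else is data.
    """
    dims = {d for attrs in group.values() for d in attrs["_ARRAY_DIMENSIONS"]}
    coords = sorted(name for name in group if name in dims)
    data = sorted(name for name in group if name not in dims)
    return coords, data


coords, data = split_variables(group)
print(coords)  # ['time', 'x']
print(data)    # ['temperature']
```

Nothing in this metadata says how one array relates to a lower-resolution copy of itself.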

The xarray-multiscale library seems to be a purely in-memory thing with no specific data format.

As far as the actual multiscale representation on disk in terms of zarr attributes, it sounds like we are talking about the format proposed here: https://github.com/zarr-developers/zarr-specs/issues/50

There was a lot of discussion on that issue so I'm not clear on whether there is an actual final proposed format. But do I understand correctly that the current proposal does not specify downsample factors or offsets for each level? If so I think it is critical that we rectify that as otherwise we must assume e.g. 2x downsampling and zero offset in all dimensions at each level, which obviously is extremely limiting.

I would propose that we rectify it as follows:

Add to each element of the "datasets" array the following properties:

"downsample_factors": Required. Must be an array of length equal to the number of dimensions. Each element is either an integer or a string of the form "a/b" specifying a rational number, and indicates the downsample factor relative to some "base" coordinate space. Typically the "downsample_factors" for the first dataset will be all 1.
"downsample_offsets": Optional. If specified, must be an array of length equal to the number of dimensions. Each element is either an integer or a string of the form "a/b" specifying a rational number. If not specified, defaults to all 0.

Note: Rational numbers allow non-integer downsample factors to be represented without any loss of precision, but in most cases both the downsample_factors and downsample_offsets will be integers.

At a given downsample level j, for a given dimension i, a given integer coordinate x into dimension i of the array at datasets[j].path corresponds to the interval [datasets[j].downsample_factors[i] * x + datasets[j].downsample_offsets[i], datasets[j].downsample_factors[i] * (x+1) + datasets[j].downsample_offsets[i]) of dimension i of the "base" coordinate space.
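
As a rough illustration of that mapping, the following sketch computes the base-space interval for one index using exact rational arithmetic. The example `datasets` values and the function name `base_interval` are assumptions made for this sketch; only the `downsample_factors` and `downsample_offsets` property names come from the proposal above.

```python
from fractions import Fraction


def base_interval(dataset, dim, x):
    """Map integer coordinate x of dimension `dim` to a half-open interval
    [lo, hi) in the "base" coordinate space.

    Factors and offsets may be integers or "a/b" strings; Fraction accepts
    both forms directly.
    """
    factor = Fraction(dataset["downsample_factors"][dim])
    offsets = dataset.get("downsample_offsets")
    # Offsets are optional and default to 0 in every dimension.
    offset = Fraction(offsets[dim]) if offsets is not None else Fraction(0)
    return factor * x + offset, factor * (x + 1) + offset


# Hypothetical metadata for two levels of a 2-D multiscale.
datasets = [
    {"path": "s0", "downsample_factors": [1, 1]},
    {
        "path": "s1",
        "downsample_factors": [2, "3/2"],
        "downsample_offsets": [0, "1/2"],
    },
]

for dim in (0, 1):
    lo, hi = base_interval(datasets[1], dim, 5)
    print(f"dimension {dim}: [{lo}, {hi})")
# dimension 0: [10, 12)
# dimension 1: [8, 19/2)
```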

Certainly there are other ways to specify this information, but it is critical that we decide on some way to specify it, and I think what I have proposed here is a reasonable and natural choice. Potentially for simplicity the rational number support could be skipped in a first version, and instead integers could be required.

d-v-b commented 2 years ago

@jbms can you explain why explicitly enumerating downscaling factors is preferred over the more explicit approach where each dataset declares its scale and offset?

jbms commented 2 years ago

I thought this proposal is being discussed in the context of zarr rather than OME and was not aware of a proposal to specify the offset and scale other than the OME coordinate transforms.

For applications where you intend to do integer indexing rather than interpolation-based continuous indexing, it is important to be able to represent the relative scales and offsets between levels exactly. Normally when dealing with physical units you would use floating point (and it is often reasonable for that purpose since physical units are surely approximate anyway) which means you must rely on inexact floating point arithmetic to determine relative scales and offsets. I suppose in principle if you represented the offsets and scales using an exact representation like rational numbers it would solve the issue. But in general in the integer indexing case it is the downsample factors rather than the units that may be more relevant so storing the units rather than the downsample factors seems less direct, not more direct.
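
A small example of the floating-point concern, with made-up voxel sizes (0.1 and 0.3 in some physical unit) chosen only to show the effect:

```python
from fractions import Fraction

# Hypothetical physical voxel sizes stored as floats at two levels.
base_scale = 0.1
level_scale = 0.3

# The relative downsample factor should be exactly 3, but is not.
print(level_scale / base_scale)       # 2.9999999999999996
print(level_scale / base_scale == 3)  # False

# The same sizes stored as rationals give the exact factor.
exact = Fraction("3/10") / Fraction("1/10")
print(exact == 3)                     # True
```

A reader that needs the integer factor would have to round and hope, whereas storing the factor itself (or exact rationals) avoids the guess.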

d-v-b commented 2 years ago

I thought this proposal is being discussed in the context of zarr rather than OME and was not aware of a proposal to specify the offset and scale other than the OME coordinate transforms.

Yes, this is what I had in mind.

But if, as you note, the idea of this issue is to have a multiscale zarr model that works for xarray, OME coordinate transforms are out of scope and probably redundant, since xarray solves the coordinate specification problem by treating coordinates as data. But if this multiscale zarr entails storing explicit coordinates, it's not clear if there's any need for special metadata describing downscaling factors.

jbms commented 2 years ago

My own interest is in a multiscale convention/format generally, not specifically related to xarray, and in particular not tied to the use of coordinate arrays, as I think for arrays defined on regular grids, coordinate arrays are a rather indirect and inconvenient representation for {offset + stride * i : 0 <= i < n}.
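
To make the contrast concrete, here is a minimal sketch of a regular grid described by three numbers, next to the coordinate array it would otherwise have to be stored as. The class name `RegularGrid` and its methods are hypothetical and do not correspond to any existing xarray or Zarr API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegularGrid:
    """The set {offset + stride * i : 0 <= i < n}, stored as three numbers."""

    offset: int
    stride: int
    n: int

    def coordinate(self, i):
        """Coordinate of index i, computed on demand."""
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.offset + self.stride * i

    def materialize(self):
        """The equivalent explicit coordinate array (n stored values)."""
        return [self.coordinate(i) for i in range(self.n)]


grid = RegularGrid(offset=1, stride=2, n=4)
print(grid.coordinate(3))  # 7
print(grid.materialize())  # [1, 3, 5, 7]
```

The implicit form stays the same size however large `n` is, and the offset and stride can be read directly instead of being inferred from stored values.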

rabernat commented 2 years ago

Just chiming in to note that:

xarray does not require the use of coordinate arrays; a dimension can have no index coordinate (and just use logical numpy-style indexing)
after the ongoing index refactor is complete, we will be able to support much more flexible indexes, including the sort of implicit range indexes described above

I'm also not saying that anyone should have to use xarray either. But if Zarr can describe these indexes in its metadata, Xarray should be able to parse them out and turn them into useable index objects (as it does today with dimension coordinates).

We have discussed before (#122) where such index metadata conventions should live: in Zarr user attributes or in a special zarr extension? My question is whether the indexing question is separable from the multiscale convention? Or must these be addressed together?

jbms commented 2 years ago

Thanks for your explanation, @rabernat. I think the distinction between zarr user attributes or zarr core attributes is not too important --- it seems quite reasonable to use zarr user attributes, but I would still like to have a standard so that tools can interoperate.

To me it doesn't make sense to define a "multiscale array" as a concept without specifying what the scales actually are. Otherwise you are just saying --- here are some arrays that represent the same data at different scales, but good luck in figuring out how they correspond. I don't see how a tool would make any use of that. So I don't think we can address multiscale without addressing these indexing issues.

But on the other hand perhaps indexing can be addressed before addressing multiscale.

Per the suggestion by @d-v-b that the metadata live in the per-scale array rather than the multiscale metadata attribute, we could simply move downsample_factors and downsample_offsets as I proposed to be array attributes, perhaps renamed given their more general use. Then xarray could read those attributes and produce an index object (after the refactor) or just materialize an actual coordinate array.

d-v-b commented 2 years ago

So I don't think we can address multiscale without addressing these indexing issues.

Agreed. Downsampling data necessarily generates a new coordinate grid for that data; consumers need to know the downsampled coordinate grid in order to meaningfully relate a downsampled image back to the original image. A specification that merely encodes "here are some arrays that all have the same dimension names" isn't of much use without encoding the coordinate grid for each image.

joshmoore commented 2 years ago

jbms commented 14 hours ago: "I think it might help if we can distinguish between Python APIs and data formats."

This is a good point and things are certainly still intermingled. Much of this issue is about interoperability at the Python level and about expressing a desire to support xarray's upcoming hierarchical functionality on the Zarr side: finding out what lessons need to be learned, etc., rather than completely specifying multiscale metadata as we're doing with OME-NGFF.

jbms commented 4 hours ago: "I thought this proposal is being discussed in the context of zarr rather than OME and was not aware of a proposal to specify the offset and scale other than the OME coordinate transforms."

This issue is definitely for the Zarr side and independent of OME-NGFF, but I don't think we're at the stage of building the entire spec now. One outcome that I think we could shoot for is deciding whether that work will be taken on and, if so, where.

jbms commented 2 hours ago: "My own interest is in a multiscale convention/format generally, not specifically related to xarray, and in particular not tied to the use of coordinate arrays"

Assuming we develop an extension/convention here, one thing that occurs to me is how to balance it with the metadata in the OME-NGFF spec. Is there overlap? Conversion? ... Confusion?

jbms commented 2 hours ago: "To me it doesn't make sense to define a "multiscale array" as a concept without specifying what the scales actually are."

It would be interesting to hear what others have to say on that front. @christophenoel? At least with the xarray API, there are conceivably some operations that would still be useful even without the metadata.

jbms commented 2 years ago

Thanks for your comments @joshmoore and @d-v-b.

Thinking about this a bit more, I think it might be best if OME can just be made to also work for the discrete indexing case (i.e. without needing to use floating point arithmetic) --- then we could just have a single multiscale spec.

The discrete indexing case would just be a special case of the general multiscale array/view where it so happens that no floating point transforms are required.

Here is an idea:

An OME multiscale has a single associated coordinate space, which typically matches the (integer-indexed) coordinate space of the base array.

Coordinate spaces are basically the same as in the current OME spec, except that their units can have arbitrary coefficients, not just powers of 10. That allows us to have coordinate spaces where we can still do useful integer indexing.

For example:

/my_multiscale (group)
attributes:
  {"ome_multiscale": {
    "datasets": ["s0", "s1", "s2"]
  },
   "ome_space": [
      {"name": "x", "unit": "4 nm"},
      {"name": "y", "unit": "5 nm"}
   ]}

/my_multiscale/s0 (array)
attributes:
  {"ome_space": "/my_multiscale"}

/my_multiscale/s1 (array)
attributes:
  {"ome_transformations": {
     "my_transform": {
        "input_space": "/my_multiscale",
        "output_space": ".",
        "type": "scale",
        "scales": [2, 2]
     }
    }
  }

/my_multiscale/s2 (array)
attributes:
  {"ome_transformations": {
     "my_transform": {
        "input_space": "/my_multiscale",
        "output_space": ".",
        "type": "scale",
        "scales": [4, 4]
     }
    }
  }

For each dataset listed in "datasets", it is required to either have the same coordinate space as the multiscale (as is the case for s0) or there must be a transformation defined between its coordinate space (the "self" coordinate space is indicated by the relative path ".") and the multiscale's coordinate space.

I think this representation also addresses the concern by @d-v-b that each array stand on its own --- if you open /my_multiscale/s1 on its own, you will see that there is a coordinate transform to the "/my_multiscale" coordinate space and can view it under that space if you wish.
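
The requirement on each listed dataset can be checked mechanically. The sketch below copies the example attributes into a dictionary keyed by path and looks up, for each dataset, either a shared coordinate space or a "scale" transformation from the multiscale's space. The attribute names are the ones from this idea, not an adopted specification, and the function `scales_to_multiscale` is hypothetical and handles only the "scale" type.

```python
# The example attributes above, keyed by node path.
nodes = {
    "/my_multiscale": {
        "ome_multiscale": {"datasets": ["s0", "s1", "s2"]},
        "ome_space": [
            {"name": "x", "unit": "4 nm"},
            {"name": "y", "unit": "5 nm"},
        ],
    },
    "/my_multiscale/s0": {"ome_space": "/my_multiscale"},
    "/my_multiscale/s1": {"ome_transformations": {"my_transform": {
        "input_space": "/my_multiscale", "output_space": ".",
        "type": "scale", "scales": [2, 2]}}},
    "/my_multiscale/s2": {"ome_transformations": {"my_transform": {
        "input_space": "/my_multiscale", "output_space": ".",
        "type": "scale", "scales": [4, 4]}}},
}


def scales_to_multiscale(nodes, root):
    """Return dataset name -> scale factors relating it to the root space.

    Raises ValueError if a dataset neither shares the root's coordinate
    space nor declares a "scale" transformation involving it.
    """
    ndim = len(nodes[root]["ome_space"])
    result = {}
    for name in nodes[root]["ome_multiscale"]["datasets"]:
        attrs = nodes[f"{root}/{name}"]
        if attrs.get("ome_space") == root:
            # Same coordinate space as the multiscale: identity scaling.
            result[name] = [1] * ndim
            continue
        for transform in attrs.get("ome_transformations", {}).values():
            if (transform["input_space"] == root
                    and transform["output_space"] == "."
                    and transform["type"] == "scale"):
                result[name] = transform["scales"]
                break
        else:
            raise ValueError(f"{name}: no link to coordinate space {root}")
    return result


print(scales_to_multiscale(nodes, "/my_multiscale"))
# {'s0': [1, 1], 's1': [2, 2], 's2': [4, 4]}
```

Because every factor here is an integer, a consumer can do discrete indexing across levels without any floating-point arithmetic.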

joshmoore commented 2 years ago

jbms commented 2 hours ago: "Thinking about this a bit more, I think it might be best if OME can just be made to also work for the discrete indexing case (i.e. without needing to use floating point arithmetic) --- then we could just have a single multiscale spec."

A benefit of unifying the two in one spec would be the possibility of extracting it wholesale in the future for wider adoption.

However, I note that we've now diverted this issue away from its initial purpose of capturing the ongoing integration work with xarray. Can we continue your suggestion in https://github.com/zarr-developers/zarr-specs/issues/125#issuecomment-1064485521, @jbms, along with @bogovicj's work in https://github.com/ome/ngff/issues/101 ?

christophenoel commented 2 years ago

Only for information: in the latest GeoZarr version we decided to rely on the 'historical' zoom level conventions: the de facto standard of level 0 as 256x256 pixels covering the entire world (with the default pseudo Plate Carrée non-projection), and the scale doubled at each level, as per https://wiki.openstreetmap.org/wiki/Zoom_levels (see also the ArcGIS note "A brief history of zoom levels": https://developers.arcgis.com/documentation/mapping-apis-and-services/reference/zoom-levels-and-scale/ ).

The group name indicates the zoom level. These conventions are simple, well supported by viewers, and very similar to the overviews mechanism in Cloud Optimized GeoTIFF (COG).
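
For illustration, the arithmetic behind that convention is short. This sketch assumes only what is stated above (256x256 pixels at level 0, doubling per level) and counts 256-pixel tiles in the way the linked OpenStreetMap page does. The function names are made up for this example and are not taken from the GeoZarr specification.

```python
TILE_SIZE = 256  # pixels per side at zoom level 0


def level_size(zoom):
    """Pixels per side of the whole-world image at a given zoom level."""
    return TILE_SIZE * 2 ** zoom


def tile_count(zoom):
    """Number of 256x256 tiles needed to cover the world at that level."""
    per_side = level_size(zoom) // TILE_SIZE
    return per_side * per_side


for zoom in range(4):
    print(zoom, level_size(zoom), tile_count(zoom))
# 0 256 1
# 1 512 4
# 2 1024 16
# 3 2048 64
```

Since the group name is the zoom level, a viewer can work out each level's pixel grid from the name alone.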

I will very soon publish a demonstration video and an OpenLayers extension that supports GeoZarr, including multiscaling.

GeoZarr multiscales: https://github.com/christophenoel/geozarr-spec/blob/main/geozarr-spec.md#multiscales

bogovicj commented 2 years ago

@joshmoore @jbms ,

I'll include something on discrete indexing in the multiscale section next time I edit the ome-zarr spec. Thanks for the mention!

christophenoel commented 2 years ago

Hi. Below are the documentation resources I mentioned:

Project Presentation (at WGISS-53) GeoZarr Data Store - Context of the ESA GSTP project
Project Presentation (at DAP) Hyperspectral Data Store and Access Project
Demo: GeoZarr Visual Portrayals and OpenLayers extension
Demo: GeoZarr Fast Time Series Plotting
Demo: GeoZarr Compute and plot NDWI index at runtime
Demo: GeoZarr Catalogue Integration
Demo: GeoZarr Serverless Visualisation and Pixel-Based Access

zarr-developers / zarr-specs

Multiscale convention #125