Extension proposal: multiscale arrays v0.1

zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays

https://zarr-specs.readthedocs.io/

Creative Commons Attribution 4.0 International


Extension proposal: multiscale arrays v0.1 #50

Closed joshmoore closed 4 years ago

joshmoore commented 4 years ago

This issue has been migrated to an image.sc topic after the 2020-05-06 community discussion. Authors are still encouraged to make use of the specification in their own libraries. As the v3 extension mechanism matures, the specification will be updated and registered as appropriate. Feedback and change requests are welcome either on this repository or on image.sc.

As a first draft of support for the multiscale use-case (https://github.com/zarr-developers/zarr-specs/issues/23), this issue proposes an intermediate nomenclature for describing groups of Zarr arrays which are scaled down versions of one another, e.g.:

example/
├── 0    # Full-sized array
├── 1    # Scaled down 0, e.g. 0.5; for images, in the X&Y dimensions
├── 2    # Scaled down 1, ...
├── 3    # Scaled down 2, ...
└── 4    # Etc.

This layout was independently developed in a number of implementations and has since been implemented in others, including:

Using a common metadata representation across implementations:

fosters a common vocabulary between existing implementations
enables other implementations to reliably detect multiscale arrays
permits the upgrade of v0.1 arrays to future versions of this or other extensions
tests this extension for limitations against multiple use cases

A basic example of the metadata that is added to the containing Zarr group is seen here:

{
  "multiscales": [
    {
      "datasets": [
        {"path": "0"},
        {"path": "1"},
        {"path": "2"},
        {"path": "3"},
        {"path": "4"}
      ],
      "version": "0.1"
    }
    // See the detailed example below for optional metadata
  ]
}
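As a rough illustration of how an implementation might detect multiscale arrays from this metadata, here is a minimal Python sketch. It works on the parsed group attributes only; the function name multiscale_paths is hypothetical and not part of any library, and the JSON below omits the explanatory comment so that it parses.

```python
import json


def multiscale_paths(attrs):
    """Return one list of dataset paths per multiscale series, or [] if none.

    `attrs` is the parsed content of the group's attributes (.zattrs).
    Hypothetical helper for illustration, not part of zarr.
    """
    series = attrs.get("multiscales")
    if not isinstance(series, list):
        return []
    result = []
    for entry in series:
        datasets = entry.get("datasets", []) if isinstance(entry, dict) else []
        paths = [d["path"] for d in datasets if isinstance(d, dict) and "path" in d]
        if paths:
            result.append(paths)
    return result


# Group attributes as they would appear in the group's .zattrs file.
zattrs_text = """
{
  "multiscales": [
    {
      "datasets": [{"path": "0"}, {"path": "1"}, {"path": "2"}],
      "version": "0.1"
    }
  ]
}
"""
paths = multiscale_paths(json.loads(zattrs_text))
```

A group without a "multiscales" key simply yields an empty list, so the same call can be used to probe arbitrary groups.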

Process

An RFC process for Zarr does not yet exist. Additionally, the v3 spec is a work-in-progress. However, since the implementations listed above as well as others are already being developed, I'd propose that if a consensus can be reached here, this issue should be turned into an .rst file similar to those in the v3 branches (e.g. filters) and used as a temporary spec for defining arrays, with the understanding that this is a prototype intended to be amended and brought into the general extension mechanism as it develops.

I'd welcome any suggestions/feedback, but especially around:

Better terms for "multiscale" and "series"
The most useful enum values
Is this already too complicated? (Limit to one series per group?) or on the flip side:
Are there existing use cases that aren't supported? (Note: I'm aware of some examples like BDV's N5 format but I'd suggest they are higher-level than just "multiscale arrays".)

Deadline for a first round of comments: ~~March 15, 2020~~
Deadline for a second round of comments: April 15, 2020

Detailed example

Color key (according to https://www.ietf.org/rfc/rfc2119.txt):

- MUST     : If these values are not present, the multiscale series will not be detected.
! SHOULD   : Missing values may cause issues in future versions.
+ MAY      : Optional values which can be readily omitted.
# UNPARSED : When updating between versions, no transformation will be performed on these values.

Color-coded example:

-{
-  "multiscales": [
-    {
!      "version": "0.1",
!      "name": "example",
-      "datasets": [
-        {"path": "0"},
-        {"path": "1"},
-        {"path": "2"}
-      ],
!      "type": "gaussian",
!      "metadata": {
+        "method":
#          "skimage.transform.pyramid_gaussian",
+        "version":
#          "0.16.1",
+        "args":
#          [true],
+        "kwargs":
#          {"multichannel": true}
!      }
-    }
-  ]
-}
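The requirement levels in the color key could be checked along the following lines. This is only a sketch under the assumption that MUST violations are reported as errors and missing SHOULD keys as warnings; check_multiscales and SHOULD_KEYS are hypothetical names, and the MAY and UNPARSED values are deliberately left untouched.

```python
# Keys marked "!" (SHOULD) in the color-coded example above.
SHOULD_KEYS = ("version", "name", "type", "metadata")


def check_multiscales(attrs):
    """Return (errors, warnings) for a group's parsed attributes.

    Errors correspond to MUST keys (the series would not be detected);
    warnings correspond to SHOULD keys. Hypothetical helper.
    """
    errors, warnings = [], []
    series = attrs.get("multiscales")
    if not isinstance(series, list) or not series:
        return ["missing or empty 'multiscales' list"], warnings
    for i, entry in enumerate(series):
        if not isinstance(entry, dict):
            errors.append(f"multiscales[{i}]: not an object")
            continue
        datasets = entry.get("datasets")
        if not isinstance(datasets, list) or not datasets:
            errors.append(f"multiscales[{i}]: missing or empty 'datasets'")
        else:
            for j, dataset in enumerate(datasets):
                if not isinstance(dataset, dict) or "path" not in dataset:
                    errors.append(f"multiscales[{i}].datasets[{j}]: missing 'path'")
        for key in SHOULD_KEYS:
            if key not in entry:
                warnings.append(f"multiscales[{i}]: missing '{key}'")
    return errors, warnings


minimal = {"multiscales": [{"datasets": [{"path": "0"}, {"path": "1"}]}]}
errors, warnings = check_multiscales(minimal)
```

For the minimal group above, the series is detectable (no errors) but four SHOULD keys are reported as missing.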

Explanation

Multiple multiscale series of datasets can be present in a single group.
By convention, the first multiscale should be chosen if all else is equal.
Alternatively, a multiscale can be chosen by name or, with slightly more effort, by the zarray metadata such as chunk size.
The paths to the arrays are ordered from largest to smallest.
These paths could potentially point to datasets in other groups via “../foo/0” in the future. For now, the identifiers MUST be local to the annotated group.
These values SHOULD (MUST?) come from the enumeration below.
The metadata example is taken from https://scikit-image.org/docs/dev/api/skimage.transform.html#skimage.transform.pyramid_reduce
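The selection rules above (first series by default, otherwise by name) and the v0.1 restriction to local paths might be sketched as follows. Both function names are hypothetical, and is_local_path assumes that "local" means neither absolute nor containing a ".." component, which the text does not spell out.

```python
def choose_multiscale(multiscales, name=None):
    """Pick a series: the first one by convention, or the one matching `name`."""
    if not multiscales:
        raise ValueError("no multiscale series present")
    if name is None:
        return multiscales[0]
    for entry in multiscales:
        if entry.get("name") == name:
            return entry
    raise KeyError(name)


def is_local_path(path):
    """Assumed reading of 'local to the annotated group': relative, no '..'."""
    parts = path.split("/")
    return bool(path) and not path.startswith("/") and ".." not in parts


multiscales = [
    {"name": "gaussian", "datasets": [{"path": "base"}, {"path": "G1"}]},
    {"name": "laplacian", "datasets": [{"path": "base"}, {"path": "L1"}]},
]
default = choose_multiscale(multiscales)
by_name = choose_multiscale(multiscales, name="laplacian")
```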

Type enumeration:

gaussian, e.g. skimage.transform.pyramid_gaussian
laplacian, e.g. skimage.transform.pyramid_laplacian
reduce, e.g. skimage.transform.pyramid_reduce
pick, e.g. SimpleImageScaler's “top-left” strategy
Suggestions welcome
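Since it is still open whether "type" SHOULD or MUST come from the enumeration, a reader could treat unknown values as something to report rather than reject. A small sketch, with KNOWN_TYPES and unknown_types as hypothetical names:

```python
# The four values proposed in the enumeration above.
KNOWN_TYPES = frozenset({"gaussian", "laplacian", "reduce", "pick"})


def unknown_types(multiscales):
    """Return the 'type' values that are not in the proposed enumeration."""
    return [
        entry["type"]
        for entry in multiscales
        if "type" in entry and entry["type"] not in KNOWN_TYPES
    ]


flagged = unknown_types([
    {"type": "gaussian"},
    {"type": "nearest"},   # not in the proposed list
    {"name": "untyped"},   # "type" is optional, so this is not flagged
])
```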

Sample code

#!/usr/bin/env python
import argparse
import zarr
import numpy as np
from skimage import data
from skimage.transform import pyramid_gaussian, pyramid_laplacian

parser = argparse.ArgumentParser()
parser.add_argument("zarr_directory")
ns = parser.parse_args()

# 1. Setup of data and Zarr directory
base = np.tile(data.astronaut(), (2, 2, 1))

gaussian = list(
    pyramid_gaussian(base, downscale=2, max_layer=4, multichannel=True)
)

laplacian = list(
    pyramid_laplacian(base, downscale=2, max_layer=4, multichannel=True)
)

store = zarr.DirectoryStore(ns.zarr_directory)
grp = zarr.group(store)
grp.create_dataset("base", data=base)

# 2. Generate datasets
# Level 0 of each series reuses the shared "base" array; only the
# downscaled levels are written as new arrays.
series_G = []
for level, dataset in enumerate(gaussian):
    if level == 0:
        path = "base"
    else:
        path = "G%s" % level
        grp.create_dataset(path, data=dataset)
    series_G.append({"path": path})

series_L = []
for level, dataset in enumerate(laplacian):
    if level == 0:
        path = "base"
    else:
        path = "L%s" % level
        grp.create_dataset(path, data=dataset)
    series_L.append({"path": path})

# 3. Generate metadata block
multiscales = []
for name, series in (("gaussian", series_G),
                     ("laplacian", series_L)):
    multiscale = {
      "version": "0.1",
      "name": name,
      "datasets": series,
      "type": name,
    }
    multiscales.append(multiscale)
grp.attrs["multiscales"] = multiscales

which results in a .zattrs file of the form:

{
    "multiscales": [
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "G1"
                },
                {
                    "path": "G2"
                },
                {
                    "path": "G3"
                },
                {
                    "path": "G4"
                }
            ],
            "name": "gaussian",
            "type": "gaussian",
            "version": "0.1"
        },
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "L1"
                },
                {
                    "path": "L2"
                },
                {
                    "path": "L3"
                },
                {
                    "path": "L4"
                }
            ],
            "name": "laplacian",
            "type": "laplacian",
            "version": "0.1"
        }
    ]
}

and the following on-disk layout:

/var/folders/z5/txc_jj6x5l5cm81r56ck1n9c0000gn/T/tmp77n1ga3r.zarr
├── G1
│   ├── 0.0.0
...
│   └── 3.1.1
├── G2
│   ├── 0.0.0
│   ├── 0.1.0
│   ├── 1.0.0
│   └── 1.1.0
├── G3
│   ├── 0.0.0
│   └── 1.0.0
├── G4
│   └── 0.0.0
├── L1
│   ├── 0.0.0
...
│   └── 3.1.1
├── L2
│   ├── 0.0.0
│   ├── 0.1.0
│   ├── 1.0.0
│   └── 1.1.0
├── L3
│   ├── 0.0.0
│   └── 1.0.0
├── L4
│   └── 0.0.0
└── base
    ├── 0.0.0
...
    └── 1.1.1

9 directories, 54 files
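Reading such a directory back does not strictly require zarr. The sketch below uses only the standard library and relies on the Zarr v2 convention that a group's attributes live in .zattrs and an array's shape in .zarray. The helper names are hypothetical, and the demo writes hand-made metadata files (no chunks, illustrative shapes) rather than a real store.

```python
import json
import math
import tempfile
from pathlib import Path


def read_level_shapes(group_dir, series_index=0):
    """Return [(path, shape), ...] for one multiscale series of a v2 group."""
    group_dir = Path(group_dir)
    attrs = json.loads((group_dir / ".zattrs").read_text())
    datasets = attrs["multiscales"][series_index]["datasets"]
    shapes = []
    for dataset in datasets:
        zarray = json.loads((group_dir / dataset["path"] / ".zarray").read_text())
        shapes.append((dataset["path"], tuple(zarray["shape"])))
    return shapes


def is_largest_first(shapes):
    """Check the 'ordered from largest to smallest' rule by element count."""
    sizes = [math.prod(shape) for _, shape in shapes]
    return all(a >= b for a, b in zip(sizes, sizes[1:]))


def write_fake_group(root):
    """Write stand-in metadata files; shapes are illustrative only."""
    levels = {"base": [1024, 1024, 3], "G1": [512, 512, 3], "G2": [256, 256, 3]}
    for path, shape in levels.items():
        (root / path).mkdir()
        (root / path / ".zarray").write_text(json.dumps({"shape": shape}))
    attrs = {
        "multiscales": [
            {
                "version": "0.1",
                "name": "gaussian",
                "datasets": [{"path": p} for p in levels],
            }
        ]
    }
    (root / ".zattrs").write_text(json.dumps(attrs))


with tempfile.TemporaryDirectory() as tmp:
    write_fake_group(Path(tmp))
    shapes = read_level_shapes(tmp)
    ordered = is_largest_first(shapes)
```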

Revision	Source	Date	Description
6	External feedback on twitter and image.sc	2020-05-06	Remove "scale"; clarify ordering and naming
5	External bug report from @mtbc	2020-04-21	Fixed error in the simple example
4	https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-599782137	2020-04-08	Changed "name" to "path"
3	Discussions up through https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-599782137	2020-04-01	Updated naming schema
2	https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595505162	2020-03-07	Fixed typo
1	@joshmoore	2020-03-06	Original text from in person discussions

Thanks to @ryan-williams, @jakirkham, @freeman-lab, @petebankhead, @jni, @sofroniewn, @chris-allan, and anyone else whose GitHub account I've forgotten for the preliminary discussions.

manzt commented 4 years ago

I'm really happy to see this. We initially used a layout for storing pyramids similar to the one proposed above, and it's fantastic to see this formalized.

I'm curious about the decision to store the base array in the same group as the downsampled levels. We initially did the same, but then moved towards a structure separating the two:

└── example/
    ├── .zgroup
    ├── base
    │   ├── .zarray
    │   ├── .zattrs
    │   ├── 0.0.0
    │   └── ...etc
    └── sub-resolutions/
        ├── .zgroup
        ├── .zattrs
        ├── 01/
        │   ├── .zarray
        │   ├── 0.0.0
        │   └── ...etc
        └── 02/
            ├── .zarray
            ├── 0.0.0
            └── ...etc

as a more general "image" format in zarr. One could expect to find a "base" array and then check for the "sub-resolutions" group to determine if it is a pyramid or not. We thought this structure would allow for other types of data (e.g. segmentation) to be stored alongside the base array. Again, thanks for the work here in formalizing this!

joshmoore commented 4 years ago

Thanks, @manzt. Let's see if there are more votes for the deeper representation. It's certainly also what I was originally thinking about in #23. The downside is that one likely needs metadata on all the datasets pointing up and down the hierarchy in order to support detection of the sequence from any scale. It's the other major design layout I can think of. (If anyone has more, those would be very welcome.)

sofroniewn commented 4 years ago

@joshmoore amazing to see this kick off. A couple short comments

If looking for alternate names I'd consider multiresolution, but multiscale definitely works for me. We have been using pyramid in napari but are thinking of changing (see here https://github.com/napari/napari/issues/1019#issuecomment-595325260 and we can try and go with whatever the majority likes).
One thing that has come up for me with a list of "scales" is that when you have large volumetric timeseries, where you might create a pyramid for each timepoint, some of the axes are unscaled, so you really need to look at the shapes of the arrays to do the right thing. I see that the field is optional but I wonder how much is gained from it (I'm not opposed, though, and would probably find a use for it, but wanted to put out this caveat).
Multiple series per group is probably good flexibility to have, say if you have two independent multiscale datasets you want to put in the same group, it lets the group abstraction remain separate from the multiscale details.
The concept of having base + subresolutions like @manzt proposes is intriguing to me too. Ultimately for visualization purposes I want something like a single list of arrays, so I guess I find that representation a little simpler, but I can construct that from the latter representation if I know the data is multiscale, and maybe it is nice to keep that a little separate. I will think on it more, curious what others say.

d-v-b commented 4 years ago

Glad to see the discussion here. Some thoughts:

Philosophically, I'd like to suggest a few constraints (both of which are satisfied by @joshmoore's proposal, but not by a lot of other existing multiscale image schemas): First, individual images should be portable -- wherever possible, images should not have metadata/attributes that indicate their role in a multiscale representation, so that they can be copied somewhere else and viewed on their own without losing context. Second, no magic dataset names like s0, s1, etc. The use of the list of datasets in @joshmoore's group attributes solves this problem.
Personally I'm not a fan of putting the base image at a different level of the hierarchy, since most software I've seen assumes that the different scale levels will all be elements in the same collection. @manzt you suggest that you adopted this structure in order to facilitate checking for a multiscale representation, but I think this is a job for group metadata, not hierarchy.
~~For simplicity, I would propose a restriction of one multiscale representation per group. Groups are cheap; if you want to represent 2 multiscale images, then make 2 groups.~~ (This doesn't work for multiple multiscale representations that use the same base image, e.g. gaussian and laplacian pyramids). The use of the series group metadata in @joshmoore's proposal handles this nicely.
A multiscale image is a collection of images. Accordingly, the "multiscaleness" should be a group attribute that lists the images in the collection, which is how @joshmoore does it in the draft proposal. I would add some dataset-specific information to the group attributes: software that consumes multiscale images needs to know the spatial properties of each image, and on cloud storage it can be cumbersome to query each image individually; so for convenience this image metadata could also be in the group attributes that describe the multiscale representation. I think explicitly listing the transform attributes of each image is safer than just listing "scales", as long as the transform attributes of each image are small.

Here's example metadata that implements this concept. The specifics of the "transform attributes" don't really matter -- this could be an affine transform, or something fancier. But I think the basic idea of putting the spatial information of each dataset in the group attributes is solid.

// group attributes
{
  "multiscale": {
    "version" : "0.1",
    "datasets" : ["0" : {transform attributes of 0},
                  "1" : {transform attributes of 1},
                  "2" : {transform attributes of 2},
                  "3" : {transform attributes of 3},
                  "4" : {transform attributes of 4}]
    // optional stuff
  }
}

// example transform attributes of dataset 0

"transform" : {
    "offset" : {"X" : 0, "Y" : 0, "Z" : 0},
    "scale" : {"X" : 1, "Y" : 1, "Z" : 1},
    "units" : {"X" : "nm", "Y" : "nm", "Z" : "nm"}
} 

// example spatial attributes of dataset 1

"transform" : {
    "offset" : {"X" : 0.5, "Y" : 0.5, "Z" : 0},
    "scale" : {"X" : 2, "Y" : 2, "Z" : 1},
    "units" : {"X" : "nm", "Y" : "nm", "Z" : "nm"}
}
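The numbers in the two example transforms are consistent with one particular convention: each downscaled pixel is centered on the block of base pixels it summarizes, so an axis downscaled by a factor f gets scale f and offset (f - 1) / 2 in base units. That convention is an assumption here (the comment above does not state it), and level_transform is a hypothetical helper.

```python
def level_transform(factors, base_scale, unit="nm"):
    """Build a transform dict for one level.

    factors:    per-axis downsampling factor relative to the base image
    base_scale: per-axis pixel spacing of the base image
    Assumes each coarse pixel is centered on the block of base pixels
    it was computed from (an assumption, not stated in the thread).
    """
    return {
        "offset": {ax: (f - 1) / 2 * base_scale[ax] for ax, f in factors.items()},
        "scale": {ax: f * base_scale[ax] for ax, f in factors.items()},
        "units": {ax: unit for ax in factors},
    }


base_scale = {"X": 1, "Y": 1, "Z": 1}
level0 = level_transform({"X": 1, "Y": 1, "Z": 1}, base_scale)
level1 = level_transform({"X": 2, "Y": 2, "Z": 1}, base_scale)
```

With these inputs, level1 reproduces the offsets (0.5, 0.5, 0) and scales (2, 2, 1) of the dataset 1 example.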

For posterity, I've written about this issue (as it pertains to the data our group works with) here

manzt commented 4 years ago

@sofroniewn The concept of having base + subresolutions like @manzt proposes is intriguing to me too. Ultimately for visualization purposes I want something like a single list of arrays so I guess I find that representation little simpler, but I can construct that from the later representation if I know the data is multiscale and maybe it is nice to keep that a little separate. I will think on it more, curious what others say.

I generally have the same feelings. I'm for the simplicity of the current proposal, and I wonder if my suggestion adds an extra layer of complexity unnecessarily.

@d-v-b For simplicity, I would propose a restriction of one multiscale representation per group. Groups are cheap; if you to represent 2 multiscale images, then make 2 groups.

Wouldn't this require copying the base image into a separate group? Perhaps I'm misunderstanding.

d-v-b commented 4 years ago

Wouldn't this require copying the base image into a separate group? Perhaps I'm misunderstanding.

The base image would be in the same group with the downscaled versions. So on the file system, it would look like this:

└── example/
    ├── .zgroup
    ├── base
    │   ├── .zarray
    │   ├── .zattrs
    │   ├── 0.0.0
    │   └── ...etc
    ├── base_downscaled
    │   ├── .zarray
    │   ├── .zattrs
    │   ├── 0.0.0
    │   └── ...etc
    ...etc

manzt commented 4 years ago

Apologies, I thought you were suggesting that separate groups should be created for different sampling of the same base image (e.g. gaussian and laplacian).

d-v-b commented 4 years ago

@manzt this is actually my mistake -- I was not thinking at all about the use case where the same base image is used for multiple pyramids, and I agree that copying data is not ideal. I will remove / amend the "one multiscale representation per group" part of my proposal above.

thewtex commented 4 years ago

I would add some dataset-specific information to the group attributes: software that consumes multiscale images needs to know about how the spatial properties of each image, and on cloud storage it can be cumbersome to query each image individually;

Adding to the practical importance here: the spatial position of the first pixel is shifted in subresolutions, and the physical spacing between pixels changes also. This must be accounted for during visualization or analysis when other datasets, e.g. other images or segmentations, come into play. If this metadata is readily and independently available for every subresolution, i.e. scale factors do not need to be fetched and computations made, each subresolution image can be used independently, effortlessly, and without computational overhead.
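To make the shift concrete, here is a small sketch that maps a pixel index to a physical position using per-level offset and spacing, assuming the simple model position = offset + index * scale on each axis. The function name and the transform dictionaries are illustrative and follow the "offset"/"scale" keys used earlier in the thread.

```python
def pixel_to_world(index, transform):
    """Map a per-axis pixel index to a physical position.

    Assumes position = offset + index * scale on every axis.
    """
    return {
        ax: transform["offset"][ax] + index[ax] * transform["scale"][ax]
        for ax in index
    }


base = {"offset": {"X": 0, "Y": 0}, "scale": {"X": 1, "Y": 1}}
half = {"offset": {"X": 0.5, "Y": 0.5}, "scale": {"X": 2, "Y": 2}}

# The first pixel of the half-resolution level sits between base pixels 0 and 1.
first_half = pixel_to_world({"X": 0, "Y": 0}, half)
# Base pixel (3, 3) and half-resolution pixel (1, 1) land 0.5 units apart.
base_33 = pixel_to_world({"X": 3, "Y": 3}, base)
half_11 = pixel_to_world({"X": 1, "Y": 1}, half)
```

If the offset were ignored, overlays such as segmentations would drift by a growing fraction of a pixel at each coarser level.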

One option is to build on the model implied by storing images in the Xarray project data structures, which has Zarr support. This enables storing metadata such as the position of the first pixel, the spacing between pixels, and identification of the array dimensions, e.g., x, y, t, so that data can be used and passed through processing pipelines and visualization tools. This is helpful because it enables distributed computing via Dask and machine learning [2] via the scikit-learn API. Xarray has broad community adoption, and it is gaining more traction lately. Of course, a model that is compatible with Xarray does not require Xarray to use the data. On the other hand, Xarray coords have more flexibility than what is required for pixels sampled on a uniform rectilinear grid, and this adds a little complexity to the layout.

Generated from this example, here is what it looks like:

.
├── level_1.zarr
│   ├── rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6
│   │   ├── 0.0.0
│   │   ├── 0.0.1
....
│   │   ├── 9.9.9
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── x
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── y
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── z
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── .zattrs
│   ├── .zgroup
│   └── .zmetadata
├── level_2.zarr
│   ├── rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6
│   │   ├── 0.0.0
│   │   ├── 0.0.1
....
│   │   ├── 8.9.9
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── x
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── y
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── z
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── .zattrs
│   ├── .zgroup
│   └── .zmetadata
....
├── rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6
│   ├── 0.0.0
│   ├── 0.0.1
...
│   ├── 9.9.9
│   ├── .zarray
│   └── .zattrs
├── x
│   ├── 0
│   ├── .zarray
│   └── .zattrs
├── y
│   ├── 0
│   ├── .zarray
│   └── .zattrs
├── z
│   ├── 0
│   ├── .zarray
│   └── .zattrs
├── .zattrs
├── .zgroup
└── .zmetadata

34 directories, 62359 files

This is the layout generated by xarray.Dataset.to_zarr. It does not mean that Xarray has to be used to read and write, but it would mean that Zarr images would be extremely easy to use via xarray. In this case, .zmetadata is generated on each subresolution so it can be used entirely independently. Due to how Xarray/Zarr handles coords, the coords (x, y, z) are one-dimensional arrays. This results in every resolution having its own group.

The metadata looks like this:

{
    "metadata": {
        ".zattrs": {
            "_MULTISCALE_LEVELS": [
                "",
                "level_1.zarr",
                "level_2.zarr",
                "level_3.zarr",
                "level_4.zarr",
                "level_5.zarr",
                "level_6.zarr"
            ],
            "_SPATIAL_IMAGE": "rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6"
        },

        ".zgroup": {
            "zarr_format": 2
        },
        "level_1.zarr/.zattrs": {},
        "level_1.zarr/.zgroup": {
            "zarr_format": 2
        },
        "level_1.zarr/rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zarray": {
            "chunks": [64, 64, 64],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 0
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [1080, 1280, 1280],
            "zarr_format": 2
        },
        "level_1.zarr/rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zattrs": {
            "_ARRAY_DIMENSIONS": ["z", "y", "x"],
            "direction": [
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]
            ],
            "units": "\u03bcm"
        },
        "level_1.zarr/x/.zarray": {
            "chunks": [1280],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "

Here _MULTISCALE_LEVELS avoids the need to hardcode the identifiers, as suggested by @d-v-b and @manzt, but it could be renamed to multiscale, etc. _ARRAY_DIMENSIONS is the key that Xarray uses in Zarr files to identify the dims.
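A reader could resolve the levels from the consolidated metadata roughly as sketched below. The key layout ("<level>/<image>/.zarray", with "" standing for the top-level group) follows the excerpt above; the function name, the short image name "image", and the base shape are made up for the example.

```python
def level_array_shapes(consolidated):
    """Map each entry of _MULTISCALE_LEVELS to the shape of its image array.

    `consolidated` is the parsed .zmetadata content. Levels whose .zarray
    entry is absent from the metadata are skipped.
    """
    meta = consolidated["metadata"]
    root_attrs = meta[".zattrs"]
    image = root_attrs["_SPATIAL_IMAGE"]
    shapes = {}
    for level in root_attrs["_MULTISCALE_LEVELS"]:
        prefix = f"{level}/" if level else ""  # "" is the top-level group
        key = f"{prefix}{image}/.zarray"
        if key in meta:
            shapes[level] = tuple(meta[key]["shape"])
    return shapes


# Cut-down stand-in for a .zmetadata file; "image" and the base shape
# are invented, the level_1 shape is the one shown in the excerpt.
consolidated = {
    "metadata": {
        ".zattrs": {
            "_MULTISCALE_LEVELS": ["", "level_1.zarr", "level_2.zarr"],
            "_SPATIAL_IMAGE": "image",
        },
        "image/.zarray": {"shape": [2160, 2560, 2560]},
        "level_1.zarr/image/.zarray": {"shape": [1080, 1280, 1280]},
    }
}
shapes_by_level = level_array_shapes(consolidated)
```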

This example is generated with itk, but it could also just as easily be generated with scikit-image, or dask-image via [1] (work in progress) or pyimagej.

sofroniewn commented 4 years ago

Thanks for the link to that example @thewtex! Conforming with xarray.Dataset.to_zarr where possible seems reasonable to me too.

@constantinpape, @bogovicj, @axtimwalde might also be interested in weighing in.

jni commented 4 years ago

👍 to flat vs hierarchical representation. Also 👍 to "multiscale".

I also like the constraint that the sub-datasets should be openable as zarr arrays by themselves. I think @thewtex's example satisfies this. Having said this, @thewtex, the xarray model looks too complex to me compared to @joshmoore's proposed spec. It would be great if it could be stripped down to its bare essentials. I agree that it's nice to have the pixel start coordinate handy, but it can also be computed after the fact, so it should be optional I think.

Last thing, which may be out of scope, but might not be: for visualisation, it is sometimes convenient to have the same array with different chunk sizes, e.g. orthogonal planes to all axes for a 3D image. I wonder if the same data/metadata layout standard can be used in these situations.

Oh and @joshmoore

anyone else who's GitHub account I've forgotten for the preliminary discussions

whose. Regret pinging me yet? =P

constantinpape commented 4 years ago

Great to see so much discussion on this proposal. I didn't have time to read through all of it yet, will try to catch up on the weekend. Fyi, there is a pyramid storage format for n5 used by BigDataViewer and paintera already and I have used this format for large volume representations as well: https://github.com/bigdataviewer/bigdataviewer-core/blob/master/BDV%20N5%20format.md

forman commented 4 years ago

Great to see this moving on!

In our projects xcube and xcube-viewer image pyramids look like so:

example.levels/
├── 0.zarr    # Full-sized array
├── 1.zarr    # Level-0 X&Y dimensions divided by 2^1
├── 2.zarr    # Level-0 X&Y dimensions divided by 2^2
├── 3.zarr    # Level-0 X&Y dimensions divided by 2^3
└── 4.zarr    # Etc.

As @joshmoore mentioned, this also works without special metadata, because:

To make pyramids discoverable, we simply use the file extension .levels.
Spatial resolutions decrease by factor 2^Level.
The number of levels is obvious from the entries in the .levels folder.
Level zero can also be named 0.lnk. In this case it contains the path to the original data rather than a copy of the "pyramidized" original dataset.

(See also the xcube level CLI tool that implements this.)
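For comparison with the metadata-based approach, a name-based discovery of this layout might look like the sketch below. It is not xcube's implementation; find_levels is a hypothetical helper, and it assumes that 0.lnk is a plain-text file containing just the path.

```python
import re
import tempfile
from pathlib import Path

# Entries named "<level>.zarr" or "<level>.lnk".
LEVEL_RE = re.compile(r"^(\d+)\.(zarr|lnk)$")


def find_levels(levels_dir):
    """Return the level locations of a '*.levels' folder, ordered by level.

    A '<n>.lnk' entry is assumed to be a text file holding the path to
    the data; a '<n>.zarr' entry is returned as-is.
    """
    levels_dir = Path(levels_dir)
    if levels_dir.suffix != ".levels":
        return []
    found = {}
    for child in levels_dir.iterdir():
        match = LEVEL_RE.match(child.name)
        if not match:
            continue
        level = int(match.group(1))
        if match.group(2) == "lnk":
            found[level] = Path(child.read_text().strip())
        else:
            found[level] = child
    return [found[level] for level in sorted(found)]


with tempfile.TemporaryDirectory() as tmp:
    pyramid = Path(tmp) / "example.levels"
    pyramid.mkdir()
    (pyramid / "0.lnk").write_text("/data/original.zarr\n")  # made-up path
    (pyramid / "1.zarr").mkdir()
    (pyramid / "2.zarr").mkdir()
    levels = find_levels(pyramid)
    level_names = [p.name for p in levels]
    not_a_pyramid = find_levels(tmp)
```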

We are looking forward to adapting our code to any commonly agreed-on Zarr "standard".

joshmoore commented 4 years ago

All-

Here's a quick summary from my side of discussions up to this point. Please send corrections/additions as you see fit. ~Josh

Apparent agreement

Name

The name "multiscale" seems to be generally acceptable (https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595332383, https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595505162)

Multiple series

Support for multiple series per group seems to be generally acceptable (e.g. https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595332383).

Special names

There are a few explicit votes for no special dataset names (e.g. https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595359246), but under "New ideas" there was one mention of group naming schemes.

Less clear

Layout

One primary decision point seems to be whether to use a deep or a flat layout:

Deep comments include https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595269311: "allows storing other types of data (segmentations) beside the base".
Flat comments typically revolve around simplicity: https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595371288, https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595332383, https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595505162

Here I'd add that if flat is generally accepted as being the simplest approach for getting started, later revisions can always move to something more sophisticated. However, I'm pretty sure at that point we would want metadata not just at a single group level but either on multiple groups or all related datasets (or both).

Scaling information

Another key issue seems to be the scaling information. There are a range of ways that have been shown:

The fairly simple "scales": [0.5, 0.5, 1, 1, 1] representation in the current revision (2) of this issue.
A transform submap with the keys "offset", "scale", and "units" (https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595359246).
An xarray representation with "direction" and "units" attributes (https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595432323).
The COSEM proposal from https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595359246 with gridSpacing and origin metadata of the form {"gridSpacing": [r_x, r_y, r_z], "origin": [o_x, o_y, o_z]}
The BDV N5 format from https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595662150 with downsamplingFactors at two possible locations (see "Either/or" below).

@sofroniewn asked whether they are even useful as they stand (https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595332383).

To be honest, I punted on this issue knowing that it would be harder to find consensus on it. To my mind, this could even be a second, though related, extension proposal. My reasoning is that it can also be used to represent the relationship between non-multiscale arrays, along the lines of @jni's "multiple chunk sizes" question below, and in the case of BDV, the relationship between the individual timepoints, etc.

My first question then would be: to what extent can the current multiscale proposal be of value without the spatial/scale/transform information?

New ideas

Explicit "name" key

@d-v-b's new proposed COSEM style from https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595359246 uses this format:

        {"multiscale": [{"name": "base",  ...}, {"name" : "L1", ...}]}

Though this would prevent directly consuming the list (e.g. datasets = multiscale["series"][0]["datasets"]), it might provide a nice balance of extensibility, especially depending on the results of the coordinates/scales/transforms discussion.

Group naming

@forman showed an example from xcube in https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-596449313 in which group names were used rather than metadata to detect levels:

example.levels/

Links

@forman also showed in https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-596449313 one solution for linking: "The level zero, can also be named 0.lnk. In this case it contains the path the original data rather than a copy of the 'pyramidized' original dataset." This would likely need to be a prerequisite proposal for this one if we were to follow that route. cc: @alimanfoo

Either/or logic

In @d-v-b's COSEM writeup from https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595359246, there is an example of either/or logic, where code would need to check in more than one location for a given piece of metadata:

 -     ├── (required) s1 (optional, unless "scales" is not a group level attribute): {"downsamplingFactors": [a, b, c]})
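In code, this kind of either/or lookup means every reader carries a fallback chain along the following lines. The sketch assumes a group-level "scales" attribute holding one list of factors per level, which is a guess at the structure since the excerpt only names the attribute; downsampling_factors is a hypothetical helper.

```python
def downsampling_factors(group_attrs, dataset_attrs, level):
    """Look up factors at the group level first, then on the dataset.

    Assumes group-level "scales" is a list with one entry per level
    (the excerpt does not define its structure).
    """
    scales = group_attrs.get("scales")
    if scales is not None:
        return list(scales[level])
    if "downsamplingFactors" in dataset_attrs:
        return list(dataset_attrs["downsamplingFactors"])
    raise KeyError("no downsampling information at group or dataset level")


# Case 1: the group carries the information for all levels.
from_group = downsampling_factors({"scales": [[1, 1, 1], [2, 2, 2]]}, {}, level=1)
# Case 2: only the dataset itself knows its factors.
from_dataset = downsampling_factors({}, {"downsamplingFactors": [2, 2, 1]}, level=1)
```

Both branches have to be implemented and tested by every consumer, which is the cost of allowing the metadata to live in two places.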

Multiple chunk sizes

@jni pondered in https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595505162: "for visualisation, it is sometimes convenient to have the same array with different chunk sizes, e.g. orthogonal planes to all axes for a 3D image. I wonder if the same data/metadata layout standard can be used in these situations."

For the record, I'd currently err on the side of:

sticking with a flat "multiscale" object
without links or either/or logic
and without any special names,
while likely moving to the more flexible [{"name": "base"}] format
and saving coordinates for a follow-on proposal.

(whew) But opinions, as always, are very welcome.

Further CCs: @saalfeldlab @axtimwalde @tpietzsch

d-v-b commented 4 years ago

My first question then would be: to what extent can the current multiscale proposal be of value without the spatial/scale/transform information?

I think there's value in the current effort, insofar as standardizing spatial metadata is a separable issue.

For a multiscale image spec, I would propose abstracting over the specific implementation of spatial metadata, e.g. by stipulating that the group multiscale attribute must contain the same spatial metadata as the collection of array attributes. This assumes as little as possible about the details of the spatial metadata (but a key assumption I'm making is that duplicating this metadata is not prohibitive).

For the record, I'd currently err on the side of:

sticking with a flat "multiscale" object

without links or either/or logic

and without any special names,

while likely moving to the more flexible [{"name": "base"}] format

and saving coordinates for a follow-on proposal.

These all look good to me!

thewtex commented 4 years ago

@joshmoore outstanding summary! Thanks for leading this endeavor.

My first question then would be: to what extent can the current multiscale proposal be of value without the spatial/scale/transform information?

To correctly analyze or visualize the data as a multiscale image pyramid, some spatial/scale/transform information is required.

To:

Compare with image subregions
Handle anisotropically sampled volumes
Compare with segmentations stored as meshes whose node positions are in "world space" or were generated from a derived volume sampled on a different pixel sampling grid.
Use model-based annotations defined in "world space"
Effectively utilize image registration

Spacing / scale and offset / origin and/or transforms are required. Without them, these use cases are either complex and error prone (requiring provenance and computation related to source pixel grids), or not possible. This is why the majority of scientific imaging file formats have at least spacing / scale and offset / origin in some form.

That said, the specs could still be split into two to keep things moving along.

rabernat commented 4 years ago

Thanks so much to everyone who is putting detailed thought into this complex issue. Since the discussion has mostly focused on the bioimaging side of things, I'll try to add the xarray & geospatial perspective.

The main precedent for "multiscale arrays" in geospatial comes from the GeoTIFF / COG format. In geospatial lingo, they are called "overviews". GDAL has good documentation on this.
There has been some discussion in xarray about supporting overviews (see https://github.com/pydata/xarray/issues/3269), but it is not currently part of our data model, which is derived from the common data model and tied closely to netCDF.
However, xarray does have a very convenient utility for generating overviews: the coarsen method.
For climate model data, generating overviews is not trivial because the cell geometry can be non-Euclidean. You need to know an area weighting factor to apply when coarse graining. It's not clear to me from the discussion above whether zarr needs to know how to actually generate these overviews, or if that is up to a third-party library.
Nevertheless, given the proliferation of high resolution weather and climate models, the ability to store overviews would be quite valuable, particularly for interactive visualization. For broader adoption, this concept would need to make its way into the NetCDF standard itself.
The bigger cells / pixels get, the more important coordinates and cell bounds become. It seems like this conversation is closely tied to the question of how to represent coordinates in zarr. As noted by @thewtex, we have already established some de-facto standards about how to do this in order to plug zarr into xarray. So these discussions need to happen in parallel.
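To illustrate the area-weighting point, here is a minimal sketch of coarsening a 2D grid with area-weighted means. It is plain Python rather than xarray's `coarsen`, and the function name and the separate `areas` grid are assumptions made for the example, not part of any proposal in this thread.

```python
def coarsen_area_weighted(values, areas, factor):
    """Coarsen a 2D grid by `factor` along both axes using area-weighted means.

    values, areas: equally shaped lists of rows; the shape must be divisible
                   by `factor`. `areas` holds each cell's area (the weight).
    Returns (coarse_values, coarse_areas).
    """
    rows, cols = len(values), len(values[0])
    if rows % factor or cols % factor:
        raise ValueError("grid shape must be divisible by factor")
    coarse_values, coarse_areas = [], []
    for r0 in range(0, rows, factor):
        value_row, area_row = [], []
        for c0 in range(0, cols, factor):
            total_area = 0.0
            weighted_sum = 0.0
            # Accumulate over the factor x factor block of fine cells.
            for r in range(r0, r0 + factor):
                for c in range(c0, c0 + factor):
                    total_area += areas[r][c]
                    weighted_sum += values[r][c] * areas[r][c]
            value_row.append(weighted_sum / total_area)
            area_row.append(total_area)
        coarse_values.append(value_row)
        coarse_areas.append(area_row)
    return coarse_values, coarse_areas
```

With uniform areas this reduces to a plain block mean; with non-uniform areas, larger cells contribute proportionally more, which is the behavior a plain mean would get wrong.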

hanslovsky commented 4 years ago

Great discussion. These are my $0.02. Largely, I agree with @joshmoore's summary in https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-596687408. Being able to open each scale level as an individual data set and not part of a pyramid is probably the most important feature and should be part of any standard that comes out of this. With this in mind, the spatial meta data (gridSpacing and origin) would need to be stored in the attributes of the individual datasets. This means either

duplication of the spatial meta data, or
no spatial meta data of the individual datasets in the multiscale group.

This also does not consider other spatial meta data like rotations. As far as I know, this is a relevant use case for @tpietzsch. If such (arbitrary) transforms should not be considered in the standard, then the question arises of how to combine this with the gridSpacing and origin. In such a scenario, I would probably set the origin to zero with appropriate shifts in downscaled levels as needed, and have the actual offset after the rotation in a global transform. But then again, each scale dataset could not be loaded individually with the correct scaling, rotation, and offset, without explicit knowledge of the pyramid.
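As a sketch of the "appropriate shifts in downscaled levels", assume each level is produced by averaging blocks of `f` pixels per axis and that coordinates refer to pixel centers. Under those assumptions (which the thread does not fix), the spacing grows by `f` and the origin moves by `(f - 1) / 2` source pixels. The function name is hypothetical.

```python
def downscaled_geometry(grid_spacing, origin, factors):
    """Spacing and origin of a level obtained by block-averaging.

    Assumes pixel-center coordinates and integer downsampling `factors`
    per axis; other conventions (e.g. pixel corners) give different shifts.
    """
    new_spacing = [s * f for s, f in zip(grid_spacing, factors)]
    # The center of a block of f pixels lies (f - 1) / 2 pixels past the
    # center of its first pixel.
    new_origin = [o + s * (f - 1) / 2
                  for o, s, f in zip(origin, grid_spacing, factors)]
    return new_spacing, new_origin
```

For example, downsampling by 2 in X and Y but not Z, starting from spacing `[1.0, 1.0, 2.0]` and a zero origin, yields spacing `[2.0, 2.0, 2.0]` and origin `[0.5, 0.5, 0.0]`.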

Other than that, here are a few comments:

I think the bare minimum spatial information is going to be the gridSpacing and origin for each scale level. I do not have a strong opinion about nomenclature. In Paintera, it is resolution and offset, but I am ok with anything reasonable.
If scales are defined, they should be fully specified for all of the spatial dimensions, i.e. for 3D or 3D+channel, it would be [[sx, sy, sz], ...]. I like having the scales attribute but the scales can be inferred from gridSpacing, so it is redundant information.
I prefer the format that @d-v-b proposed that stores an array of dictionaries for the datasets, e.g.
```
[{"name": "s0", "meta1": ...}, {"name": "s1", "meta1": ...}]
```
over storing multiple arrays like
```
{"datasets": ["s0", "s1", ...], "meta1": [...]}
```
I do like the idea of having multiple multi-scale groups within a group and specifying scale levels at arbitrary paths (relative to the group). I had not thought of that before but it sounds very intriguing. One caveat here is that it may get out of control and result in very chaotic dataset hierarchies but that would be the responsibility of the user. I am not aware of any good restriction, yet. Considering this extension, maybe using "path" as a key instead of "name" in @d-v-b's proposal may be more descriptive and appropriate.
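The remark that scales can be inferred from gridSpacing can be sketched as follows, using the array-of-dictionaries layout with a `path` key. The `gridSpacing` key name and the choice of the first entry as the full-resolution reference are assumptions for illustration.

```python
def infer_scales(datasets):
    """Per-level scale factors relative to the first (full-resolution) entry.

    datasets: list of dicts, each with a "path" and a "gridSpacing" list
              covering all spatial dimensions (assumed key names).
    """
    base = datasets[0]["gridSpacing"]
    return [
        [spacing / reference for spacing, reference in zip(d["gridSpacing"], base)]
        for d in datasets
    ]


datasets = [
    {"path": "s0", "gridSpacing": [4, 4, 40]},
    {"path": "s1", "gridSpacing": [8, 8, 40]},
    {"path": "s2", "gridSpacing": [16, 16, 80]},
]
scales = infer_scales(datasets)
```

Because the factors fall out of the per-dataset spacing, a separate `scales` attribute would only duplicate information that each dataset already carries.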

I think that a common standard would be a great thing to have and help interaction between the wealth of tools that we are looking at. Paintera does not have a great standard and should update its format if a reasonable standard comes out of this (while maintaining backwards compatibility).

Disclaimer: I will start a position outside academia soon and will not be involved in developing tools in this realm after that. My comment should be regarded as food for thought and to raise concerns that may not have been considered yet. Ultimately, I will not be involved in the decision making of any specifics of this standard.

cc @igorpisarev

joshmoore commented 4 years ago

Apologies, all, for letting this slip into April. Hopefully everyone's managing these times well enough despite the burden of long spec threads.

I've updated the description to include the new {"name": ...} syntax and added a new deadline of April 15th for further responses.

A few points on the more recent comments:

In https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-599782137, @hanslovsky suggested "path" rather than "name". I'm on board and will make the change if there are no vetoes, but in the documentation for the metadata (when it appears) we will need to specify whether or not super- and sub-paths are allowed (i.e. ".." and "/").
Then the general topic of the spatial metadata. Both https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-596719361 and https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-596701736 give a :thumbsup: to splitting it out into a separate proposal. It sounds like https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-599782137 is proposing to have origin and gridSpacing (and not scale) on the datasets themselves rather than the group. If there were agreement on that, I'd omit scale from this proposal and hold off for the next. @d-v-b may be the main opponent of that where in https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-599782137 there's a clear call for duplicating the metadata when/if possible. My major concern with duplication would be keeping the two representations consistent.
As an aside on the geospatial front, https://gis.stackexchange.com/a/255847 helped me understand the GeoTIFF overviews. I don't see anything contradictory.
@rabernat brings up in https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-598456145 non-Euclidean geometries which were also discussed in a recent Zarr call. I'm all for saving that for the follow-up discussion, since it's likely going to be a big one. I'd tend to err on the side of having that external, though perhaps storing (non-standardized?) provenance metadata if possible.
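On the open question of super- and sub-paths: whichever way the documentation decides, an implementation will need to tell the cases apart. A minimal sketch, assuming POSIX-style separators in `path` values; the category names and the function are hypothetical and not part of the proposal.

```python
import posixpath


def classify_path(path):
    """Describe how a dataset path relates to the multiscale group.

    Returns "child" for a direct child such as "0", "sub-path" for a nested
    path such as "a/b", "super-path" for anything that leaves the group
    (an absolute path, or a leading ".." after normalization), and "self"
    if the path resolves to the group itself.
    """
    if path.startswith("/"):
        return "super-path"
    normalized = posixpath.normpath(path)
    if normalized == ".." or normalized.startswith("../"):
        return "super-path"
    if normalized == ".":
        return "self"
    if "/" in normalized:
        return "sub-path"
    return "child"
```

A strict reader could accept only `"child"`, while a permissive one could also accept `"sub-path"`; normalizing first means `"a/../../b"` is caught as leaving the group even though it does not start with `".."`.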

Otherwise, it sounds like the newer comments are generally on board with the current proposal, but let me know if I've dropped anyone's concerns.

d-v-b commented 4 years ago

I like path much more than name. +1 to that.

My major concern with duplication would be keeping the two representations consistent.

This is a valid concern. Personally I don't like duplicating spatial metadata in the group -- my original conception a long time ago was for the group multiscale metadata to simply list the names/paths to the datasets that comprise the pyramid, with no additional information. But I was reminded by @axtimwalde that accessing metadata from multiple files on cloud stores can be bothersome, and this led to the idea of consolidating the array metadata at the group level. Maybe this can be addressed via the consolidated metadata functionality that has already been added to zarr: https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata.

For a spec, a way to resolve this could be to specify that, for each dataset entry in the group multiscale metadata, a path field is required but additional fields per dataset are optional. In this regime, programs that attempt to parse the multiscale group may look for consolidated metadata in the group attributes, but they should have a fallback routine that involves parsing the individual attributes of the datasets.
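A reader following this regime might look roughly like the sketch below. The `multiscales`, `datasets` and `path` keys come from the issue description; the spatial keys `gridSpacing` and `origin`, the function name, and the `load_array_attrs` callback (standing in for however a library reads one array's attributes) are assumptions for illustration.

```python
def resolve_dataset_metadata(group_attrs, load_array_attrs,
                             keys=("gridSpacing", "origin")):
    """Collect per-dataset metadata, preferring copies stored in the group.

    group_attrs:      the group's attributes as a dict
    load_array_attrs: callable(path) -> dict of that array's attributes;
                      only called when the group entry lacks some key
    """
    resolved = []
    for multiscale in group_attrs.get("multiscales", []):
        for entry in multiscale["datasets"]:
            if "path" not in entry:
                raise ValueError("every dataset entry requires a 'path'")
            path = entry["path"]
            metadata = {key: entry[key] for key in keys if key in entry}
            missing = [key for key in keys if key not in metadata]
            if missing:
                # Fallback: one extra read of the array's own attributes.
                own = load_array_attrs(path)
                for key in missing:
                    if key in own:
                        metadata[key] = own[key]
            resolved.append({"path": path, **metadata})
    return resolved
```

When every entry carries the optional fields, no per-array reads happen at all, which is the cloud-latency benefit being discussed; when they are absent, the cost is one read per dataset.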

axtimwalde commented 4 years ago

What would we do if cloud storage didn't have high latency? I am similarly worried about the consolidated meta-data hack because we may store a lot of meta-data, and parsing very long JSON texts isn't particularly fast either; it also doesn't scale very well.

joshmoore commented 4 years ago

NB: Updated description to use "path".

https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-607947712

I had never considered a level of consolidation between none and everything, e.g. all arrays (but not groups) within a group are cached within the group metadata. It's an interesting idea, but discussing it here seems dangerous.

If we assume that consolidation is out-of-scope for this issue, I think the only question remaining is if we want optional spatial metadata at the group level, where the array metadata would take precedence. Here, I'd likely also vote for being conservative and not doing that at this point, though we could add it in the future (more easily than we could remove it).
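For reference, the precedence rule under discussion (array metadata wins over an optional group-level copy) could be sketched as below. This only illustrates the option being weighed, not something the proposal adopts; the key names and the function are assumptions.

```python
def effective_spatial_metadata(group_entry, array_attrs,
                               keys=("gridSpacing", "origin")):
    """Merge spatial metadata so that the array's own values take precedence.

    group_entry: the dataset's entry in the group metadata (may lack keys)
    array_attrs: the array's own attributes (authoritative when present)
    """
    merged = {}
    for key in keys:
        if key in array_attrs:
            merged[key] = array_attrs[key]
        elif key in group_entry:
            # Group-level value is used only when the array has none.
            merged[key] = group_entry[key]
    return merged
```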

If all agree, I'll add hopefully one last update to remove all mention of "scale" and then start collecting all the spatial ideas that we've tabled in this issue into a new one.

joshmoore commented 4 years ago

Description now updated removing use of "scale" and clarifying a few items like the ordering of the datasets which have come up recently during conversations on image.sc, twitter, etc. Thanks again to everyone for the feedback.

joshmoore commented 4 years ago

This issue has been migrated to image.sc after the 2020-05-06 community discussion and will be closed. Authors are still encouraged to make use of the specification in their own libraries. As the v3 extension mechanism matures, the specification will be updated and registered as appropriate. Many thanks to everyone who has participated to date. Further feedback and request changes are welcome either on this repository or on image.sc.