zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Extension proposal: multiscale arrays v0.1 #50

Closed joshmoore closed 4 years ago

joshmoore commented 4 years ago

This issue has been migrated to an image.sc topic after the 2020-05-06 community discussion. Authors are still encouraged to make use of the specification in their own libraries. As the v3 extension mechanism matures, the specification will be updated and registered as appropriate. Feedback and change requests are welcome either on this repository or on image.sc.


As a first draft of support for the multiscale use-case (https://github.com/zarr-developers/zarr-specs/issues/23), this issue proposes an intermediate nomenclature for describing groups of Zarr arrays which are scaled-down versions of one another, e.g.:

example/
├── 0    # Full-sized array
├── 1    # Scaled down 0, e.g. 0.5; for images, in the X&Y dimensions
├── 2    # Scaled down 1, ...
├── 3    # Scaled down 2, ...
└── 4    # Etc.

This layout was independently developed in a number of implementations and has since been implemented in others, including:

Using a common metadata representation across implementations:

  1. fosters a common vocabulary between existing implementations
  2. enables other implementations to reliably detect multiscale arrays
  3. permits the upgrade of v0.1 arrays to future versions of this or other extensions
  4. tests this extension for limitations against multiple use cases

A basic example of the metadata that is added to the containing Zarr group is seen here:

{
  "multiscales": [
    {
      "datasets": [
          {"path": "0"},
          {"path": "1"},
          {"path": "2"},
          {"path": "3"},
          {"path": "4"}
        ],
      "version": "0.1"
    }
     // See the detailed example below for optional metadata
  ]
}
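
As a rough sketch of how a reader might detect such a group, the snippet below parses attributes of the form shown above and returns the dataset paths of each series. The helper name `multiscale_paths` is hypothetical, and only the standard `json` module is used; a real implementation would read the group attributes through a Zarr library instead.

```python
import json

# Attributes as they might appear in the group's .zattrs file
# (same content as the basic example, without the comment line).
ZATTRS = """
{
  "multiscales": [
    {
      "datasets": [
        {"path": "0"},
        {"path": "1"},
        {"path": "2"},
        {"path": "3"},
        {"path": "4"}
      ],
      "version": "0.1"
    }
  ]
}
"""


def multiscale_paths(attrs):
    """Return one list of dataset paths per multiscale series.

    Hypothetical helper: an empty result means the group was not
    recognised as multiscale.
    """
    series = attrs.get("multiscales", [])
    return [[d["path"] for d in s.get("datasets", [])] for s in series]


paths = multiscale_paths(json.loads(ZATTRS))
print(paths)  # [['0', '1', '2', '3', '4']]
```

A group without the "multiscales" key simply yields an empty list, so callers can fall back to treating its arrays as unrelated.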

Process

An RFC process for Zarr does not yet exist. Additionally, the v3 spec is a work-in-progress. However, since the implementations listed above as well as others are already being developed, I'd propose that if a consensus can be reached here, this issue should be turned into an .rst file similar to those in the v3 branches (e.g. filters) and used as a temporary spec for defining arrays with the understanding that this is a prototype intended to be amended and brought into the general extension mechanism as it develops.

I'd welcome any suggestions/feedback, but especially around:

Deadline for a first round of comments: March 15, 2020
Deadline for a second round of comments: April 15, 2020

Detailed example

Color key (according to https://www.ietf.org/rfc/rfc2119.txt):

- MUST     : If these values are not present, the multiscale series will not be detected.
! SHOULD   : Missing values may cause issues in future versions.
+ MAY      : Optional values which can be readily omitted.
# UNPARSED : When updating between versions, no transformation will be performed on these values.

Color-coded example:

-{
-  "multiscales": [
-    {
!      "version": "0.1",
!      "name": "example",
-      "datasets": [
-        {"path": "0"},
-        {"path": "1"},
-        {"path": "2"}
-      ],
!      "type": "gaussian",
!      "metadata": {
+        "method":
#          "skimage.transform.pyramid_gaussian",
+        "version":
#          "0.16.1",
+        "args":
#          [true],
+        "kwargs":
#          {"multichannel": true}
!      }
-    }
-  ]
-}
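
To make the color key concrete, here is a minimal sketch of a checker that treats missing MUST values as errors and missing SHOULD values as warnings. The function name `check_multiscales` and the message wording are assumptions for illustration; the key lists are read off the color-coded example above.

```python
# Keys marked "!" (SHOULD) on each series entry in the color-coded example.
SHOULD_KEYS = ("version", "name", "type", "metadata")


def check_multiscales(attrs):
    """Return (errors, warnings) for a group's attribute dictionary.

    Hypothetical helper: errors correspond to missing MUST values,
    warnings to missing SHOULD values.
    """
    errors, warnings = [], []
    series = attrs.get("multiscales")
    if not isinstance(series, list):
        return ["missing 'multiscales' list"], warnings
    for i, entry in enumerate(series):
        datasets = entry.get("datasets")
        if not isinstance(datasets, list):
            errors.append("series %d: missing 'datasets' list" % i)
        else:
            for j, dataset in enumerate(datasets):
                if "path" not in dataset:
                    errors.append("series %d, dataset %d: missing 'path'" % (i, j))
        for key in SHOULD_KEYS:
            if key not in entry:
                warnings.append("series %d: missing '%s'" % (i, key))
    return errors, warnings


example = {
    "multiscales": [
        {"version": "0.1", "datasets": [{"path": "0"}, {"path": "1"}]}
    ]
}
errors, warnings = check_multiscales(example)
print(errors)    # []
print(warnings)  # name, type and metadata are reported as missing
```

MAY and UNPARSED values are deliberately ignored here, since they can be omitted or are passed through untouched.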

Explanation

Type enumeration:

Sample code

#!/usr/bin/env python
import argparse
import zarr
import numpy as np
from skimage import data
from skimage.transform import pyramid_gaussian, pyramid_laplacian

parser = argparse.ArgumentParser()
parser.add_argument("zarr_directory")
ns = parser.parse_args()

# 1. Setup of data and Zarr directory
base = np.tile(data.astronaut(), (2, 2, 1))

gaussian = list(
    pyramid_gaussian(base, downscale=2, max_layer=4, multichannel=True)
)

laplacian = list(
    pyramid_laplacian(base, downscale=2, max_layer=4, multichannel=True)
)

store = zarr.DirectoryStore(ns.zarr_directory)
grp = zarr.group(store)
grp.create_dataset("base", data=base)

# 2. Generate datasets
series_G = []
for g, dataset in enumerate(gaussian):
    if g == 0:
        path = "base"
    else:
        path = "G%s" % g
        grp.create_dataset(path, data=gaussian[g])
    series_G.append({"path": path})

series_L = []
for l, dataset in enumerate(laplacian):
    if l == 0:
        path = "base"
    else:
        path = "L%s" % l
        grp.create_dataset(path, data=laplacian[l])
    series_L.append({"path": path})

# 3. Generate metadata block
multiscales = []
for name, series in (("gaussian", series_G),
                     ("laplacian", series_L)):
    multiscale = {
      "version": "0.1",
      "name": name,
      "datasets": series,
      "type": name,
    }
    multiscales.append(multiscale)
grp.attrs["multiscales"] = multiscales

which results in a .zattrs file of the form:

{
    "multiscales": [
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "G1"
                },
                {
                    "path": "G2"
                },
                {
                    "path": "G3"
                },
                {
                    "path": "G4"
                }
            ],
            "name": "gaussian",
            "type": "gaussian",
            "version": "0.1"
        },
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "L1"
                },
                {
                    "path": "L2"
                },
                {
                    "path": "L3"
                },
                {
                    "path": "L4"
                }
            ],
            "name": "laplacian",
            "type": "laplacian",
            "version": "0.1"
        }
    ]
}

and the following on-disk layout:

/var/folders/z5/txc_jj6x5l5cm81r56ck1n9c0000gn/T/tmp77n1ga3r.zarr
├── G1
│   ├── 0.0.0
...
│   └── 3.1.1
├── G2
│   ├── 0.0.0
│   ├── 0.1.0
│   ├── 1.0.0
│   └── 1.1.0
├── G3
│   ├── 0.0.0
│   └── 1.0.0
├── G4
│   └── 0.0.0
├── L1
│   ├── 0.0.0
...
│   └── 3.1.1
├── L2
│   ├── 0.0.0
│   ├── 0.1.0
│   ├── 1.0.0
│   └── 1.1.0
├── L3
│   ├── 0.0.0
│   └── 1.0.0
├── L4
│   └── 0.0.0
└── base
    ├── 0.0.0
...
    └── 1.1.1

9 directories, 54 files

| Revision | Source | Date | Description |
|---|---|---|---|
| 6 | External feedback on twitter and image.sc | 2020-05-06 | Remove "scale"; clarify ordering and naming |
| 5 | External bug report from @mtbc | 2020-04-21 | Fixed error in the simple example |
| 4 | https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-599782137 | 2020-04-08 | Changed "name" to "path" |
| 3 | Discussions up through https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-599782137 | 2020-04-01 | Updated naming schema |
| 2 | https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595505162 | 2020-03-07 | Fixed typo |
| 1 | @joshmoore | 2020-03-06 | Original text from in person discussions |

Thanks to @ryan-williams, @jakirkham, @freeman-lab, @petebankhead, @jni, @sofroniewn, @chris-allan, and anyone else whose GitHub account I've forgotten for the preliminary discussions.

manzt commented 4 years ago

I'm really happy to see this. We initially used a layout for storing pyramids similar to the one proposed above, and it's fantastic to see this formalized.

I'm curious about the decision to store the base array in the same group as the downsampled levels. We initially did the same, but then moved towards a structure separating the two:

└── example/
    ├── .zgroup
    ├── base
    │   ├── .zarray
    │   ├── .zattrs
    │   ├── 0.0.0
    │   └── ...etc
    └── sub-resolutions/
        ├── .zgroup
        ├── .zattrs
        ├── 01/
        │   ├── .zarray
        │   ├── 0.0.0
        │   └── ...etc
        └── 02/
            ├── .zarray
            ├── 0.0.0
            └── ...etc

as a more general "image" format in zarr. One could expect to find a "base" array and then check for the "sub-resolutions" group to determine if it is a pyramid or not. We thought this structure would allow for other types of data (e.g. segmentation) to be stored alongside the base array. Again, thanks for the work here in formalizing this!

joshmoore commented 4 years ago

Thanks, @manzt. Let's see if there are more votes for the deeper representation. It's certainly also what I was originally thinking about in #23. The downside is that one likely needs metadata on all the datasets pointing up and down the hierarchy in order to support detection of the sequence from any scale. It's the other major design layout I can think of. (If anyone has more, those would be very welcome.)

sofroniewn commented 4 years ago

@joshmoore amazing to see this kick off. A couple short comments

d-v-b commented 4 years ago

Glad to see the discussion here. Some thoughts:

Here's example metadata that implements this concept. The specifics of the "transform attributes" don't really matter -- this could be an affine transform, or something fancier. But I think the basic idea of putting the spatial information of each dataset in the group attributes is solid.

// group attributes
{
  "multiscale": {
    "version": "0.1",
    "datasets": {
      "0": {transform attributes of 0},
      "1": {transform attributes of 1},
      "2": {transform attributes of 2},
      "3": {transform attributes of 3},
      "4": {transform attributes of 4}
    }
    // optional stuff
  }
}

// example transform attributes of dataset 0

"transform" : {
    "offset" : {"X" : 0, "Y" : 0, "Z" : 0},
    "scale" : {"X" : 1, "Y" : 1, "Z" : 1},
    "units" : {"X" : "nm", "Y" : "nm", "Z" : "nm"}
} 

// example spatial attributes of dataset 1

"transform" : {
    "offset" : {"X" : 0.5, "Y" : 0.5, "Z" : 0},
    "scale" : {"X" : 2, "Y" : 2, "Z" : 1},
    "units" : {"X" : "nm", "Y" : "nm", "Z" : "nm"}
} 
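
To illustrate where numbers like these could come from, here is a small sketch that derives the transform of a downscaled dataset from the base transform. It assumes downsampling by averaging non-overlapping blocks of `factors[axis]` pixels, so that the center of the first coarse pixel moves by `(factor - 1) / 2` base pixels; the function name `downscaled_transform` is hypothetical, and other downsampling schemes would give different offsets.

```python
def downscaled_transform(base, factors):
    """Transform of an array downsampled from `base` by integer `factors` per axis.

    Assumes block averaging: the scale is multiplied by the factor and the
    offset moves to the center of the first block of base pixels.
    """
    scale = {ax: base["scale"][ax] * factors[ax] for ax in base["scale"]}
    offset = {
        ax: base["offset"][ax] + (factors[ax] - 1) / 2 * base["scale"][ax]
        for ax in base["offset"]
    }
    return {"offset": offset, "scale": scale, "units": dict(base["units"])}


level0 = {
    "offset": {"X": 0, "Y": 0, "Z": 0},
    "scale": {"X": 1, "Y": 1, "Z": 1},
    "units": {"X": "nm", "Y": "nm", "Z": "nm"},
}

# Downsample by 2 in X and Y only, as in the example for dataset 1.
level1 = downscaled_transform(level0, {"X": 2, "Y": 2, "Z": 1})
print(level1["offset"])  # {'X': 0.5, 'Y': 0.5, 'Z': 0.0}
print(level1["scale"])   # {'X': 2, 'Y': 2, 'Z': 1}
```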

For posterity, I've written about this issue (as it pertains to the data our group works with) here

manzt commented 4 years ago

@sofroniewn The concept of having base + subresolutions like @manzt proposes is intriguing to me too. Ultimately for visualization purposes I want something like a single list of arrays so I guess I find that representation a little simpler, but I can construct that from the latter representation if I know the data is multiscale and maybe it is nice to keep that a little separate. I will think on it more, curious what others say.

I generally have the same feelings. I'm for the simplicity of the current proposal, and I wonder if my suggestion adds an extra layer of complexity unnecessarily.

@d-v-b For simplicity, I would propose a restriction of one multiscale representation per group. Groups are cheap; if you want to represent 2 multiscale images, then make 2 groups.

Wouldn't this require copying the base image into a separate group? Perhaps I'm misunderstanding.

d-v-b commented 4 years ago

Wouldn't this require copying the base image into a separate group? Perhaps I'm misunderstanding.

The base image would be in the same group with the downscaled versions. So on the file system, it would look like this:

└── example/
    ├── .zgroup
    ├── base
    │   ├── .zarray
    │   ├── .zattrs
    │   ├── 0.0.0
    │   └── ...etc
    ├── base_downscaled
    │   ├── .zarray
    │   ├── .zattrs
    │   ├── 0.0.0
    │   └── ...etc
    ...etc

manzt commented 4 years ago

Apologies, I thought you were suggesting that separate groups should be created for different sampling of the same base image (e.g. gaussian and laplacian).

d-v-b commented 4 years ago

@manzt this is actually my mistake -- I was not thinking at all about the use case where the same base image is used for multiple pyramids, and I agree that copying data is not ideal. I will remove / amend the "one multiscale representation per group" part of my proposal above.

thewtex commented 4 years ago

I would add some dataset-specific information to the group attributes: software that consumes multiscale images needs to know about the spatial properties of each image, and on cloud storage it can be cumbersome to query each image individually;

Adding to the practical importance here: the spatial position of the first pixel is shifted in subresolutions, and the physical spacing between pixels changes also. This must be accounted for during visualization or analysis when other datasets, e.g. other images or segmentations, come into play. If this metadata is readily and independently available for every subresolution, i.e. scale factors do not need to be fetched and computations made, each subresolution image can be used independently, effortlessly, and without computational overhead.
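
As a minimal sketch of why this matters, the snippet below maps one physical position to a pixel index at two resolutions, using only the per-image position of the first pixel and the pixel spacing. The names `physical_to_index`, `origin` and `spacing` and the numeric values are assumptions for illustration, not part of any proposed metadata.

```python
def physical_to_index(position, origin, spacing):
    """Nearest pixel index for a physical position, axis by axis."""
    return tuple(
        int(round((p - o) / s)) for p, o, s in zip(position, origin, spacing)
    )


# Hypothetical metadata for a full-resolution image and a 2x subresolution.
full = {"origin": (0.0, 0.0), "spacing": (1.0, 1.0)}
half = {"origin": (0.5, 0.5), "spacing": (2.0, 2.0)}

point = (12.6, 20.4)  # physical position, e.g. taken from a segmentation
print(physical_to_index(point, **full))  # (13, 20)
print(physical_to_index(point, **half))  # (6, 10)
```

With the origin and spacing stored per subresolution, neither lookup needs to know anything about the other levels of the pyramid.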

One option is to build on the model implied by storing images in the Xarray project data structures, which has Zarr support. This enables storing metadata such as the position of the first pixel, the spacing between pixels, and identification of the array dimensions, e.g., x, y, t, so that data can be used and passed through processing pipelines and visualization tools. This is helpful because it enables distributed computing via Dask and machine learning [2] via the scikit-learn API. Xarray has broad community adoption, and it is gaining more traction lately. Of course, a model that is compatible with Xarray does not require Xarray to use the data. On the other hand, Xarray coords have more flexibility than what is required for pixels sampled on a uniform rectilinear grid, and this adds a little complexity to the layout.

Generated from this example, here is what it looks like:

.
├── level_1.zarr
│   ├── rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6
│   │   ├── 0.0.0
│   │   ├── 0.0.1
....
│   │   ├── 9.9.9
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── x
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── y
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── z
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── .zattrs
│   ├── .zgroup
│   └── .zmetadata
├── level_2.zarr
│   ├── rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6
│   │   ├── 0.0.0
│   │   ├── 0.0.1

│   │   ├── 8.9.9
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── x
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── y
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── z
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── .zattrs
│   ├── .zgroup
│   └── .zmetadata
....
├── rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6
│   ├── 0.0.0
│   ├── 0.0.1
...
│   ├── 9.9.9
│   ├── .zarray
│   └── .zattrs
├── x
│   ├── 0
│   ├── .zarray
│   └── .zattrs
├── y
│   ├── 0
│   ├── .zarray
│   └── .zattrs
├── z
│   ├── 0
│   ├── .zarray
│   └── .zattrs
├── .zattrs
├── .zgroup
└── .zmetadata

34 directories, 62359 files

This is the layout generated by xarray.Dataset.to_zarr. It does not mean that Xarray has to be used to read and write, but it would mean that Zarr images would be extremely easy to use via Xarray. In this case, .zmetadata is generated on each subresolution so it can be used entirely independently. Due to how Xarray/Zarr handles coords, x, y, and z are one-dimensional arrays. This results in every resolution having its own group.

The metadata looks like this:

{
    "metadata": {
        ".zattrs": {
            "_MULTISCALE_LEVELS": [
                "",
                "level_1.zarr",
                "level_2.zarr",
                "level_3.zarr",
                "level_4.zarr",
                "level_5.zarr",
                "level_6.zarr"
            ],
            "_SPATIAL_IMAGE": "rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6"
        },
        ".zgroup": {
            "zarr_format": 2
        },
        "level_1.zarr/.zattrs": {},
        "level_1.zarr/.zgroup": {
            "zarr_format": 2
        },
        "level_1.zarr/rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zarray": {
            "chunks": [64, 64, 64],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 0
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [1080, 1280, 1280],
            "zarr_format": 2
        },
        "level_1.zarr/rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zattrs": {
            "_ARRAY_DIMENSIONS": ["z", "y", "x"],
            "direction": [
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]
            ],
            "units": "\u03bcm"
        },
        "level_1.zarr/x/.zarray": {
            "chunks": [1280],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "
...

Here _MULTISCALE_LEVELS avoids the need to hardcode the identifiers as suggested by @d-v-b and @manzt, but it could be renamed to multiscale, etc. _ARRAY_DIMENSIONS is the key that Xarray uses in Zarr files to identify the dims.

This example is generated with itk, but it could also just as easily be generated with scikit-image, or dask-image via [1] (work in progress) or pyimagej.

sofroniewn commented 4 years ago

Thanks for the link to that example @thewtex! Conforming with xarray.Dataset.to_zarr where possible seems reasonable to me too.

@constantinpape, @bogovicj, @axtimwalde might also be interested in weighing in.

jni commented 4 years ago

👍 to flat vs hierarchical representation. Also 👍 to "multiscale".

I also like the constraint that the sub-datasets should be openable as zarr arrays by themselves. I think @thewtex's example satisfies this. Having said this, @thewtex, the xarray model looks too complex to me compared to @joshmoore's proposed spec. It would be great if it could be stripped down to its bare essentials. I agree that it's nice to have the pixel start coordinate handy, but it can also be computed after the fact, so it should be optional I think.

Last thing, which may be out of scope, but might not be: for visualisation, it is sometimes convenient to have the same array with different chunk sizes, e.g. orthogonal planes to all axes for a 3D image. I wonder if the same data/metadata layout standard can be used in these situations.

Oh and @joshmoore

anyone else who's GitHub account I've forgotten for the preliminary discussions

whose. Regret pinging me yet? =P

constantinpape commented 4 years ago

Great to see so much discussion on this proposal. I didn't have time to read through all of it yet, will try to catch up on the weekend. Fyi, there is a pyramid storage format for n5 used by BigDataViewer and paintera already and I have used this format for large volume representations as well: https://github.com/bigdataviewer/bigdataviewer-core/blob/master/BDV%20N5%20format.md

forman commented 4 years ago

Great to see this moving on!

In our projects xcube and xcube-viewer image pyramids look like so:

example.levels/
├── 0.zarr    # Full-sized array
├── 1.zarr    # Level-0 X&Y dimensions divided by 2^1
├── 2.zarr    # Level-0 X&Y dimensions divided by 2^2
├── 3.zarr    # Level-0 X&Y dimensions divided by 2^3
└── 4.zarr    # Etc.
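
As a rough sketch of the sizes this naming implies, the following computes the X and Y extent of each level by dividing the level-0 dimensions by 2^n. The helper name `level_shapes` is hypothetical, and rounding up for sizes that are not divisible is an assumption made here, not necessarily what xcube does.

```python
import math


def level_shapes(height, width, num_levels):
    """(height, width) of each level, dividing level 0 by 2**n.

    Rounds up when a dimension is not evenly divisible (an assumption).
    """
    return [
        (math.ceil(height / 2 ** n), math.ceil(width / 2 ** n))
        for n in range(num_levels)
    ]


for n, shape in enumerate(level_shapes(1024, 1024, 5)):
    print("%d.zarr" % n, shape)
# 0.zarr (1024, 1024)
# 1.zarr (512, 512)
# 2.zarr (256, 256)
# 3.zarr (128, 128)
# 4.zarr (64, 64)
```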

As @joshmoore mentioned, also this goes without special metadata, because

(See also the xcube level CLI tool that implements this.)

We are looking forward to adapting our code to any commonly agreed-on Zarr "standard".

joshmoore commented 4 years ago

All-

Here's a quick summary from my side of discussions up to this point. Please send corrections/additions as you see fit. ~Josh

Apparent agreement

Name

The name "multiscale" seems to be generally acceptable (https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595332383, https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595505162)

Multiple series

Support for multiple series per group seems to be generally acceptable (e.g. https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595332383).

Special names

There are a few explicit votes for no special dataset names (e.g. https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595359246), but under "New ideas" there was one mention of group naming schemes.

Less clear

Layout

One primary decision point seems to be whether to use a deep or a flat layout:

Here I'd add that if flat is generally accepted as being the simplest approach for getting started, later revisions can always move to something more sophisticated. However, I'm pretty sure at that point we would want metadata not just at a single group level but either on multiple groups or all related datasets (or both).

Scaling information

Another key issue seems to be the scaling information. There are a range of ways that have been shown:

@sofroniewn even asked if they are useful as they stand (https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595332383).

To be honest, I punted on this issue knowing that it would be harder to find consensus on it. To my mind, this could even be a second, though related, extension proposal. My reasoning for that is that it can also be used to represent the relationship between non-multiscale arrays, along the lines of @jni's "multiple chunk sizes" question below, and in the case of BDV, the relationship between the individual timepoints, etc.

My first question then would be: to what extent can the current multiscale proposal be of value without the spatial/scale/transform information?

New ideas

Explicit "name" key

@d-v-b's New proposed COSEM style from https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595359246 uses this format:

        {"multiscale": [{"name": "base",  ...}, {"name" : "L1", ...}]}

Though this would prevent directly consuming the list (e.g. datasets = multiscale["series"][0]["datasets"]), it might provide a nice balance of extensibility, especially depending on the results of the coordinates/scales/transforms discussion.

Group naming

@forman showed an example from xcube in https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-596449313 in which group names were used rather than metadata to detect levels:

example.levels/

Links

@forman also showed in https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-596449313 one solution for linking: "The level zero, can also be named 0.lnk. In this case it contains the path the original data rather then a copy of the 'pyramidized' original dataset." This would likely need to be a pre-requisite proposal for this one if we were to follow that route. cc: @alimanfoo

Either/or logic

In @d-v-b's COSEM writeup from https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595359246, there is an example of either/or logic, where code would need to check in more than one location for a given piece of metadata:

 -     ├── (required) s1 (optional, unless "scales" is not a group level attribute): {"downsamplingFactors": [a, b, c]})

Multiple chunk sizes

@jni pondered in https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-595505162: "for visualisation, it is sometimes convenient to have the same array with different chunk sizes, e.g. orthogonal planes to all axes for a 3D image. I wonder if the same data/metadata layout standard can be used in these situations."


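For the record, I'd currently err on the side of:

  • sticking with a flat "multiscale" object
  • without links or either/or logic
  • and without any special names,
  • while likely moving to the more flexible [{"name": "base"}] format
  • and saving coordinates for a follow-on proposal.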


Further CCs: @saalfeldlab @axtimwalde @tpietzsch

d-v-b commented 4 years ago

My first question then would be: to what extent can the current multiscale proposal be of value without the spatial/scale/transform information?

I think there's value in the current effort, insofar as standardizing spatial metadata is a separable issue.

For a multiscale image spec, I would propose abstracting over the specific implementation of spatial metadata, e.g. by stipulating that the group multiscale attribute must contain the same spatial metadata as the collection of array attributes. This assumes as little as possible about the details of the spatial metadata (but a key assumption I'm making is that duplicating this metadata is not prohibitive).

For the record, I'd currently err on the side of:

  • sticking with a flat "multiscale" object
  • without links or either/or logic
  • and without any special names,
  • while likely moving to the more flexible [{"name": "base"}] format
  • and saving coordinates for a follow-on proposal.

These all look good to me!

thewtex commented 4 years ago

@joshmoore outstanding summary! Thanks for leading this endeavor.

My first question then would be: to what extent can the current multiscale proposal be of value without the spatial/scale/transform information?

To correctly analyze or visualize the data as a multiscale image pyramid, some spatial/scale/transform information is required.

To:

Spacing / scale and offset / origin and/or transforms are required. Without them, these use cases are either complex and error prone (requiring provenance and computation related to source pixel grids), or not possible. This is why the majority of scientific imaging file formats have at least spacing / scale and offset / origin in some form.

That said, the specs could still be split into two to keep things moving along.

rabernat commented 4 years ago

Thanks so much to everyone who is putting detailed thought into this complex issue. Since the discussion has mostly focused on the bioimaging side of things, I'll try to add the xarray & geospatial perspective.

hanslovsky commented 4 years ago

Great discussion. These are my $0.02. Largely, I agree with @joshmoore's summary in https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-596687408. Being able to open each scale level as an individual data set and not part of a pyramid is probably the most important feature and should be part of any standard that comes out of this. With this in mind, the spatial meta data (gridSpacing and origin) would need to be stored in the attributes of the individual datasets. This means either

This also does not consider other spatial meta data like rotations. As far as I know, this is a relevant use case for @tpietzsch. If such (arbitrary) transforms should not be considered in the standard, then the question arises of how to combine this with the gridSpacing and origin. In such a scenario, I would probably set the origin to zero with appropriate shifts in downscaled levels as needed, and have the actual offset after the rotation in a global transform. But then again, each scale dataset could not be loaded individually with the correct scaling, rotation, and offset, without explicit knowledge of the pyramid.

Other than that, here are a few comments:

I think that a common standard would be a great thing to have and help interaction between the wealth of tools that we are looking at. Paintera does not have a great standard and should update its format if a reasonable standard comes out of this (while maintaining backwards compatibility).

Disclaimer: I will start a position outside academia soon and will not be involved in developing tools in this realm after that. My comment should be regarded as food for thought and to raise concerns that may not have been considered yet. Ultimately, I will not be involved in the decision making of any specifics of this standard.

cc @igorpisarev

joshmoore commented 4 years ago

Apologies, all, for letting this slip into April. Hopefully everyone's managing these times well enough despite the burden of long spec threads.

I've updated the description to include the new {"name": ...} syntax and added a new deadline of April 15th for further responses.

A few points on the more recent comments:

Otherwise, it sounds like the newer comments are generally on board with the current proposal, but let me know if I've dropped anyone's concerns.

d-v-b commented 4 years ago

I like path much more than name. +1 to that.

My major concern with duplication would be keeping the two representations consistent.

This is a valid concern. Personally I don't like duplicating spatial metadata in the group -- my original conception a long time ago was for the group multiscale metadata to simply list the names/paths to the datasets that comprise the pyramid, with no additional information. But I was reminded by @axtimwalde that accessing metadata from multiple files on cloud stores can be bothersome, and this led to the idea of consolidating the array metadata at the group level. Maybe this can be addressed via the consolidated metadata functionality that has already been added to zarr: https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata.

For a spec, a way to resolve this could be to specify that, for each dataset entry in the group multiscale metadata, a path field is required but additional fields per dataset are optional. In this regime, programs that attempt to parse the multiscale group may look for consolidated metadata in the group attributes, but they should have a fallback routine that involves parsing the individual attributes of the datasets.
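
A minimal sketch of that lookup-with-fallback, using plain dictionaries in place of real Zarr attributes. The function name `spatial_metadata`, the `transform` key and the sample values are assumptions for illustration only.

```python
def spatial_metadata(group_attrs, array_attrs_by_path, key="transform"):
    """Collect per-dataset metadata for the first multiscale series.

    Prefers an optional copy stored next to "path" in the group attributes
    and falls back to the attributes of the individual array.
    """
    result = {}
    for dataset in group_attrs["multiscales"][0]["datasets"]:
        path = dataset["path"]  # the only required field per dataset
        if key in dataset:
            # Optional copy consolidated at the group level.
            result[path] = dataset[key]
        else:
            # Fallback: in practice this is a separate read per array.
            result[path] = array_attrs_by_path.get(path, {}).get(key)
    return result


group_attrs = {
    "multiscales": [
        {
            "datasets": [
                {"path": "0", "transform": {"scale": [1, 1]}},
                {"path": "1"},
            ]
        }
    ]
}
array_attrs = {"1": {"transform": {"scale": [2, 2]}}}

print(spatial_metadata(group_attrs, array_attrs))
# {'0': {'scale': [1, 1]}, '1': {'scale': [2, 2]}}
```

Keeping the group-level copy optional means a writer that cannot guarantee consistency can simply omit it, at the cost of one extra read per dataset.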

axtimwalde commented 4 years ago

What would we do if cloud storage didn't have high latency? I am similarly worried about the consolidated meta-data hack because we may store a lot of meta-data, and parsing very long JSON texts isn't particularly fast either; it also doesn't scale very well.

joshmoore commented 4 years ago

NB: Updated description to use "path".

https://github.com/zarr-developers/zarr-specs/issues/50#issuecomment-607947712

I had never considered a level of consolidation between none and everything, e.g. all arrays (but not groups) within a group are cached within the group metadata. It's an interesting idea, but discussing it here seems dangerous.

If we assume that consolidation is out-of-scope for this issue, I think the only question remaining is if we want optional spatial metadata at the group level, where the array metadata would take precedence. Here, I'd likely also vote for being conservative and not doing that at this point, though we could add it in the future (more easily than we could remove it).

If all agree, I'll add hopefully one last update to remove all mention of "scale" and then start collecting all the spatial ideas that we've tabled in this issue into a new one.

joshmoore commented 4 years ago

Description now updated removing use of "scale" and clarifying a few items like the ordering of the datasets which have come up recently during conversations on image.sc, twitter, etc. Thanks again to everyone for the feedback.

joshmoore commented 4 years ago

This issue has been migrated to image.sc after the 2020-05-06 community discussion and will be closed. Authors are still encouraged to make use of the specification in their own libraries. As the v3 extension mechanism matures, the specification will be updated and registered as appropriate. Many thanks to everyone who has participated to date. Further feedback and change requests are welcome either on this repository or on image.sc.