zarr-developers / zeps

Zarr Enhancement Proposals
https://zarr.dev/zeps
Creative Commons Zero v1.0 Universal

Add ZEP 8 (URL syntax) draft #48

Open jbms opened 1 year ago

jbms commented 1 year ago

@normanrz Please take a look.

jbms commented 1 year ago

@MSanKeys963 Looks like there is an issue with the docs build that is unrelated to this PR.

jbms commented 1 year ago

@martindurant Would appreciate your perspective on this --- I imagine you might say that we should just use fsspec syntax instead, though.

martindurant commented 1 year ago

Well indeed, I could say "why invent another"; although translating between | and :: syntax ought to be straightforward. fsspec also cares about fs parameters that might be embedded in URLs and wildcards for globbing.
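As a rough illustration of that translation, here is a minimal sketch. It assumes the ZEP 8 form is `root|scheme:path|...` with the innermost store first, and that fsspec chaining is written `scheme://path::...::root` with the outermost layer first. The function names `pipe_to_fsspec` and `fsspec_to_pipe` are hypothetical, and the sketch ignores the fs parameters and glob wildcards mentioned above.

```python
def pipe_to_fsspec(url: str) -> str:
    """Translate 'root|scheme:path|...' into '...::scheme://path::root'."""
    root, *adapters = url.split("|")
    layers = [root]
    for adapter in adapters:
        scheme, _, path = adapter.partition(":")
        layers.append(f"{scheme}://{path}")
    # Assumption: fsspec lists the outermost layer first, so reverse the order.
    return "::".join(reversed(layers))


def fsspec_to_pipe(url: str) -> str:
    """Translate '...::scheme://path::root' back into 'root|scheme:path|...'."""
    *layers, root = url.split("::")
    parts = [root]
    for layer in reversed(layers):
        scheme, _, path = layer.partition("://")
        parts.append(f"{scheme}:{path}")
    return "|".join(parts)


print(pipe_to_fsspec("gs://bucket/0.zip|zip:a"))
print(fsspec_to_pipe("zip://a::gs://bucket/0.zip"))
```

Adapters that have no fsspec protocol counterpart (for example a format adapter such as zarr3:) would need separate handling, which this sketch does not attempt.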

normanrz commented 1 year ago

While standardizing a URL scheme has benefits on its own, I think the main benefit/motivation for this ZEP is the formalization of Zip stores. Essentially, to comply with this ZEP, implementations need to implement zip stores. Maybe that should be written out more explicitly?

jbms commented 1 year ago

> While standardizing a URL scheme has benefits on its own, I think the main benefit/motivation for this ZEP is the formalization of Zip stores. Essentially, to comply with this ZEP, implementations need to implement zip stores. Maybe that should be written out more explicitly?

While this ZEP was prompted by our discussion about zip stores, my intention was that we standardize on the syntax for various protocols, but that implementations would choose which ones to support.

I think we could also push implementations to support zip format, but I'm not sure I want to tie that to this URL syntax proposal.

normanrz commented 1 year ago

@ap-- I think this might also be interesting for upath to implement.

normanrz commented 1 year ago

@bogovicj this might also be relevant for your OME transformations proposal.

MSanKeys963 commented 1 year ago

> @MSanKeys963 Looks like there is an issue with the docs build that is unrelated to this PR.

@jbms: I have added #51 to fix the RTD build. Can you please update your PR? (Seems like I'm unable to update your PR)

bogovicj commented 11 months ago

Thanks @jbms for putting this together! There are a few situations I came up with for which I'm not sure what the relative URL should be.

What does it look like to use ..: to "go up" multiple levels? Is this correct / valid?

Base URL: gs://bucket/0.zip|zip:a|zarr3:i
Relative URL: ..:..:1.zip|zip:b|zarr3:ii
Resolved URL: gs://bucket/1.zip|zip:b|zarr3:ii

Is it correct / valid to use .. in the "path part" of a relative URL, after a ..:?

Base URL: gs://bucket/0/a/i.zarr|zarr3:foo
Relative URL: ..:../b/i.zarr|zarr3:foo
Resolved URL: gs://bucket/0/b/i.zarr|zarr3:foo

If one needs to add an adapter in a relative way, how does one go about it? For example:

Base URL: gs://bucket/0/a/i.zarr
Desired Resolved URL: gs://bucket/0/a/i.zarr|zarr3:foo

Which, if any, of these do you think should be used? Are any of these invalid?

  • .|zarr3:foo (clearest to me)
  • |zarr3:foo
  • zarr3:foo

bogovicj commented 11 months ago

One more thing:

We've found it useful to be able to reference a particular part of the attributes stored in JSON with a URL. For example, for this zarr3 zarr.json:

```
{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 1000],
    "dimension_names": ["rows", "columns"],
    "data_type": "float64",
    "chunk_grid": {
        "name": "regular",
        "configuration": { "chunk_shape": [1000, 100] }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": { "separator": "/" }
    },
    "codecs": [{ "name": "gzip", "configuration": { "level": 1 } }],
    "fill_value": "NaN",
    "attributes": { "foo": 42, "bar": "apples", "baz": [1, 2, 3, 4] }
}
```

  • /attributes/baz[0] points to 1
  • /shape points to [10000, 1000]
  • /chunk_grid/configuration points to { "chunk_shape": [1000, 100] }

Could you envision adding an attributes: or zarr.json:, or similar adapter, that enables this?

For example: gs://bucket/0.zip|zip:a|zarr3:i|zarr.json:attributes/foo

A specific use case: I often re-use and reference transformations. Since these are described by metadata (not arrays), referencing the specific metadata is helpful.

For example, if this were adopted, something like this would not be uncommon in my workflows:

{
    "type" : "sequence",
    "transformations" : [
        { "url" : "..:/localTransformations|zarr.json:/transform[1]" },
        { "url" : "gs://bucket/path/to/templateTransformation.zarr|zarr3:sharedTransforms|zarr.json:/transform[0]" }
    ]
}

jbms commented 11 months ago

On Tue, Nov 14, 2023, 05:53 John Bogovic wrote:

> Thanks @jbms for putting this together! There are a few situations I came up with for which I'm not sure what the relative URL should be.
>
> What does it look like to use ..: to "go up" multiple levels? Is this correct / valid?
>
> Base URL: gs://bucket/0.zip|zip:a|zarr3:i
> Relative URL: ..:..:1.zip|zip:b|zarr3:ii
> Resolved URL: gs://bucket/1.zip|zip:b|zarr3:ii

I was imagining that the relative url would be:

|..|..:1.zip|zip:b|zarr3:ii

The part after the | is always the scheme, and a scheme of .. is needed to get to the parent store.

> Is it correct / valid to use .. in the "path part" of a relative URL, after a ..:?
>
> Base URL: gs://bucket/0/a/i.zarr|zarr3:foo
> Relative URL: ..:../b/i.zarr|zarr3:foo
> Resolved URL: gs://bucket/0/b/i.zarr|zarr3:foo
>
> If one needs to add an adapter in a relative way, how does one go about it? For example:
>
> Base URL: gs://bucket/0/a/i.zarr
> Desired Resolved URL: gs://bucket/0/a/i.zarr|zarr3:foo
>
> Which, if any, of these do you think should be used? Are any of these invalid?
>
>   • .|zarr3:foo (clearest to me)
>   • |zarr3:foo
>   • zarr3:foo

I was imagining |zarr3:foo

The existing standard interpretation of a relative url of . means to strip everything after the last slash, and we should be consistent with that. Therefore if the base url were specified as gs://bucket/0/a/i.zarr/ then .|zarr3:foo would also be valid, but probably should not be preferred.
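The two answers above can be sketched as a small resolver. This is only an illustration under stated assumptions, not the ZEP's normative algorithm: it handles only relative URLs that begin with |, treats each .. scheme as removing the last component of the base, and resolves any path given after ..: by replacing everything after the last slash of the component that remains. The function name `resolve_piped_url` is hypothetical.

```python
def resolve_piped_url(base: str, relative: str) -> str:
    """Resolve a '|'-prefixed relative URL against a piped base URL (sketch)."""
    if not relative.startswith("|"):
        raise ValueError("this sketch only handles relative URLs starting with '|'")
    parts = base.split("|")
    for component in relative[1:].split("|"):
        scheme, _, path = component.partition(":")
        if scheme != "..":
            # An ordinary adapter is appended to whatever remains of the base.
            parts.append(component)
            continue
        if len(parts) < 2:
            raise ValueError("cannot go above the root URL")
        parts.pop()  # a scheme of '..' steps up to the parent store
        if path:
            # Replace everything after the last '/' of the parent component.
            # Simplification: '..' path segments are not normalized, and a
            # root URL without any path is not handled.
            prefix, _, old_path = parts[-1].partition(":")
            head, sep, _ = old_path.rpartition("/")
            parts[-1] = f"{prefix}:{head}{sep}{path}"
    return "|".join(parts)


print(resolve_piped_url("gs://bucket/0.zip|zip:a|zarr3:i", "|..|..:1.zip|zip:b|zarr3:ii"))
print(resolve_piped_url("gs://bucket/0/a/i.zarr", "|zarr3:foo"))
```

Python's `urllib.parse.urljoin` is deliberately not used here, because it only resolves relative references for schemes it recognizes, and gs is not among them.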


jbms commented 11 months ago

On Tue, Nov 14, 2023, 07:21 John Bogovic wrote:

> One more thing:
>
> We've found it useful to be able to reference a particular part of the attributes stored in JSON with a URL. For example, for this zarr3 zarr.json
>
> { "zarr_format": 3, "node_type": "array", "shape": [10000, 1000], "dimension_names": ["rows", "columns"], "data_type": "float64", "chunk_grid": { "name": "regular", "configuration": { "chunk_shape": [1000, 100] } }, "chunk_key_encoding": { "name": "default", "configuration": { "separator": "/" } }, "codecs": [{ "name": "gzip", "configuration": { "level": 1 } }], "fill_value": "NaN", "attributes": { "foo": 42, "bar": "apples", "baz": [1, 2, 3, 4] } }
>
>   • /attributes/baz[0] points to 1
>   • /shape points to [10000, 1000]
>   • /chunk_grid/configuration points to { "chunk_shape": [1000, 100] }
>
> Could you envision adding an attributes: or zarr.json:, or similar adapter, that enables this?

Yes, having a scheme for accessing an attribute sounds like a good idea.

One option would be a specific scheme for zarr attributes, like zarr3a, e.g:

"gs://bucket/0.zip|zip:a|zarr3:i|zarr3a:/foo"

or

"gs://bucket/0.zip|zip:a/i|zarr3a:/foo"

Another option would be a json scheme for accessing any json file, e.g.:

"gs://bucket/0.zip|zip:a|zarr3:i/zarr.json|json:/attributes/foo"

Then there is the question of what syntax to use for specifying the path within the json document. A natural choice would be the existing json pointer syntax (https://datatracker.ietf.org/doc/html/rfc6901), e.g. "/transform/1". The json pointer syntax does use an unusual escaping syntax for handling member names containing "/": for example, if you have an object like:

{"foo/bar": 10, "foo~bar": 11}

then to access the 10 value you use a json pointer of "/foo~1bar", and to access the 11 value you use a json pointer of "/foo~0bar".

In my opinion this escaping mechanism is rather unfortunate since it is easy to forget the meaning of "~0" and "~1", but it isn't an issue if you can avoid using "/" or "~" in member names.
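To make the escaping concrete, here is a minimal sketch of JSON Pointer lookup following RFC 6901. The function name `resolve_json_pointer` is hypothetical, and the sketch simplifies array handling: it passes the token to `int()` and does not reject forms the RFC disallows, such as leading zeros or the "-" token.

```python
def resolve_json_pointer(document, pointer: str):
    """Look up the value that a JSON Pointer refers to in a parsed document."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape "~1" first, then "~0", so that "~01" decodes to "~1".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


doc = {"foo/bar": 10, "foo~bar": 11, "attributes": {"baz": [1, 2, 3, 4]}}
print(resolve_json_pointer(doc, "/foo~1bar"))
print(resolve_json_pointer(doc, "/foo~0bar"))
print(resolve_json_pointer(doc, "/attributes/baz/0"))
```

Note that array elements are addressed as /attributes/baz/0 in this syntax, not with the bracket form baz[0] used in the examples earlier in the thread.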

> For example: gs://bucket/0.zip|zip:a|zarr3:i|zarr.json:attributes/foo
>
> A specific use case: I often re-use and reference transformations. Since these are described by metadata (not arrays), referencing the specific metadata is helpful.
>
> For example, if this were adopted, something like this would not be uncommon in my workflows:
>
> { "type" : "sequence", "transformations" : [ { "url" : "..:/localTransformations|zarr.json:/transform[1]" }, { "url" : "gs://bucket/path/to/templateTransformation.zarr|zarr3:sharedTransforms|zarr.json:/transform[0]" } ] }
