zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

NCZarr - Netcdf Support for Zarr #41

Open DennisHeimbigner opened 5 years ago

DennisHeimbigner commented 5 years ago

I am moving the conversation about NCZarr to its own issue. See Issue https://github.com/zarr-developers/zarr/issues/317 for initial part of this discussion.

shoyer commented 3 years ago

This is incorrect. The search upwards rule only applies to the CDL representation. It is perfectly legal to use any dimension for a variable, including one arbitrarily deep in some other group. So this is why FQNs are required.

From the netCDF documentation on groups: https://www.unidata.ucar.edu/software/netcdf/docs/groups.html "Dimensions are scoped such that they are visible to all child groups. For example, you can define a dimension in the root group, and use its dimension id when defining a variable in a sub-group."

In my opinion, this is a good constraint to enforce, for the sake of user sanity :). Even if netCDF4 doesn't enforce it, we could enforce it for NCZarr.

czender commented 3 years ago

I was just pointed to this discussion by a helpful colleague. Regarding dimensions that are not defined in groups that are direct ancestors of the group(s) where they are used: as Dennis says, netCDF4 allows it. Moreover, NASA stores a large amount of data in the HDF5-EOS format. Some of those datasets (e.g., from the OMI instrument on the Aura satellite) put the geophysical fields (e.g., ozone) in groups that are cousins (not children) of the groups where the spatial/temporal dimensions are defined. I hope a solution is found that is both user/developer-friendly and ensures that NCZarr has the flexibility it needs to represent and access such datasets. I do agree it is poor practice for data producers to organize data in ways that make direct ancestor searches insufficient to resolve all the metadata values.

FYI: the CF Metadata Convention formally adopted Groups as part of CF-1.8 in 2019. However, CF-compliant datasets may only use dimensions defined in direct ancestors of the groups where they are referenced. In other words, CF does not support the full generality of dimension references supported by netCDF4, such as the HDF5-EOS datasets mentioned above.
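The difference between the two lookup rules can be sketched in a few lines. This is a hypothetical toy data model (groups as a dict of path → dimensions), not the netcdf-c implementation: upward search only sees dimensions in direct ancestor groups, while a fully qualified name (FQN) such as "/Dims/lat" can also reach a "cousin" group like the HDF5-EOS layouts described above.

```python
# Toy sketch (hypothetical data model, not netcdf-c): two ways to
# resolve a dimension reference in a group hierarchy.

def resolve_upward(groups, group_path, dim_name):
    """CDL-style scoping: search the group itself, then its ancestors."""
    parts = [] if group_path == "/" else group_path.strip("/").split("/")
    while True:
        path = "/" + "/".join(parts)
        if dim_name in groups.get(path, {}):
            return path.rstrip("/") + "/" + dim_name
        if not parts:
            return None  # not visible from this group
        parts.pop()

def resolve_fqn(groups, fqn):
    """Resolve '/Dims/lat' directly, regardless of group relationships."""
    group_path, _, dim_name = fqn.rpartition("/")
    return groups.get(group_path or "/", {}).get(dim_name)

# Dimensions defined in /Dims; the variable lives in the cousin group /Fields.
groups = {"/": {"time": 12}, "/Dims": {"lat": 180}, "/Fields": {}}
print(resolve_upward(groups, "/Fields", "lat"))  # None: /Dims is not an ancestor
print(resolve_fqn(groups, "/Dims/lat"))          # 180: the FQN still resolves
```

Upward search finds "time" from "/Fields" (it lives in the root, a direct ancestor) but not "lat"; only the FQN reaches the cousin group.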

shoyer commented 3 years ago

@czender thanks for joining us! Interesting to hear that non-hierarchical dimensions are in fact widely used. So it sounds like NCZarr needs this feature.

It sounds like this needs a new extension point like the proposed _nczarr_array attribute. The problem is that there are likely already Xarray datasets that have been written to Zarr groups without fully qualified names, and there is also other application code (e.g., neuroglancer) that expects the entries in _ARRAY_DIMENSIONS to be plain names.

That said, I do still think there is value in having a "domain agnostic" way to represent dimension names on arrays, without requiring associated dimension objects. Non-geoscience libraries like Neuroglancer may not care enough to implement the full NCZarr standard, but they would be willing to handle basic dimension names.

Ideally NCZarr could be a super-set of this domain agnostic functionality, e.g., perhaps by writing the _ARRAY_DIMENSIONS attribute even if it is redundant with _nczarr_array.
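A writer following this "write both" idea might look like the sketch below. This is an illustration of the redundancy being proposed, not the actual netcdf-c behavior; the helper name and the inclusion of `fill_value` are my own assumptions.

```python
import json

# Sketch (hypothetical helper, not netcdf-c): emit both the NCZarr-style
# FQN dimension references and the plain-name _ARRAY_DIMENSIONS attribute
# so generic readers (Xarray, Neuroglancer) keep working.

def array_metadata(shape, dtype, chunks, dimrefs):
    zarray = {
        "zarr_format": 2,
        "shape": list(shape),
        "dtype": dtype,
        "chunks": list(chunks),
        "order": "C",
        "compressor": None,
        "filters": None,
        "fill_value": None,
        "_NCZARR_ARRAY": {"dimrefs": dimrefs, "storage": "chunked"},
    }
    # Redundant plain-name form: the last path component of each FQN.
    zattrs = {"_ARRAY_DIMENSIONS": [d.rsplit("/", 1)[-1] for d in dimrefs]}
    return zarray, zattrs

zarray, zattrs = array_metadata((20, 20), "<i1", (20, 20), ["/y", "/x"])
print(json.dumps(zattrs))  # {"_ARRAY_DIMENSIONS": ["y", "x"]}
```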

shoyer commented 3 years ago

Another thought: perhaps it would make sense to name these NCZarr attributes in all capitals like _NCZARR_ARRAY (rather than lowercase). This would be a little more consistent with Xarray's _ARRAY_DIMENSIONS naming convention, and would distinguish them more clearly from typical user provided attributes.

DennisHeimbigner commented 3 years ago

I have been working on how to allow NCZarr to interoperate with Zarr's numcodecs JSON filter definitions. I have written a blog post with a proposal. It is a bit netcdf-c centric, but some in the Zarr community might find it interesting:

https://www.unidata.ucar.edu/blogs/developer/en/entry/nczarr-support-for-zarr-filters

rouault commented 3 years ago

I've given a try at netcdf-c master with the Version 2 NCZarr Extended Metadata, and I think it is slightly non-conformant with the Zarr V2 specification.

I've converted a netcdf file to the new NCZarr format with nccopy and I see in a .zarray file


{
  "zarr_format": 2,
  "shape": [
    20,
    20
  ],
  "dtype": "<i1",
  "chunks": [
    20,
    20
  ],
  "order": "C",
  "compressor": null,
  "filters": null,
  "_NCZARR_ARRAY": {
    "dimrefs": [
      "/y",
      "/x"
    ],
    "storage": "chunked"
  }
}

The presence of the _NCZARR_ARRAY key seems to contradict this point of the spec in https://zarr.readthedocs.io/en/stable/spec/v2.html#metadata : "Other keys SHOULD NOT be present within the metadata object and SHOULD be ignored by implementations."

DennisHeimbigner commented 3 years ago

There was a long conversation about this on the Zarr spec github. I pointed out that the new "dimension_separator" key violated this constraint. The consensus seemed to be that extra keys would be allowed, but must be ignored if they are not recognized by the implementation. I have not checked to see if that change has made it into the spec yet.

shoyer commented 3 years ago

I agree, I think any reasonable implementation should ignore unrecognized keys or files. Hopefully this will be codified in zarr v3.
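The tolerant behavior being agreed on here can be sketched as a reader that partitions the metadata into recognized and ignored keys. This is an illustrative sketch, not any particular implementation's code; the key set follows the Zarr v2 array metadata spec.

```python
import json

# Keys defined for a Zarr v2 array metadata object (.zarray).
KNOWN_KEYS = {
    "zarr_format", "shape", "dtype", "chunks", "order",
    "compressor", "filters", "fill_value", "dimension_separator",
}

def parse_zarray(text):
    """Parse .zarray JSON, silently ignoring extension keys like _NCZARR_ARRAY."""
    meta = json.loads(text)
    recognized = {k: v for k, v in meta.items() if k in KNOWN_KEYS}
    ignored = sorted(set(meta) - KNOWN_KEYS)
    return recognized, ignored

meta, ignored = parse_zarray(
    '{"zarr_format": 2, "shape": [20, 20], "dtype": "<i1",'
    ' "chunks": [20, 20], "order": "C", "compressor": null,'
    ' "filters": null, "_NCZARR_ARRAY": {"dimrefs": ["/y", "/x"]}}'
)
print(ignored)  # ['_NCZARR_ARRAY']
```

A reader built this way interoperates with both NCZarr output and pure-Zarr output, which is exactly the consensus described above.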


rouault commented 3 years ago

The consensus seemed to be that extra keys would be allowed, but must be ignored if they are not recognized by the implementation.

ok, thanks for the clarification

joshmoore commented 3 years ago

for reference: https://github.com/zarr-developers/zarr-python/pull/715#issuecomment-821094669

That didn't make it into a zarr-specs issue (neither v2 nor v3) as far as I can tell. Anyone up for shepherding that?

joshmoore commented 2 years ago

See the related conversation in https://github.com/pydata/xarray/issues/6374 ("Should the [xarray-]zarr backend support NCZarr conventions?")

halehawk commented 2 years ago

@DennisHeimbigner does NCZarr support any filter now?

DennisHeimbigner commented 2 years ago

Yes, although there are some complications because the code uses HDF5 filters to perform the actual filtering, and it needs extra code to convert a Zarr codec JSON format to the HDF5 unsigned integer parameters. What specific filter(s) do you need?
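The JSON-to-unsigned-integer translation Dennis mentions can be sketched as follows. The parameter tables here are illustrative assumptions, not netcdf-c's actual mappings: the point is only that each named codec parameter becomes one slot in the flat `cd_values` array an HDF5 filter receives.

```python
# Sketch of translating a NumCodecs-style JSON codec declaration into the
# flat unsigned-integer parameter list (cd_values) an HDF5 filter expects.
# The parameter orderings below are hypothetical, not netcdf-c's tables.

CODEC_PARAMS = {
    "zstd": ["level"],   # assumption: one cd_value carrying the level
    "zlib": ["level"],
}

def codec_to_cd_values(codec):
    """Map {"id": ..., param: value, ...} to an ordered list of ints."""
    names = CODEC_PARAMS[codec["id"]]
    return [int(codec[name]) for name in names]

print(codec_to_cd_values({"id": "zstd", "level": 5}))  # [5]
```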

halehawk commented 2 years ago

@DennisHeimbigner Do you have documentation about how to enable and use filters through NCZarr? We have a new codec that is not bound to any filter yet. Do you have a suggestion on how to enable it in NCZarr?

shaomeng commented 2 years ago

@DennisHeimbigner @halehawk Maybe I should jump in now ;)

I have a lossy compressor product (SPERR: https://github.com/shaomeng/SPERR) that I'm looking at paths to integrate into the Zarr format. I haven't spent too much time on it, but my understanding is that I need to make it a Zarr filter. Our immediate application of it, an ASD run of MURam, has decided to use NCZarr to output Zarr files, so the question arises whether Zarr filters are supported by NCZarr.

I guess the most direct question to @DennisHeimbigner as the NCZarr developer is, what approach do you recommend to integrate a lossy compressor to an NCZarr output?

DennisHeimbigner commented 2 years ago

If the compressor is (or easily could be) written in python, then see the NumCodecs web page. If the compressor is in C or C++, and you decide to use netcdf-c NCZarr, then you need to build an HDF5 compressor wrapper plus the corresponding codecs API. I have attached the relevant documentation. If this compressor is similar to some existing compressor such as bzip2 or zstandard, then you can copy and modify the corresponding wrapper in netcdf-c/plugins directory -- H5Zzstd.c, for example. filters.md

shaomeng commented 2 years ago

That's super helpful, thanks for the pointer! One more question: can the compression-enabled NCZarr output be read by zarr tools in the Python ecosystem?


DennisHeimbigner commented 2 years ago

Yes, IF the filters are available in NumCodecs.

halehawk commented 2 years ago

Does this mean the compressor would be better off integrated into numcodecs if it is to be used by both nczarr and zarr/xarray?


DennisHeimbigner commented 2 years ago

Sorry, I wasn't clear. Suppose you use nczarr to write a Zarr file where some of its arrays apply a filter. Then you can obviously read that file with nczarr. However, suppose you write the array with nczarr and then want others to read it using python-zarr. In that case, you will need to create a NumCodecs compatible version of your filter written in python so that the python-zarr users can read the data written by nczarr.

shaomeng commented 2 years ago

Hi @DennisHeimbigner, our team has some confusion that we would love for you to comment on.

The confusion is: do we even need to make an HDF5 filter for the SPERR compressor? Since NCZarr supports NumCodecs filters, isn't it the case that once we make a NumCodecs filter for SPERR, both NCZarr and Python-Zarr can read and write SPERR-compressed Zarr files? More generally, are there any advantages/disadvantages to producing an HDF5 filter for SPERR, if all we want is SPERR-compressed Zarr files?

DennisHeimbigner commented 2 years ago

There are two pieces here, and I am sorry I was unclear. The first piece is the declaration of the compressor for a variable in the Zarr metadata. This is specified in the "compressor" key of the .zarray metadata object for the variable. The format is defined by NumCodecs and generally has the form

{"id": "<compressor name>", "parameter1": <value>, ... "parametern": <value>}

So for zstd, we might have this: {"id": "zstd", "level": 5}

The second part is the actual code that implements the compressor.

NCZarr supports the first part so that it can read/write legal Zarr metadata. BUT, NCZarr requires its filter code to be written in C (or C++). More specifically, it does not support Python compressor code implementations. Sorry for the confusion.
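The separation between the two pieces can be made concrete with a short sketch. Reading the declaration is pure metadata handling; looking up and running the codec implementation (HDF5 wrapper in NCZarr, NumCodecs class in python-zarr) is the separate, language-specific piece.

```python
import json

# Sketch: reading the compressor *declaration* (Dennis's first piece)
# from a .zarray document. The codec *implementation* (second piece)
# would be looked up by codec_id elsewhere.

zarray = json.loads('{"zarr_format": 2, "compressor": {"id": "zstd", "level": 5}}')

compressor = zarray["compressor"]
if compressor is not None:
    config = dict(compressor)
    codec_id = config.pop("id")   # names the codec implementation to find
    params = config               # remaining keys are codec parameters
    print(codec_id, params)       # zstd {'level': 5}
```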

halehawk commented 2 years ago

@DennisHeimbigner I still need you to clarify something here. I looked at your H5Zzstd.c, which is an HDF5 plugin for zstd that supports numcodecs zstd read/write. Then I got this idea: Samuel's new compressor need not get a formal HDF5 filter ID, but he should add a similar H5Zsperr.c to your netcdf-c, since he does not need to write to HDF5/netCDF files for now. But it does need to be in numcodecs to get Zarr/numcodecs support.


shaomeng commented 2 years ago

NCZarr requires its filter code to be written in C (or C++).

Just to clarify, did you mean that NCZarr requires its filter code to be in C AND also exposed to NCZarr as HDF5 filters? I.e., NumCodecs filters won't work.

Sorry for the back and forth in this github thread. I think this is my last try and if there's still confusion, I'll try to set up a meeting and resolve it more directly :)

jbms commented 2 years ago

I don't know the details of how codecs are defined for NCZarr, but in general you will need to provide a separate implementation of a codec for each zarr implementation in which you want it supported.

Zarr-python provides a mechanism by which codecs can be registered --- numcodecs defines many codecs, and zarr-python pulls in numcodecs as a dependency, but it is actually possible to define a codec for zarr-python outside of the numcodecs package --- see for example the imagecodecs Python package.
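The registration pattern described here can be illustrated with a self-contained sketch. This is modeled on the numcodecs register_codec/get_codec idea but is a toy, not the real API, and the "reverse" codec is purely hypothetical.

```python
# Toy sketch of a codec registry (modeled on the numcodecs pattern,
# not the actual numcodecs API). get_codec() dispatches on the "id"
# key of a JSON-style config, exactly the form used in .zarray.

registry = {}

def register_codec(cls):
    registry[cls.codec_id] = cls

def get_codec(config):
    config = dict(config)
    cls = registry[config.pop("id")]
    return cls(**config)

class ReverseCodec:
    """Hypothetical codec: 'compresses' by reversing the bytes."""
    codec_id = "reverse"

    def encode(self, buf):
        return bytes(buf)[::-1]

    def decode(self, buf):
        return bytes(buf)[::-1]

    def get_config(self):
        return {"id": self.codec_id}

register_codec(ReverseCodec)
codec = get_codec({"id": "reverse"})
print(codec.decode(codec.encode(b"abc")))  # b'abc'
```

A third-party package (like imagecodecs in the real ecosystem) would simply call the registration hook at import time, with no changes needed in the core library.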

DennisHeimbigner commented 2 years ago

@DennisHeimbigner I still need you to clarify something here. So I looked at your H5Zzstd.c which is a HDF5 plugin for zstd and supports numcodec zstd read/write. Then I got this idea, if Samuel's new compressor need not get a formal HDF5 filter ID, but should add a similar H5Zsperr.c

That is correct. The HDF Group reserves ids 32768 to 65535 for unregistered use. So Samuel can pick a number in that range for his filter; later, if desired, a formal HDF5 filter id can be assigned.

joshmoore commented 2 years ago

First a big :100: for the discussion, since this is exactly what we want to see happening for cross-implementation support of codecs. @shaomeng & @halehawk, don't hesitate to keep asking.

I do wonder, @DennisHeimbigner, whether we want to establish a channel for further NCZarr questions. If so, I'd say we update the start and end of this thread with that URL and close this issue.

Others may want to express an opinion, but if it's useful, we can have a no-code location like github.com/zarr-developers/nczarr for people to find a README pointing to the netcdf-c implementation's resources.

cc: @WardF

WardF commented 2 years ago

Sorry for the late comment on this; I would agree that maybe a 'Github Discussions' post would be a better place for this, instead of the issue we are working within. We can create that over at the netcdf-c repository, or we could create one here in the appropriate zarr-* repositories. There are arguments to be made for either, so I am happy to go with what makes the most sense for the broader group :).

briannapagan commented 1 year ago

21-050r1_Zarr_Community_Standard.pdf Adding this here for reference in the convo.

dblodgett-usgs commented 1 year ago

Pertinent text from @briannapagan's link above...

Beginning with NetCDF-C version 4.8.0, Unidata introduced experimental Zarr support into the NetCDF-C library. This was accomplished via creating a new specification - NCZarr - which is “similar to, but not identical with the Zarr Version 2 Specification.” Specifically, NCZarr adds two additional metadata files (“.nczarray" and ".nczattr”), which are not part of the Zarr V2 Spec. Since NCZarr stores are not fully compatible and interoperable with Zarr V2, this community standard excludes NCZarr. Work is ongoing to reconcile NCZarr and the architectural reasons that motivated its development with the forthcoming Zarr V3 Specification. Fortunately, the NetCDF-C library also supports reading / writing of data using the simpler Named Dimension convention described in 4.1.

DennisHeimbigner commented 1 year ago

That information is out-of-date in a couple of ways.

  1. The metadata files (".nczarray" and ".nczattr") are no longer used; they were replaced with special dictionary entries.
  2. I believe the spec was changed to specify that unrecognized elements (objects and dictionary entries) should be ignored by any implementation that does not recognize them.
  3. With #2 in effect, nczarr-created files can be read by pure Zarr implementations, and nczarr can read pure Zarr files.
dblodgett-usgs commented 1 year ago

Thanks for calling that out, @DennisHeimbigner. This came out of a conversation over here. https://github.com/zarr-developers/geozarr-spec/issues/22

There are very few people who have a deep enough understanding of the moving parts here to answer all the questions. It's good to hear that we basically have interoperability.

Two questions:

  1. Do you feel like we even need to worry about the distinction right now?
  2. Is there a current document we should be using to learn about the nuances between "pure zarr" and "nczarr"?