zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/

NCZarr - Netcdf Support for Zarr #41

Open DennisHeimbigner opened 5 years ago

DennisHeimbigner commented 5 years ago

I am moving the conversation about NCZarr to its own issue. See Issue https://github.com/zarr-developers/zarr/issues/317 for initial part of this discussion.

DennisHeimbigner commented 5 years ago

Naming issue: I have just about convinced myself that rather than creating KVP level objects like .zdimensions, I should just use the existing Zarr attribute mechanism. In order to do this, it is necessary to set up some naming conventions for such attributes. Basically, we need a way to mark an attribute as special (and probably hidden) and to identify which extension(s) it applies to. For NCZarr, let me propose this:

  1. All such attributes start with two underscores.
  2. Next is a 2-4 character tag specific to the extension: "NCZ" for NCZarr.
  3. Next is another underscore.
  4. The rest is the attribute name.

So, we might have "__NCZ_dimensions" instead of .zdimensions.
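
For illustration, a minimal sketch of what a .zattrs object might then contain (the key and payload below are made up, not part of any spec):

# Hypothetical .zattrs contents under the proposed "__NCZ_" naming convention
zattrs = {
    "units": "degrees_north",            # ordinary, user-visible attribute
    "__NCZ_dimensions": ["lat", "lon"],  # hidden, NCZarr-specific attribute
}

# An extension-aware reader can pick out its own attributes by prefix;
# a plain Zarr reader simply sees two ordinary attributes.
ncz = {k: v for k, v in zattrs.items() if k.startswith("__NCZ_")}
print(ncz)  # {'__NCZ_dimensions': ['lat', 'lon']}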

jakirkham commented 5 years ago

Thanks for opening this @DennisHeimbigner.

Encountered issue ( https://github.com/zarr-developers/zarr/issues/280 ) again recently. So figured that might interest you given some of this discussion about how to manage specs. Though each issue has its own place I think.

If we do go down the attribute road, agree that having some non-conflicting name convention is important. The other option might be to widen the spec of things like .zarray to allow specs subclassing Zarr's spec to add additional relevant content here as others have mentioned. A third option similar to what you have done would be to add something like .zsubspec, which users can fill as needed. We might need certain keys in there like subspec name, subspec version, etc., but otherwise leave it to users to fill these out as needed.

alimanfoo commented 5 years ago

Thanks @DennisHeimbigner.

Just to add that, on the question of whether to pack everything into attributes (.zattrs) or whether to store metadata separately under other store-level keys (.zdims, .ztypdefs, etc.), I think both are reasonable and personally I have no objection to either.

I lean slightly towards using attributes (.zattrs) because it plays nicely with some existing API features. E.g., the NCZ metadata can be accessed directly via the attributes API. And, e.g., the NCZ metadata would get included if using consolidated metadata, which is an experimental approach to optimising cloud access, available in the next release of Zarr Python. But neither of these are blockers to the alternative approach, because it is straightforward to read and decode JSON objects directly from a store, and it would also be straightforward to modify the consolidated metadata code to include other objects.

DennisHeimbigner commented 5 years ago

We have learned from the existing netcdf-4 that datasets exist with very large (~14mb) metadata. I was looking at the Amazon S3 query capabilities and they are extremely limited. So the idea of consolidated metadata seems like a very good idea. This reference: https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata does not provide any details of the form of the proposed consolidated metadata. Note that there may not be any point in storing all of the metadata, especially if lazy reading of metadata is being used (as it is in the netcdf-4 over hdf5 implementation). Rather I think that what is needed is just a skeleton so that query is never needed: we would consolidate the names and kinds (group, variable, dimension, etc) and leave out e.g. attributes and variable types and shapes.

DennisHeimbigner commented 5 years ago

Here is a proposed consolidated metadata structure for NCZarr. It would be overkill for standard Zarr, which is simpler. Sorry if it is a bit opaque, since it is a partial Antlr grammar. nczmetadata.txt

alimanfoo commented 5 years ago

We have learned from the existing netcdf-4 that datasets exist with very large (~14gb) metadata.

Wow, that's big. I think anything near that size will be very sub-optimal in zarr, because of metadata being stored as uncompressed JSON documents. I wonder if in cases like that, it might be necessary to examine what is being stored as metadata, and if any largish arrays are included then consider storing them as arrays rather than as attributes.

I was looking at the Amazon S3 query capabilities and they are extremely limited. So the idea of consolidated metadata seems like a very good idea. This reference: https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata does not provide any details of the form of the proposed consolidated metadata.

Apologies the format is not documented as yet. There's an example here:

https://github.com/zarr-developers/zarr/pull/268#issuecomment-435621394

DennisHeimbigner commented 5 years ago

That was a typo. The correct size is 14 mb.

alimanfoo commented 5 years ago

That was a typo. The correct size is 14 mb.

Ah, OK! Although 14MB is still pretty big, it's probably not unmanageable.

DennisHeimbigner commented 5 years ago

Depends on what manageable means, I suppose. We have situations where projects are trying to load a small part of the metadata from thousands of files, each of which has that amount of metadata. Needless to say, this is currently very slow. We are trying various kinds of optimizations around lazy loading of metadata, but the limiting factor will be HDF5. A similar situation is eventually going to occur here, so thinking about various optimizations is important.

alimanfoo commented 5 years ago

Depends on what manageable means, I suppose. We have situations where projects are trying to load a small part of the metadata from thousands of files, each of which has that amount of metadata. Needless to say, this is currently very slow. We are trying various kinds of optimizations around lazy loading of metadata, but the limiting factor will be HDF5. A similar situation is eventually going to occur here, so thinking about various optimizations is important.

That's helpful to know.

FWIW the consolidated metadata feature currently in zarr python was developed for the xarray use case, where the need (as I understand it) is to load all metadata up front. So that feature combines the content from all .zarray, .zgroup and .zattrs objects from the entire group and dataset hierarchy into a single object, which can then be read from object storage in a single HTTP request.
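
A rough sketch of that round trip with zarr-python 2.x (the store path and array are illustrative):

import zarr

# Write a small hierarchy, then gather all .zgroup/.zarray/.zattrs objects
# into a single '.zmetadata' object in the same store.
store = zarr.DirectoryStore("example.zarr")
root = zarr.group(store=store)
root.create_dataset("temperature", shape=(100, 100), chunks=(10, 10), dtype="f4")
zarr.consolidate_metadata(store)

# Later, opening via the consolidated object needs only one metadata read.
root = zarr.open_consolidated(store)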

If you have use cases where you have a large amount of metadata but only need to read parts of it at a time, that obviously might not be optimal. However, 14MB is not an unreasonable amount to load from object storage, and would probably be fine to do interactively (IIRC bandwidth to object storage from compute nodes within the same cloud is usually ~100MB/s).

I'm sure there would be other approaches that could be taken too that could support partial/lazy loading of metadata. Happy to discuss at any point.

jakirkham commented 5 years ago

Are you able to provide data on where most of the time is being spent, @DennisHeimbigner?

DennisHeimbigner commented 5 years ago

Issue: Attribute Typing. I forgot to address one important difference between the netcdf-4 model and Zarr: attribute typing. In netcdf-4, attributes have a defined type. In Zarr, attributes are technically untyped, although in some cases it is possible to infer a type from the value of the attribute.

This is most important with respect to the _FillValue attribute for a variable. There is an implied constraint (in netcdf-4 anyway) that the type of the attribute must be the same as the type of the corresponding variable. There is no way to guarantee this for Zarr except by doing type inference.

Additionally, if the variable is of a structured type, there is currently no standardized way to define the fill value for such a type nor is there a way to use structured types with other, non-fillvalue, attributes.

Sadly, this means that NCZarr must add yet another attribute that specifies the types of other attributes associated with a group or variable.
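
A small illustration of the gap, using nothing beyond plain JSON attributes (the names and values are made up):

import json
import numpy as np

# The JSON value alone does not pin down a netCDF type; a reader can only guess.
zattrs = json.loads('{"valid_min": 0, "scale_factor": 0.01}')

guessed = {k: np.asarray(v).dtype for k, v in zattrs.items()}
print(guessed)
# e.g. {'valid_min': dtype('int64'), 'scale_factor': dtype('float64')},
# but the writer may have intended int8/int16/float32, etc.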

alimanfoo commented 5 years ago

Hi @DennisHeimbigner,

Regarding the fill value specifically, the standard metadata for a zarr array includes a fill_value key. There are also rules about how to encode fill values to deal with values that do not have a natural representation in JSON. This includes fill values for arrays with a structured dtype. If possible, I would suggest to use this feature of standard array metadata, rather than adding a separate _FillValue attribute. If not, please do let us know what's missing, that would be an important piece of information to carry forward when considering spec changes.

Regarding attributes in general, we haven't tried to standardise any method to encode values that do not have a natural JSON representation. Currently it is left to the application developer to decide their own method for encoding and decoding values as JSON, e.g., I believe xarray has some logic for encoding values in zarr attributes. There has also been some discussion of this at #354 and #156.

Ultimately it would be good to standardise some conventions (or at least define some best practices) for representing various common value types in JSON, such as typed arrays. I'm more than happy for the community to lead on that.

DennisHeimbigner commented 5 years ago

This reference -- https://zarr.readthedocs.io/en/stable/spec/v2.html#fill-value-encoding -- does not appear to address fill values for structured types. Did you get the reference wrong?

alimanfoo commented 5 years ago

If an array has a fixed length byte string data type (e.g., "|S12"), or a structured data type, and if the fill value is not null, then the fill value MUST be encoded as an ASCII string using the standard Base64 alphabet.

I.e., use base 64 encoding.
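
A sketch of how that rule plays out with numpy (the dtype is illustrative; this mirrors what zarr-python does, as I understand it):

import base64
import numpy as np

# The fill value is the raw bytes of a scalar of the array's dtype,
# encoded as an ASCII string with the standard Base64 alphabet.
dtype = np.dtype([("id", "<i4"), ("value", "<f8")])
fill = np.zeros((), dtype=dtype)

encoded = base64.standard_b64encode(fill.tobytes()).decode("ascii")
# 'encoded' is what would appear under "fill_value" in .zarray

decoded = np.frombuffer(base64.standard_b64decode(encoded), dtype=dtype)[0]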

DennisHeimbigner commented 5 years ago

So it would be nice if we had a defined language-independent algorithm that defines how to construct the fill value for all possible struct types (including recursion for nested structs). This should be pretty straightforward. Also, why force a string (base64) encoding? Why not make the fill value be just another JSON structure? It worries me how Python-specific much of the spec around types is.

alimanfoo commented 5 years ago

So it would be nice if we had a defined language-independent algorithm that defines how to construct the fill value for all possible struct types (including recursion for nested structs). This should be pretty straightforward

That would be good. I believe numpy mimics C structs, further info here.

Looking again at the numpy docs, there is support for an align keyword when constructing a structured dtype, which changes the itemsize and memory layout. This hasn't been accounted for in the zarr spec; I suspect that things are currently broken if someone specifies align=True (the default is False).
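
For example, the same two-field dtype packed vs. aligned (the field names are arbitrary):

import numpy as np

packed = np.dtype([("a", "i1"), ("b", "<i8")], align=False)
aligned = np.dtype([("a", "i1"), ("b", "<i8")], align=True)

# align=True pads the struct so that "b" starts on an 8-byte boundary.
print(packed.itemsize, packed.fields["b"][1])    # 9 1
print(aligned.itemsize, aligned.fields["b"][1])  # 16 8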

Also, why force a string (base64) encoding? Why not make the fill value be just another JSON structure?

That's a nice idea, would fit with the design principle that metadata are human-readable/editable.

It worries me how Python-specific much of the spec around types is.

The zarr spec does currently defer to numpy as much as possible, assuming that much of the hard thinking around things like types has been done there already.

If there are clarifications that we could make to the v2 spec that would help people develop compatible implementations in other languages then I'd welcome suggestions.

Thinking further ahead to the next iteration on the spec, it would obviously be good to be as platform-agnostic as possible, however it would also be good to build on existing work rather than do any reinvention. The work on ndtypes may be relevant/helpful there.

alimanfoo commented 4 years ago

Surfacing here notes on the NetCDF NCZarr implementation, thanks @DennisHeimbigner for sharing.

alimanfoo commented 4 years ago

Also relevant here, documentation of xarray zarr encoding conventions, thanks @rabernat.

rsignell-usgs commented 3 years ago

@DennisHeimbigner: It looks like Unidata's Netcdf C library can now read data with the xarray zarr encoding conventions, right?

@rabernat, should I raise an issue for xarray to also support the Unidata NcZarr conventions?

WardF commented 3 years ago

The ability to read xarray is in the main branch, and will be in the upcoming 4.8.1 release. I am shaving the yak to get our automated regression and integration test infrastructure back up and running but we hope to have 4.8.1 out shortly.

rabernat commented 3 years ago

@rabernat, should I raise an issue for xarray to also support the Unidata NcZarr conventions?

I see this as very difficult. The reason is that the ncZarr conventions use files outside of the zarr hierarchy. We would probably need to implement a compatibility layer as a third-party package, similar to h5netcdf.

p.s. but yes, please open an xarray issue to keep track of it.

shoyer commented 3 years ago

One thing I'll note on Xarray's convention for Zarr is that we will likely change things in the near future to always write and expect "consolidated metadata" (see https://github.com/pydata/xarray/issues/5251). This is almost completely backwards compatible, but if NcZarr writes these consolidated metadata fields in Xarray compat mode we could load these Zarr stores much quicker in Xarray.

Consolidated metadata would probably be a nice feature for NcZarr, too, because it reduces the number of files that need to be queried for metadata down to only one. I think there was a similar intent behind the .nczgroup JSON field. Consolidated metadata is sort of a super-charged version of that.

DennisHeimbigner commented 3 years ago

NCZarr gets a similar improvement by doing lazy reads of metadata objects. That is one problem with _ARRAY_DIMENSIONS -- it requires us to read all attributes even if otherwise unneeded. NCZarr avoids this by keeping the dimension names separate. As for consolidated metadata, I assume you are NOT saying that any pure zarr container that does not contain the consolidated metadata will be unreadable by Xarray.

shoyer commented 3 years ago

NCZarr gets a similar improvement by doing lazy reads of metadata objects. That is one problem with _ARRAY_DIMENSIONS -- it requires us to read all attributes even if otherwise unneeded. NCZarr avoids this by keeping the dimension names separate.

In Xarray, we have to read nearly all the metadata eagerly to instantiate xarray.Dataset objects.

As for consolidated metadata, I assume you are NOT saying that any pure zarr container that does not contain the consolidated metadata will be unreadable by Xarray.

This is correct, you don't need to write consolidated metadata. But if you do, Xarray will be able to read the data much faster.
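
For example, the usual pattern on the reading side looks something like this (the URL is illustrative; an fsspec backend such as s3fs is assumed):

import xarray as xr

# With consolidated=True, xarray fetches the single .zmetadata object instead
# of probing each .zgroup/.zarray/.zattrs key in the store.
ds = xr.open_zarr("s3://example-bucket/model-output.zarr", consolidated=True)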

As for whether netCDF users would notice a difference with consolidated metadata, I guess it would depend on their use-cases. Lazy metadata reads are great, but for sure it is faster to download a single small file than to download multiple files in a way that cannot be fully parallelized, even if they add up to the same total size.

DennisHeimbigner commented 3 years ago

faster to download a single small file than to download multiple files

true, but we have use cases where the client code is walking a large set of netcdf files and reading a few pieces of information out of each of them and where the total metadata is large (14 megabytes). This can occur when one has a large collection of netcdf files covering some time period and each netcdf file is a time slice (or slices). Perhaps Rich Signell would like to comment with his experience.

joshmoore commented 3 years ago

https://github.com/zarr-developers/zarr-specs/issues/41#issuecomment-833024978 I see this as very difficult. The reason is that the ncZarr conventions use files outside of the zarr hierarchy. We would probably need to implement a compatibility layer as a third-party package, similar to h5netcdf.

For what it's worth, I could see making some movement (June-ish?) on https://github.com/zarr-developers/zarr-specs/issues/112#issuecomment-825690209 to permit the additional files. But either way, certainly https://github.com/ome/ngff/pull/46#pullrequestreview-652899174 (related issue) would suggest hammering out a plan for this difference before another package introduces a convention.

https://github.com/zarr-developers/zarr-specs/issues/41#issuecomment-833036094 One thing I'll note on Xarray's convention for Zarr is that we will likely change things in the near future to always write and expect "consolidated metadata" (see pydata/xarray#5251). This is almost completely backwards compatible, but if NcZarr writes these consolidated metadata fields in Xarray compat mode we could load these Zarr stores much quicker in Xarray.

Having gone through https://github.com/pydata/xarray/issues/5251 I'm slightly less worried about this than when I first read it (I had assumed it meant Xarray would only support consolidated metadata), but having just spent close to 2 months trying to get dimension_separator "standardized", I'd like to raise a flag that consolidated metadata is a similar gray area. It'd be nice to get it nailed down.

rsignell-usgs commented 3 years ago

@DennisHeimbigner, just a quick comment that I too always use consolidated metadata when writing Zarr. Here's a recent example with coastal ocean model output we are publishing, where consolidated metadata is an order of magnitude faster to open:

DennisHeimbigner commented 3 years ago

Note that the issue for me is: for what use-cases is lazy metadata download better than consolidated metadata. The latter is better in the cases where you know that you need to access almost all of the meta-data or where the total size of the metadata is below some (currently unknown) size threshold. My speculation is that the access patterns vary all over the place and are highly domain dependent. I infer that Rich's use case is one where all the metadata is going to be accessed.

In any case, once .zmetadata is well-defined (see Josh's previous comment) I will be adding it to nczarr. However, we will probably give the user the choice to use it or not if lazy download makes more sense for their use-case.

On the other side, it seems to me that zarr-python might profitably explore lazy download of the metadata.

shoyer commented 3 years ago

Note that the issue for me is: for what use-cases is lazy metadata download better than consolidated metadata. The latter is better in the cases where you know that you need to access almost all of the meta-data or where the total size of the metadata is below some (currently unknown) size threshold. My speculation is that the access patterns vary all over the place and are highly domain dependent. I infer that Rich's use case is one where all the metadata is going to be accessed.

Agree! I'm sure there are cases where using consolidated metadata is not a great idea, though my guess is that they are relatively rare.

In any case, once .zmetadata is well-defined (see Josh's previous comment) I will be adding it to nczarr. However, we will probably give the user the choice to use it or not if lazy download makes more sense for their use-case.

Sounds great, thanks!

On the other side, it seems to me that zarr-python might profitably explore lazy download of the metadata.

As I understand it, this is already the case in Zarr-Python (if not using consolidated metadata). It's just that lazy metadata does not work for Xarray.

In particular, I think there is definitely a place for including an explicit "index" of arrays in a group that doesn't require potentially expensive directory listing. Hopefully this is already in the draft v3 spec (I haven't checked).

DennisHeimbigner commented 3 years ago

Can you elaborate on what you mean by this?

I think there is definitely a place for including an explicit "index" of arrays in a group that doesn't require potentially expensive directory listing.

Can you give an example?

DennisHeimbigner commented 3 years ago

Rich- are you using zarr-python directly or using xarray?

shoyer commented 3 years ago

Can you elaborate on what you mean by this?

I think there is definitely a place for including an explicit "index" of arrays in a group that doesn't require potentially expensive directory listing.

Can you give an example?

I was thinking of exactly the information you store in .nczgroup.

shoyer commented 3 years ago

Rich- are you using zarr-python directly or using xarray?

It looks like Rich is using Xarray here.

DennisHeimbigner commented 3 years ago

ok

DennisHeimbigner commented 3 years ago

I get the impression from various conversations that people do not like the nczarr convention of using extra objects like .nczarray. As an alternative, we can store the same netcdf specific metadata as extra keys inside the various zarr standard objects. Thus most of the stuff in .nczarray could be inserted into .zarray with, say, a key named _nczarr_array. My impression is that adding extra keys might be more acceptable than adding extra objects, but opinions welcome.

joshmoore commented 3 years ago

Thinking about

I get the impression from various conversations that people do not like the nczarr convention of using extra objects like .nczarray.

and

@shoyer https://github.com/zarr-developers/zarr-specs/issues/41#issuecomment-833184536 But if you do, Xarray will be able to read the data much faster.

I wonder if a .xarray file back in the day wouldn't have led to the need for consolidated metadata? i.e. are the two intertwined?

DennisHeimbigner commented 3 years ago

".xarray" -- I like it. Too bad it wasn't used. Still, it is not too late.

shoyer commented 3 years ago

I get the impression from various conversations that people do not like the nczarr convention of using extra objects like .nczarray.

I think separate objects are at least as inherently reasonable as custom attributes. There are clearly tradeoffs to both approaches.

Of course, it would be nice if we could use a shared standard, both:

  1. for storing custom metadata needed by Zarr extensions, and
  2. for representing dimension names on arrays

I wonder if a .xarray file back in the day wouldn't have led to the need for consolidated metadata? i.e. are the two intertwined?

I wasn't involved in the discussion around adding consolidated metadata. I'm sure Xarray was one of the motivating use-cases, but performance concerns around opening lots of small files are definitely not unique to Xarray, as a search for uses on GitHub confirms. In my opinion, the concerns around performance with consolidated metadata and storing dimension names can be decoupled, so I would not favor combining them into .xarray.

If you look at the logic for consolidate_metadata(), it just gathers together the metadata files found in a store: https://github.com/zarr-developers/zarr-python/blob/ab5d91b62b83214d5a2b250d830d1996bb08cc56/zarr/convenience.py#L1112

producing a JSON document that looks something like:

{
    'zarr_consolidated_format': 1,
    'metadata': {
        '.zgroup': ...,
        '.zattrs': ...,
        'array/.zarray': ...,
        'subgroup/.zgroup': ...,
        ...
    }
}

It would be pretty reasonable to use the same approach for consolidating arbitrary metadata extensions stored in separate files, including all those used by NCZarr.
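
A sketch of what the reading side could then look like, with a single store read covering standard and (hypothetically) extension metadata alike (the path and extension key are illustrative):

import json
import zarr

store = zarr.DirectoryStore("example.zarr")         # path illustrative
meta = json.loads(store[".zmetadata"])["metadata"]  # one read for everything

zgroup = meta[".zgroup"]
nczgroup = meta.get(".nczgroup")  # only present if extensions were consolidated too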

joshmoore commented 3 years ago

I'd definitely be for having a clear relationship between the consolidated metadata and the extra keys (https://github.com/zarr-developers/zarr-specs/issues/112#issuecomment-825690209) but I'll point out that they are not guaranteed to be JSON, so we'd either need to detect JSON and ingest or have a naming scheme for detectable "extension" files.

rabernat commented 3 years ago

Let me first state clearly that the technical discussion here and my comments below are not intended as a personal or institutional criticism of anyone; this discussion is a necessary part of the (sometimes slow / frustrating) community open source development process. I thank everyone for engaging with such patience and feel nothing but respect for everyone in the conversation.

The need for .zmetadata is most definitely not unique to Xarray. The ability of stores to "list" a directory was assumed in the V2 spec. However, certain types of stores (e.g. vanilla HTTP) are inherently unlistable. Other stores are slow to list, such as s3 buckets with hundreds of thousands of objects or directories on certain HPC filesystems.

.zmetadata mitigates these problems in two ways:

  1. It explicitly enumerates all of the relevant Zarr metadata objects (.zarray, .zattrs, .zgroup) within a directory. This means that you never need to explicitly "list" the store.
  2. It actually duplicates the contents of those files within a nested json structure. This reduces the number of distinct read operations you need to open a dataset (to one). This is important for high-latency stores, although the problem could also be mitigated with async.

The original consolidated metadata PR (https://github.com/zarr-developers/zarr-python/pull/268) contains more useful and relevant discussion.

".xarray" -- I like it. Too bad it wasn't used. Still, it is not too late.

I don't like this idea and think it would have been a mistake. Xarray is not a data / metadata standard. It's an analysis library. Xarray's data model is explicitly based on netCDF: "[xarray.Dataset] is designed as an in-memory representation of the data model from the netCDF file format." Xarray made a different choice from Unidata about how to encode netCDF into Zarr, but the end goal is the same.

I get the impression from various conversations that people do not like the nczarr convention of using extra objects like .nczarray.

I am admittedly one of those people. However, I understand completely that when developing something new, certain technical decisions have to be made in order to bootstrap a project. I am familiar with this challenge, because I am part of the group (together with @jhamman and @shoyer) who made the original decision about how to encode / decode Xarray Datasets into Zarr. Reviewing that original PR (https://github.com/pydata/xarray/pull/1528) is instructive. I have learned a lot since then, and probably would have made different choices.

At the time, we thought that the only thing we had to solve was: how do we encode named dimensions in Zarr? We made the ad hoc choice of using the _ARRAY_DIMENSIONS key within the standard array metadata (.zattrs). A different choice would have been to create an extra metadata object (a la nczarr); however, that would have been much harder to implement because zarr-python itself would not know about such objects, so we would have had to go around zarr-python. As it is, xarray's zarr implementation sits completely on top of zarr-python and never interacts directly with the lower-level stores. In general, I still think this is a good model for higher-level libraries to follow.
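
For reference, the convention amounts to each array's .zattrs carrying a list of dimension names, one per axis (the names below are illustrative):

# Per-array .zattrs under the xarray convention
zattrs = {
    "_ARRAY_DIMENSIONS": ["time", "lat", "lon"],
    "units": "kelvin",
}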

Hindsight is 20/20. In retrospect, a better (but slower) path would have been to work within the Zarr community to standardize an official convention for named dimensions. Maybe it would have been an extra metadata object. I personally think it would fit well in .zarray as an optional key. Maybe it would have been _ARRAY_DIMENSIONS in .zattrs. In any case, it would have been an accepted standard.

Now with the release of nczarr, we have even more ad hoc conventions that were developed outside of the Zarr community process. For Unidata, there was never a choice to build on top of an existing base Zarr implementation, because you were writing everything from scratch to begin with, so this distinction probably seemed academic; why not just add new features to Zarr while you're at it? However, I do wonder whether it would have been best to first build a Zarr C library with feature parity with zarr-python, rather than going directly to nczarr. (Similar to the relationship between HDF5 and netCDF.)

In any case, both implementers (the Xarray devs and the Unidata devs) went forward with a reasonable but unilateral technical decision about how to encode stuff on top of the base Zarr data model. It's a cliche to post this cartoon, but...

XKCD standards

The problem we all face now is: how do implementations support an expanding set of ad hoc conventions? Is it too late to deprecate some of the conventions that are out there? There are petabytes of data in the wild using the xarray _ARRAY_DIMENSIONS convention, so I think that ship has sailed. And it's not very problematic, since the data are 100% compatible with vanilla Zarr. But what about nczarr? Are any data providers serving this format yet?

Going forward, I still think the best path is to add optional named dimensions to the Zarr spec, and then have downstream data models leverage this, rather than coming up with new conventions outside of the spec. Overall, it would be great to absorb the lessons of nczarr into the V3 spec process and provide a standard, community-consensus-based way to extend the spec in an open and transparent way.

rabernat commented 3 years ago

A final comment on process.

The nczarr development has occurred openly on GitHub via PRs: https://github.com/Unidata/netcdf-c/pulls?q=is%3Apr+nczarr. There was an opportunity almost a year ago for us (Zarr + Xarray python community) to engage with the nczarr decisions (e.g. https://github.com/Unidata/netcdf-c/pull/1769) and provide feedback before a release of nczarr. We missed this opportunity. It was a difficult year for everyone (to say the least! 😷), and a lot of balls got dropped. But going forward, let's try to find the bandwidth for this type of engagement and collaboration, particularly around the V3 spec.

DennisHeimbigner commented 3 years ago

First let me be clear that I have no problem with .zmetadata, and Unidata will release it in NCZarr as soon as it is standardized. Second, it is still not too late to modify the NCZarr extensions since it is unlikely that they have been used for significant amounts of data.

DennisHeimbigner commented 3 years ago

One thing that surprises me is that the HDF5 community has not weighed in on this. Mapping HDF5 to Zarr is also going to need similar extensions as required by NetCDF-4.

rabernat commented 3 years ago

One thing that surprises me is that the HDF5 community has not weighed in on this. Mapping HDF5 to Zarr is also going to need similar extensions as required by NetCDF-4.

I don't think that HDF5 / Zarr interoperability is a priority for HDFGroup. I think they have their hands full supporting their own format and products, and they have no incentive to support Zarr.

From the Zarr perspective, we have tried hard to make Zarr-python API-compatible with h5py, so users can easily swap one for another. And we provide utilities to copy to / from HDF5.

DennisHeimbigner commented 3 years ago

New Representation of NCZarr extended metadata

After much discussion inside Unidata, and taking account of comments here, Unidata would like to propose a new representation for the NCZarr extensions to the Zarr format.

The new NCZarr extensions are stored as standard Zarr attributes in the .zattrs object. They use a naming convention to mark those attributes as NCZarr specific. The proposed naming convention is to prefix each attribute with the string _nczarr. Unrecognized attributes prefixed with this string are ignored. But developers should be aware that Unidata will probably add new attributes with this prefix as support for NetCDF-4 is enhanced.

The value of these attributes is intended to be an arbitrary JSON object. But Zarr only supports simple values or arrays of such. To bypass this constraint, an nczarr attribute's value consists of a single string. This string is assumed to encode an arbitrary JSON object, so it can be parsed to obtain the actual extended metadata. As a rule, the parsed value is a dictionary. Unrecognized keys are ignored to allow for future changes.

Proposed Extension Attributes

The NCZarr extension metadata is stored as follows.

Extended .zgroup Attributes

_nczarr_version

The root group stores the NCZarr version number. This is stored only in the root group.

Example:

"_nczarr_version": "2.0.0"

_nczarr_group

This contains a dictionary listing the names of variables (arrays), and subgroups contained (non-transitively) in a group. It also contains both the names and sizes of dimensions defined in that group. Note that eventually, the size field will allow some mechanism for specifying that the dimension has an "unlimited" size (in the NetCDF/HDF5 sense of unlimited).

Example:

"_nczarr_group": "{
\"dimensions\": {\"d1\": \"1\", \"d2\": \"1\",...}
\"variables\": [\"v1\", \"v2\", ...]
\"groups\": [\"g1\", \"g2\", ...]
}"

Extended .zarray Attributes

_nczarr_array

This contains a dictionary listing the dimension references as fully qualified names. Ideally, we would use the XArray "_ARRAY_DIMENSIONS" convention, but it does not support fully qualified names.

As an alternate representation, the "_ARRAY_DIMENSIONS" attribute could be extended to allow the storage of fully qualified names. As long as XArray is restricted to a single, root group, there would be no conflict. That is, a multigroup Zarr file is not XArray compatible, so adapting the contents of "_ARRAY_DIMENSIONS" presents no conflict. Using this alternative would, of course, require permission from the XArray group. It also depends on what other implementations do when they encounter "_ARRAY_DIMENSIONS" in a multi-group Zarr file.

Example:

"_nczarr_array": "{
\"dimensions\": [\"/g1/g2/d1\", \"/d2\",...]
}"

_nczarr_attrs

This contains a dictionary listing the attribute types.

Example:

"_nczarr_attrs": "{
\"types\": {\"attr1\": \"<i4\", \"attr2\": \"<i1\",...}
}"

Last Revised: 5/15/2021

shoyer commented 3 years ago

The value of these attributes is intended to be an arbitrary JSON object. But Zarr only supports simple values or arrays of such.

I don't think that's the case? I agree it's a little vague in the spec, but I'm pretty sure the intent is to allow for storing arbitrary JSON documents in values, e.g., I'm pretty sure this is valid in the Zarr Python client. "Simple" is not a subset of JSON, it's a description of JSON itself 😀

Assuming I'm right I would definitely suggest storing normal JSON documents. This is much more readable than JSON encoded strings.
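
A quick check along those lines with zarr-python (in-memory group; the keys and values are illustrative):

import zarr

g = zarr.group()  # in-memory store
g.attrs["_nczarr_group"] = {
    "dimensions": {"d1": 1, "d2": 1},
    "variables": ["v1", "v2"],
    "groups": ["g1", "g2"],
}
print(g.attrs["_nczarr_group"]["variables"])  # ['v1', 'v2']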

DennisHeimbigner commented 3 years ago

Thanks for the info. If and when the spec clarifies, then it is worth revisiting.

shoyer commented 3 years ago

This contains a dictionary listing the dimension references as fully qualified names. Ideally, we would use the XArray "_ARRAY_DIMENSIONS" convention, but it does not support fully qualified names.

As an alternate representation, the "_ARRAY_DIMENSIONS" attribute could be extended to allow the storage of fully qualified names. As long as XArray is restricted to a single, root group, there would be no conflict. That is, a multigroup Zarr file is not XArray compatible, so adapting the contents of "_ARRAY_DIMENSIONS" presents no conflict. Using this alternative would, of course, require permission from the XArray group. It also depends on what other implementations do when they encounter "_ARRAY_DIMENSIONS" in a multi-group Zarr file.

I'm not entirely opposed to fully qualified names, but I'm not sure I agree we need them.

First, to clarify: the current version of Xarray can actually load datasets from Zarr groups, but only from a single group at a time.

As I understand the netCDF4 data model, dimensions are visible to the arrays in a group and all sub-groups. Duplicate dimensions at the same level are not allowed. It is allowed to override dimensions in a sub-group. So far, this is all compatible with unqualified dimension names, with the "nearest" name taking precedence.

Are you concerned about cases where a dimension has a different size in a sub-group, but a variable in that sub-group uses a dimension of the same name from a parent group? If this is indeed possible in netCDF-C, I sincerely hope that there are no files that use this feature out in the wild. In my mind, this violates an implicit assumption of the netCDF data model, which is that uses of a dimension name among variables in a group have the same size. These files would be incompatible with Xarray and I suspect many other tools for working with netCDF data. As far as I can tell, it is not possible to create such netCDF files using netCDF4-Python.

DennisHeimbigner commented 3 years ago

This is incorrect. The search upwards rule only applies to the CDL representation. It is perfectly legal to use any dimension for a variable, including one arbitrarily deep in some other group. So this is why FQNs are required.