Pure C implementation - Githubissues

jakirkham commented 6 years ago

Raising based on feedback in a number of different issues (xref'd below). The suggestion is to implement a pure C Zarr library. There are some questions that range from where the code should live all the way down to details about the code (licensing, build system, binary package availability, etc.). Others include scope, dependencies, etc.. Am raising here to provide a forum focused on discussing these.

xref: https://github.com/zarr-developers/zarr/issues/285 xref: https://github.com/zarr-developers/zarr/pull/276 xref: https://github.com/constantinpape/z5/issues/68

Update: https://github.com/zarr-developers/community/issues/9#issuecomment-843924311 as of 4.8.0 the NetCDF-C library does have support for Zarr: https://twitter.com/zarr_dev/status/1379875020059635713

rabernat commented 6 years ago

It will be great to have a pure C Zarr library. I really hope to see this happen.

Here is a conceptual design issue that I have not sorted out. Here is how the Zarr spec defines a storage system:

A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string and a value is an arbitrary sequence of bytes, and the supported operations are read (get the sequence of bytes associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair).

This is fairly language agnostic. But in practice, the idea of a "key/value interface" is closely coupled to python's MutableMapping interface. This is how all Zarr stores are implemented today from Python. I'm curious about how people imagine implementing Zarr in languages such as C that do not have this sort of abstraction.

jakirkham commented 6 years ago

Thanks @rabernat. Great point.

There are a number of ways we could imagine doing this. One might be to use a plugin architecture. This allows users to define their own particular mapping with some way to initialize, register, and destroy each plugin. Each plugin then can provide their own functions for standard operations on their specified storage type (opening/closing a storage format, getting/setting/deleting a key, etc.).

To aggregate things a bit, we can imagine that a generic struct for storage type instances for any plugin, which is defined to hold some generic values (e.g. keys, plugin ID, etc.) and some additional untyped (void*) space for other implementation specific details. Dispatching particular operations through a specific plugin could use the plugin ID to find the plugin and perform the operation. Alternatively the struct could contain function pointers to the specific plugin API functions directly.

Though there are other options. Would be interesting to hear what others think about how best to implement this.

Edit: Here's a great article that goes into more detail about plugin architecture. It also discusses C++ though the basic functionality can be emulated in C without C++.

martindurant commented 6 years ago

May be worth researching what tileDB does, which apparently can talk to various cloud back-ends also.

aparamon commented 6 years ago

Thanks for extracting this issue!

One thing not to be overlooked is memory management. It is important that plugins never have to free memory allocated inside Zarr lib, and vice versa. While usually malloc(), realloc(), and free() do operate on the same process-global heap, some legacy systems (Windows) make transferring pointer ownership through dll bounds totally unreliable.

HDF5 solution is simple but flawed: plugins making use of H5allocate_memory(), H5resize_memory(), and H5free_memory() have to be linked against HDF5 library; but the actual name/location of run-time HDF5 lib, specifically on the most problematic systems (Windows), is never reliably known in advance. (Basically, for the same reasons as why there is no system-wide glibc). So, pre-built HDF5 compression plugins are of little use, as they are still only applicable to specific HDF5 instance, and not shareable between e.g. HDFView, HDF5 tools, client applications etc.

Please, do not require Zarr plugins to link to Zarr C lib!

What is the best way to correct design is yet to be discovered, but one could e.g. pass pointer to free() as a part of plugin API struct.

Another HDF5 C API problem that was at some point considered serious, but proved a non-issue practically: C struct layout is in fact dependent on the compiler/compiler options. In reality, default struct ABI is compatible throughout all popular compilers (gcc, Clang, MSVC), and even more exotic beasts like Embarcadero Delphi (where structs are called records). Still, it wouldn't harm if Zarr struct layout/ABI is completely specified wherever structs are used in API publicly.

rabernat commented 6 years ago

The folks at quantstack have been working very hard at writing numerical array stuff in C++, xtensor for example. They might have some good ideas about how to approach this. They might even want to work on it! ;)

Tagging @SylvainCorlay and @maartenbreddels to see if they want to weigh in.

SylvainCorlay commented 6 years ago

Hi @rabernat thanks for the heads up!

Looking at zarr, an xtensor-based implementation of the spec seems like a very natural thing to do. xtensor expressions can be backed by any data structure or in-memory representations (or even filesystem operations or database calls). We would love to engage with you on that.

It turns out that there already exists a project involving xtensor and zarr by @constantinpape called z5py. It is not from the xtensor core team but Constantin has been engaging with us on gitter and in GitHub issues and PRs.

It could be interesting to discuss this in greater depth!

We are holding a public developer meeting for xtensor and related projects this Wednesday (5pm Paris time, 11am EST, 8am PST). Would you like to join us?

cc @wolfv @JohanMabille

constantinpape commented 6 years ago

Hey,

just to chime in here; as @SylvainCorlay mentioned, z5 / z5py (https://github.com/constantinpape/z5) is using xtensor as multiarray (and for python bindings). z5 implements (most of) the zarr spec in C++ and z5py provides python bindings to it.

We also have an issue on C bindings (see https://github.com/constantinpape/z5/issues/68) with some details discussed already.

If you are considering to base the zarr C bindings on this:

I would not mind handing ownership to some other organization (zarr_developers or something else) if this is a concern (as long as I still have contributor rights)
I don't have much time right now to contribute a lot to implementing the C bindings, but I would happily assist anyone who would tackle this.

SylvainCorlay commented 6 years ago

@constantinpape @jakirkham is this something you would like to discuss at the xtensor meeting?

constantinpape commented 6 years ago

@SylvainCorlay Unfortunately, I cannot make it today. I am organising a course next week and have a lot to do setting things up. Will have more time again in November and can spend a bit more time on these things then.

SylvainCorlay commented 6 years ago

@constantinpape we should arange a skype meeting. I would love to dive into the internals of z5!

constantinpape commented 6 years ago

Sounds great! I have a pretty packed schedule for the rest of this week and next week, but I might be able to squeeze something in. I will contact you on gitter later and we can discuss details then.

rabernat commented 6 years ago

Forgive me for asking an ignorant question...

For our purposes here, is C++ the same a C? Z5 already implements zarr in C++. Is an additional C implementation still necessary?

aparamon commented 6 years ago

It's Ok as long as the public interface is declared with extern "C" specifiers: such a library would be universally usable from Lisp, Pascal, Go etc. External C++ API is not universally usable.

dopplershift commented 6 years ago

Another point: a pure C implementation only needs a C compiler, not a C++. This might be a consideration for simplifying the build environment library clients. I’m thinking specifically about netcdf-c (ping @WardF)

jakirkham commented 6 years ago

Thanks all for joining the conversation. Sorry for being slow to reply (was out sick most of last week).

Part of the reason I raised this issue is feedback we got from PR ( https://github.com/zarr-developers/zarr/pull/276 ) specifically this comment (though there may have been other comments that I have forgotten), which suggested that we need to supply a pure C implementation of the spec (not using C++). Without knowing the details about the motivation, I can't comment on that (hopefully others can fill this in).

Though there are certainly a few reasons I can think of that one would want a pure C implementation. These could vary from providing an easily accessible API/ABI, working on embedded systems, interfacing with other libraries, portability, easy to build language bindings, organizational requirements, etc.

Now there is certainly a lot of value in having a C++ implementation (and a very effective one at that), which I thinks @constantinpape and others have demonstrated. The C++ ecosystem is very rich and diverse; making it a great place to explore a spec like Zarr.

Certainly it is possible to wrap a C++ library and use extern C, which we were discussing in issue ( https://github.com/constantinpape/z5/issues/68 ). Though it sounds like that doesn't meet the needs of all of our community members. If that's the case, going with C should address this. Not to mention it would be easy for various new and existing implementations to build off of the C implementation either to quickly bootstrap themselves or lighten their existing maintenance burden. So overall I see this as a win for the community.

WardF commented 6 years ago

@dopplershift and others are correct, for this to be something usable for netCDF, a pure C library is required. I suspect there are other projects out there which would benefit as well. The C++ interface is fantastic but wouldn't work for our needs.

jakirkham commented 6 years ago

Do you have any thoughts on the technical design of this implementation, @WardF? Was thinking about using plugins to handle different MutableMapping-style implementations at the C-level. The same could probably be used for handing Codecs for compression/decompression. Does that sound reasonable or do you have different ideas on the direction we should go?

WardF commented 6 years ago

No, but after discussion with @dennisheimbigner I think we need to sketch out what is needed from the netCDF point of view, infer what would be needed from a more broad point of view, and see what the intersection is. I'll review the plugins link/guide that you linked to, thanks! I hadn't had a chance to review it yet. The focus of discussion on on our end internally have been around the intersection of the data model and an I/O API; I'll let Dennis make his own points here, but he has pointed out several things I hadn't previously considered.

To answer your question, however, plugins do seem like a reasonable approach. In terms of using Codecs for compression/decompression, I infer it would be something similar to how HDF5 uses compression/decompression plugins for compressed I/O? That would make sense in broad terms; there are considerations in regards to the netCDF community and what 'core' compression schemes to support, but from a technical standpoint it seems like a reasonable, path-of-lesser-resistance approach.

I will follow up with any additional thoughts on the technical design as they occur to me :).

DennisHeimbigner commented 6 years ago

With respect to compression plugins. I would push for using the actual HDF5 filter plugin mechanism. That way we automatically gain access to a wide variety of compressors currently supporting HDF5. The HDF5 compression API is pretty generic. A (de-)compressor assumes that it is given a buffer of data and returns a buffer containing the (de-) compressed data.

DennisHeimbigner commented 6 years ago

A question. I hear the term "IO API", but I cannot figure out what that term means. Can someone elaborate?

aparamon commented 6 years ago

With respect to compression plugins. I would push for using the actual HDF5 filter plugin mechanism. That way we automatically gain access to a wide variety of compressors currently supporting HDF5. The HDF5 compression API is pretty generic. A (de-)compressor assumes that it is given a buffer of data and returns a buffer containing the (de-) compressed data.

It is one appealing possibility, but please also be aware of HDF5 compression API drawbacks.

DennisHeimbigner commented 6 years ago

I fail to see this as much of a problem from the netcdf-c point of view. If we build without hdf5, then we need to build in replacements for H5allocate_memory(), H5resize_memory(), and H5free_memory(), which seems easy enough. Also, I do not understand this comment: "Please, do not require Zarr plugins to link to Zarr C lib" Since this is a question of dynamic libraries, I do not see the difficulties.

aparamon commented 6 years ago

@DennisHeimbigner Ah, it means you are lucky enough to not possess the peculiarities/ugliness of "Windows way" :-)

Let's have HDF5 as the example. On GNU/Linux filter plugins are easy enough: you build them against the /usr/lib/libhdf5.so, drop the binary to HDF5_PLUGIN_PATH, and all the software (h5cat, HDFView, your applications) starts immediately to understand your compressed data. The plugin is universal: one binary fits all apps.

On my Windows laptop, I now have:

C:\Program Files (x86)\HDF_Group\HDF5\1.8.13\bin\hdf5.dll (32-bit, 1.8)
C:\Program Files (x86)\HDF_Group\HDF5\1.8.20\bin\hdf5.dll (32-bit, 1.8)
C:\Program Files\HDF_Group\HDF5\1.8.20\bin\hdf5.dll (64-bit, 1.8)
C:\Program Files\HDF_Group\HDFView\3.0.0\lib\hdf5_java.dll (64-bit, ???)
C:\Program Files\LLNL\VisIt 2.13.1\hdf5.dll (64-bit, ???)
C:\ProgramData\Anaconda3\Lib\site-packages\h5py\hdf5.dll (64-bit, 1.10)
C:\ACD 2017\hdf5.dll (32-bit, 1.8)
C:\ACD 2017\hdf5-1.10.dll (32-bit, 1.10) ... and the variety is only bounded by software authors' imagination.

To what dll shall I link my plugin? Any answer will be sub-optimal, as it will reliably teach only that one program to understand your compressed data (and for C:\ACD 2017 it's even worse: note the two HDF5 dll flavors actually used from the same process!).

This problem would not be present if plugin dlls didn't link to HDF5 dll.

WardF commented 6 years ago

@aparamon Thank you for the illustrative example; this is something we will have to keep in mind, given the crossplatform nature of netCDF.

jakirkham commented 6 years ago

What if we had an environment variable that acted as a plugin search path? Think all the major platforms have some C API for loading libraries at runtime.

DennisHeimbigner commented 6 years ago

I see. You are correct. From the point of view of zarr embedded in netcdf-c we only need to compile the filter against netcdf-c library.

DennisHeimbigner commented 6 years ago

I think HDF5 has such an variable named something like HDF5_PLUGIN_PATH? What I wish was the case was that the HDF5 filter struct like below had field(s) for memory allocation callbacks. const H5Z_class2_t H5Z_BZIP2[1] = {{ H5Z_CLASS_T_VERS, / H5Z_class_t version / (H5Z_filter_t)H5Z_FILTER_BZIP2, / Filter id number / 1, / encoder_present flag (set to true) / 1, / decoder_present flag (set to true) / "bzip2", / Filter name for debugging / (H5Z_can_apply_func_t)H5Z_bzip2_can_apply, / The "can apply" callback / NULL, / The "set local" callback / (H5Z_func_t)H5Z_filter_bzip2, / The actual filter function / }};

aparamon commented 6 years ago

@DennisHeimbigner You are correct, HDF5 has HDF5_PLUGIN_PATH (link), and additionally H5PLprepend, H5PLappend, H5PLinsert etc in the later versions. That part works good.

Having memory allocation callbacks in H5Z_class_t doesn't seem to help, because multiple allocations/deallocations may be required during decompression of a single chunk (filter pipelines). The next filter must be able to realloc()/free() memory malloc()ed by previous filters. Instead, the library could provide its "universal" allocation callbacks into every H5Z_func_t call. Hopefully, the re-allocations are rare, so performance will not suffer much from inability to inline malloc()/realloc()/free().

Please note that optional (and rarely used) H5Z_can_apply_func_t, H5Z_set_local_func_t procedures suffer from the same principal architectural drawback: they require to back-link to proper library in order to make use of dcpl_id, type_id, space_id. On Windows, that just doesn't reliably work.

The mutual-reference architecture used in HDF5 seems rather un-elegant, but on *NIX systems it works fine almost always, due to single instance of libhdf5.so typically present in the system. On Windows -- no, it's never the case. More elegant architecture would be only to link to plugins from the library and not vice versa. The library should directly tell the plugin all required information without the need for additional call-backs.

The question is whether proper Windows support is worth the pain? As a one data point, for my company it would still be desirable, but much less so than say 3..5 years earlier.

DennisHeimbigner commented 6 years ago

You are correct, I had it backwards. The filter needs to be given a table of memory allocation/free functions to use. As for windows support. For us (netcdf) this is almost essential since we have a fair number of windows users and I would be very reluctant to cut them out.

DennisHeimbigner commented 6 years ago

It appears to be the case that many (most) contributed HDF5 filters are on github in source form. So if we are forced to rebuild them, and we assume we can have some kind of wrapper for them if we need it, then the question might be: what is the minimal set of source changes we need to make to a filter's source to solve the memory allocation problem. Also, is there a wrapper that would help this process?

aparamon commented 6 years ago

@DennisHeimbigner As for existing HDF5 compression filters, I'm not sure there is an efficient remedy, as the principal architecture/interface is flawed. I.e., it's possible to go from malloc()/realloc()/free() calls to H5allocate_memory()/H5resize_memory()/H5free_memory(); but on Windows where it's really important we just exchange the problem of finding correct MSVC runtime for the problem of finding the HDF5 library, -- which is equivalently ill-posed. (That's why I'm hesitant to fix my own report https://github.com/aparamon/HDF5Plugin-Zstandard/issues/2 this way.)

Re-using existing numerous HDF5 filters is a valuable benefit, albeit my experience suggests that wrapping a compression algorithm with a filter is not dramatically hard. The more robust, elegant interface seems a bigger long-term win. (Disclaimer: that's a personal opinion not a fact.)

Also, in my initial comment I considered a broader class of plugins, including possibly storage implementations. Memory management issue should not be overlooked for those too!

jakirkham commented 6 years ago

Agree adding some cookie cutter C code for people to wrap up HDF5 compression filters would be a good thing to have around.

Personally was thinking initial compression support could just be wrapping up Blosc (much as the Python Zarr does) as there is a C implementation and it has a way to manage multiple compressors that have been useful so far. Though there are also libraries like Squash that provide a host of compression algorithms as well as it's own plugin system, which could be interesting to explore.

Though would think compression should be our second priority. An initial implementation that is able to simply write to disk using a primitive key-value store plugin system could really bolster our confidence and provide something nice to iterate on.

liangwang0734 commented 5 years ago

HDF5 C++ has awkward syntax. I hope Zarr C/C++ will have a nicer interface for basic operations, hopefully thread-safe.

I'm also a little concerned about choosing c over c++. It might introduce unnecessary complexity IF zarr c wants to mimic some object-oriented paradigm.

jakirkham commented 5 years ago

There is a C++ implementation already, z5, which you might be interested in, @liangwang0734.

DennisHeimbigner commented 5 years ago

I am currently working on the insertion of C-Language Zarr support into Unidata's netcdf-c library. The goal is to provide an extended Zarr that will more closely support the existing netcdf-4 (aka netcdf enhanced) data model.

We also intend to support two forms of interoperability with existing Zarr data and implementations.

An existing Zarr implementation can read an NCZarr dataset.
NCZarr can provide read-only access to existing standard Zarr datasets.

I have created a draft document that is an early description of the extensions of Zarr to what I am calling NCZarr. Ideally, this could eventually become some kind of official standard Zarr extension. It also describes how we propose to represent existing Zarr datasets (our second goal).

The document is currently kept here: https://github.com/Unidata/netcdf-c/blob/cloud.tmp/docs/nczextend.md It is currently in Doxygen markdown format, so there may be minor display glitches depending on your viewer.

jakirkham commented 5 years ago

Thanks for the update, @DennisHeimbigner. Will mull over this a bit. Have a few questions to start us off.

Could you please share briefly in what ways the NCZarr spec differs (a very rough overview/executive summary here is fine as people can go read the spec for details)? Do these changes overlap with what other groups are looking for (e.g. https://github.com/zarr-developers/zarr/issues/333 and https://github.com/NeurodataWithoutBorders/pynwb/issues/230 or others)? Were there specific pain points you encountered and/or areas, in which you were hoping Zarr could grow? It may even make sense to break these out into a series of GitHub issues that we can discuss independently. Though feel free to add them here first if that is easiest.

Also would be good to hear a bit more about how you are handling things like different key-value stores and compression algorithms. Are users free to bring their own and if so how? Will there be any of either that are preincluded?

DennisHeimbigner commented 5 years ago

Could you please share briefly in what ways the NCZarr spec differs (a very rough overview/executive summary here is fine as people can go read the spec for details)?

I guess I had hoped this document would serve as that diff document :-) I have additional documents giving additional characterizations of NCZarr, but they are not ready for prime time yet.

Do these changes overlap with what other groups are looking for (e.g. #333 and NeurodataWithoutBorders/pynwb#230 or others)?

Frankly I do not know in detail. My current goal is to get as close to the netcdf-4 data model as I can while maintaining a large degree of interoperability with the existing Zarr v2 spec. Analysis and comparison with the other proposed extensions is important, but probably should be a separate document.

Were there specific pain points you encountered and/or areas, in which you were hoping Zarr could grow?

There are two such "pain points"

The Zarr spec does not conform the "write narrowly, read broadly" heuristic in that it says that any annotations not specified in the Zarr spec are prohibited. It preferably should say that unrecognized keys/objects/etc should be ignored.
From the Unidata point of view, the inability to represent variable length items is a significant problem. My discussion of handling of variable length strings shows, I think, the difficulties.

It may even make sense to break these out into a series of GitHub issues that we can discuss independently. Though feel free to add them here first if that is easiest.

I considered starting a new issue, but do not want to pollute the issue space too much.

Also would be good to hear a bit more about how you are handling things like different key-value stores and compression algorithms. Are users free to bring their own and if so how? Will there be any of either that are preincluded?

I have separate internal architecture documents where I am describing how I propose to deal with those issues. But roughly, we emulate the existing Zarr implementation in providing an internal API that separates the key-value store from the core NCZarr code. I am currently basing it loosely on the Python MutableMapping API.

In one of these issues, we discussed the Filter problem. My current thinking is to provide an architecture similar to that provided by HDF5. There are known problems with this, so I expect that we will need to provide some extended form of the HDF5 approach. However, I have as a goal the ability to use existing HDF5 filters without change (this may not be possible).

alimanfoo commented 5 years ago

Thanks Dennis.

Just wanted to mention that zarr does support variable length strings, as well as variable length sequences of atomic types. See the sections on string arrays and object arrays in the tutorial.

For variable length string arrays I would recommend using the VLenUTF8 encoding, as it should be simplest to implement in plain C. IIRC it is basically the same as parquet encoding.

On Fri, 11 Jan 2019, 20:42 Dennis Heimbigner <notifications@github.com wrote:

I am currently working on the insertion of C-Language Zarr support into Unidata's netcdf-c library. The goal is to provide an extended Zarr that will more closely support the existing netcdf-4 (aka netcdf enhanced) data model.

We also intend to support two forms of interoperability with existing Zarr data and implementations.

An existing Zarr implementation can read an NCZarr dataset.

NCZarr can provide read-only access to existing standard Zarr datasets.

I have created a draft document that is an early description of the extensions of Zarr to what I am calling NCZarr. Ideally, this could eventually become some kind of official standard Zarr extension. It also describes how we propose to represent existing Zarr datasets (our second goal).

The document is currently kept here: https://github.com/Unidata/netcdf-c/blob/cloud.tmp/docs/nczextend.md It is currently in Doxygen markdown format, so there may be minor display glitches depending on your viewer.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/zarr-developers/zarr/issues/317#issuecomment-453650594, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QsuHrf39Ta7ExzdYZY9kxwHpN5hwks5vCPc4gaJpZM4XztI9 .

alimanfoo commented 5 years ago

The Zarr spec does not conform the "write narrowly, read broadly" heuristic in that it says that any annotations not specified in the Zarr spec are prohibited. It preferably should say that unrecognized keys/objects/etc should be ignored.

Apologies I haven't yet read the NCZarr spec carefully, sorry if I've misread anything, but I though it might be worth a couple of comments here.

For an array at logical path "foo/bar", core metadata for the array are stored as a JSON document under the key "foo/bar/.zarray", and attributes are stored as a separate JSON document under the key "foo/bar/.zattrs".

As an extension developer, you are free to use the attributes (i.e. the .zattrs JSON document) however you like. You can put whatever you like in there. This is the natural place to put any netcdf information, in an analogous way to how netcdf4 uses hdf5 attributes.

Similarly for the group at logical path "foo", there is a JSON document under the key "foo/.zgroup" for core metadata, and another JSON document under "foo/.zattrs" for attributes. Again, you are free to use the attributes (.zattrs) document however you like.

I.e., the Zarr spec constrains the contents of .zarray and .zgroup documents, but does not place any constraint on .zattrs. So you should be able to do whatever you need with .zattrs, without requiring a spec change.

Also I saw your doc proposes putting some extra metadata in separate JSON documents under other keys like .zdims and .ztypedefs. There is nothing wrong with this, perfectly legal according to the standard zarr spec as is, these keys would just get ignored by a standard zarr implementation. However you could put everything into .zattrs, in which case it will be visible as attributes via standard zarr APIs. I would suggest using .zattrs unless you have a really strong reason not to.

Hope that makes sense, please feel free to follow up if anything isn't immediately clear.

DennisHeimbigner commented 5 years ago

I must disagree. I am going from the spec. I consider the tutorial irrelevant.

these keys would just get ignored by a standard zarr implementation Again, I must disagree. As I read the spec. this is currently specifically disallowed.

I have considered putting some of my extensions as attributes, so that is still an open question. are not attributes in the netcdf-4 sense. T

dopplershift commented 5 years ago

@alimanfoo Is there room to codify some of those items within the spec? Or is there something being misunderstood here?

DennisHeimbigner commented 5 years ago

Well we are all feeling our way on this because we are in a complex design space, so nothing is set in stone.

jakirkham commented 5 years ago

Sorry @DennisHeimbigner I must have missed it. What was the disagreement?

rabernat commented 5 years ago

This NCZarr spec seems like a great development! Surely lots of people will have ideas and opinions (pinging @shoyer and @jhamman who have previously weighed in on this, e.g. DOC: zarr spec v3: adds optional dimensions and the "netZDF" format #276)

Regarding shared dimensions, we have already done something with this in an ad-hoc way in xarray. We chose to add an attributed called .ARRAY_DIMENSIONS to zarr arrays, which just lists the dimensions as a list (e.g. ['time', 'y', 'x']). This was all we needed get zarr to work basically like netcdf (as far as xarray is concerned). https://github.com/pydata/xarray/blob/master/xarray/backends/zarr.py#L14

I know this isn't part of any spec. We just did it so we could get xarray to work with zarr. It may well turn out that this data can't be read properly by NCZarr (the price we pay for moving forward without a spec), but I thought I would at least mention this in hopes of some backwards compatibility.

aparamon commented 5 years ago

Thank you @DennisHeimbigner for bringing it up!

Do I understand correctly that in order to locate a type definition, client is expected to walk up the directory/group structure parsing .ztypedefs (if present) there, until the required type definition is found? If so, the design seems flexible and reasonable, although not clear to which extent compatible with Zarr logical paths (see N.B.).

From reading Zarr spec it seems that unlike unknown attributes, unknown files (e.g. .ztypedefs) at any dir are legal. Could Zarr devs please confirm?

alimanfoo commented 5 years ago

I must disagree. I am going from the spec. I consider the tutorial irrelevant.

I'm guessing this is regarding storage of variable length strings (and other objects). The zarr Python implementation supports an object ('O') data type, but going back to the spec I see this is not mentioned anywhere. Apologies, this is an omission. The spec should state that an object ('O') data type is supported, and that this should be used for all variable length data types.

When encoding an array with an object data type, there are various options for how objects are encoded. If objects are strings then my recommendation would be to use the VLenUTF8 encoding, defined in numcodecs.

these keys would just get ignored by a standard zarr implementation

Again, I must disagree. As I read the spec. this is currently specifically disallowed.

I think the spec is reasonably clear about the fact that, within the .zarray metadata object, only certain keys are allowed. However, within the .zattrs metadata object you can use any key you like. And within the store, you can store other data under other keys like .zdims or whatever. However, I would still encourage using .zattrs for any extension metadata. This is what the xarray to_zarr() function does.

Sorry for brevity, happy to expand on any of this if helpful.

alimanfoo commented 5 years ago

I think the spec is reasonably clear about the fact that, within the .zarray metadata object, only certain keys are allowed. However, within the .zattrs metadata object you can use any key you like. And within the store, you can store other data under other keys like .zdims or whatever. However, I would still encourage using .zattrs for any extension metadata. This is what the xarray to_zarr() function does.

So just to be clear, the word "key" here is being used in two different ways. There are keys within the .zarray metadata objects. And there are keys that are used to store and retrieve data from the store.

alimanfoo commented 5 years ago

From reading Zarr spec it seems that unlike unknown attributes, unknown files (e.g. .ztypedefs) at any dir are legal. Could Zarr devs please confirm?

The zarr spec constrains what keys you are allowed to use within .zarray and .zgroup metadata objects. But you can use any key you like within .zattrs metadata objects.

And you can store other objects using store keys like ".ztypedefs", i.e., ".zarray" and ".zgroup" are reserved store keys, and you don't want to clash with chunk keys, but otherwise you can store data under any key you like. Although I would still recommend using .zattrs for metadata wherever possible.

Hth.

shoyer commented 5 years ago

I have also been confused about how zarr supports variable length strings. I know the Python library can do it, but how such data is stored is not at all clear from reading the spec alone.

shoyer commented 5 years ago

Similarly to what I wrote in #276, I would prefer for both dimensions and named type definitions to be standardized for zarr, as optional metadata fields. NetCDF imposes some high level consistency requirements, but at least dimension names (without consistency requirements) are universal enough that they could be a first class part of zarr's data model. Potentially these specs could be layered, e.g.,

Base zarr spec
Standard zarr extensions: a. dimensions spec b. named types spec
NetCDF spec (1+2)

The reason why I'm advocating for laying these specs is that I think there's significant value in standardizing additional optional metadata fields. I think it would be much harder to convince non-geoscience users to use a full netCDF spec, but named dimensions alone would make their data much more self-described.

We don't necessarily need to store these fields in .zarray, but I do like keeping them out of .zattrs to avoid name conflicts. We could also officially reserve some names in .zattrs for spec specific metadata, e.g., all names starting with .. The convention might be that names starting with . are "hidden" attributes, and should match the name of specification, e.g., we'd use .netcdf for netcdf specific metadata.

The Zarr spec does not conform the "write narrowly, read broadly" heuristic in that it says that any annotations not specified in the Zarr spec are prohibited. It preferably should say that unrecognized keys/objects/etc should be ignored.

I agree that the Zarr spec should be updated in this way. In practice, it's hard for me to imagine a sensible implementation not adhering to this spec, but I still think it's good not to preclude the possibility of future (backwards compatible) extensions.

From the Unidata point of view, the inability to represent variable length items is a significant problem. My discussion of handling of variable length strings shows, I think, the difficulties.

Agreed that this is important. Given that we'll need to increment the Zarr spec to include discussion of the O dtype, this may be a good time to address other technically backwards incompatible changes (such as explicitly stating that unrecognized keys should be ignored).

zarr-developers / community

Pure C implementation #9