zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Vlen in specification #160

Open JamiePringle opened 1 year ago

JamiePringle commented 1 year ago

Dear Zarr community --

I am a physical oceanographer doing fairly large data work in an interdisciplinary setting. Many of the folks with whom I want to share my data use R, and thus cannot easily access Zarr. One way around this would be to have netCDF read Zarr.

I am working with the netCDF developers and have run into a roadblock with the specification of ragged arrays. The details are in this netCDF issue: https://github.com/Unidata/netcdf-c/issues/2516 . Essentially, to implement ragged arrays in Zarr, the tutorial specifies an object_codec from numcodecs. This is not currently supported in netCDF because "object_codec=numcodecs.VLenArray(int)" is only mentioned in the tutorial, not in the formal Zarr 2 specification. They are resistant to supporting mechanisms outside of the specification, and they are also worried that the particular implementation of the VLenArray() mechanism is too Python-specific.
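
For context, the tutorial recipe in question looks roughly like this (a minimal sketch using zarr-python and numcodecs; the sizes and values are illustrative):

```python
import numpy as np
import zarr
import numcodecs

# Ragged array: dtype=object plus an object_codec that serializes each
# variable-length element (here, variable-length arrays of ints).
z = zarr.empty(4, dtype=object, object_codec=numcodecs.VLenArray(int))
z[0] = np.array([1, 3, 5])
z[1] = np.array([4])
z[2] = np.array([7, 9])
z[3] = np.array([], dtype=int)
```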

I am agnostic as to how the VLenArray() mechanism is implemented. But it would be helpful if it were specified formally, so other partners could implement it in a way that is comfortable for them.

Jamie Pringle University of New Hampshire

jbms commented 1 year ago

In fact, none of the filters or codecs are described in the specification; the only reference for them is their source code.

There is potentially room for improvement over the vlen array support in zarr-python, but I think just adding support in netcdf-c for the existing vlenarray codec supported by zarr-python, using the source code as a reference, would also be reasonable.

JamiePringle commented 1 year ago

@jbms You are right -- and this is a major impediment to the netCDF folks implementing the ability to read Zarr. And this ability is important to making Zarr useful in interdisciplinary settings where just asking people to use Python is not always an option (e.g. the R community). It may not be hard for us to bop over to another language to do a little something, but for many people it is.

And you are right, the netCDF people could be less formal and just do something that matches your code. But they have their own sprawling code base, with many historical legacies already embedded in it. They feel this technical debt keenly, and don't wish to add more. For more on their worldview, see this issue: https://github.com/Unidata/netcdf-c/issues/2484

In both these cases, the key issue is the difference between what Zarr does in Python and what the specification says, especially with respect to filters.

Regardless, it would be useful to the scientific community if these two strong projects worked well together, and having a specification for the Vlen array would help make this happen.

Jamie

jbms commented 1 year ago

To be clear, zarr-python isn't my code. I develop two other zarr implementations (https://github.com/google/neuroglancer and https://github.com/google/tensorstore), so I'm definitely interested in interoperability of implementations. (Note that neither of those implementations currently supports vlenarray.)

I certainly think it would be helpful to write up specifications for all of the codecs and filters, and if you want you could start with vlenarray.

joshmoore commented 1 year ago

I'd certainly also support capturing some of the Python-specific implementation details within the v2 spec. We'll likely need to add the appropriate caveats and explain the historical situation. What we don't want to do is derail the v3 effort by re-opening discussions around the v2 spec (as I did with dimension_separator 😔)

JamiePringle commented 1 year ago

I am in this for the long game -- if you can re-assure the folks from netCDF (who, as I understand it, are participating in the v3 discussion?) that this will be in v3, then that might be sufficient. I want to move to a world where my R-using colleagues can use my output, and where I don't always have to translate my output to HDF or the like to share it.

@WardF is tagged on this so he sees this discussion.

Thanks Jamie

joshmoore commented 1 year ago

if you can re-assure the folks from netCDF...

I certainly understand their desire to have a spec rather than depending on the code.

I am in this for the long game

Ditto :smile:

I want to move to a world where my R-using colleagues can use my output

:heart:

jbms commented 1 year ago

It certainly could be nice to support in v3, but it has not really been discussed yet in the context of v3. In v2 it is represented with a dtype of "O" (Python object) and a particular filter. Conceptually this isn't great for cross-language compatibility, but I don't think it poses any particular implementation difficulties. In v3 I would like to avoid having the equivalent of dtype "O" (in practice for languages other than Python it just means "do something special here") -- instead it would make more sense to specifically standardize variable-length arrays as data types, but producing a reasonable specification there would surely take more work.
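
For concreteness, here is roughly what the v2 array metadata (.zarray) looks like in that representation as written by zarr-python (an illustrative sketch; the shape, chunking, and inner dtype are made up):

```json
{
    "zarr_format": 2,
    "shape": [10],
    "chunks": [10],
    "dtype": "|O",
    "filters": [{"id": "vlen-array", "dtype": "<i4"}],
    "compressor": null,
    "fill_value": null,
    "order": "C"
}
```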

WardF commented 1 year ago

I agree that adding it to v3 would be better than changing the published v2 spec. More broadly, I'd suggest that any changes to a published spec be versioned, e.g. 'v2.1', 'v2.2', etc. Given that the spec provides a roadmap for independent, Zarr-compliant implementations, it would be logistically difficult for software that claims to be 'v2 compliant' to suddenly not be.

This is an issue we've encountered in the past with netCDF, and have had to adopt additional add-on style specs to the 'classic' netCDF data model/spec, as new functionality was added (such as CDF5 support).

jbms commented 1 year ago

None of the codecs/filters (and vlenarray is a filter) are described in the v2 spec; they are all "optionally supported".

DennisHeimbigner commented 1 year ago

If we were to implement Vlen then I believe that in effect we are extending the V2 spec. That means that the implementation has to be interoperable with all other Zarr V2 implementations. Since Vlen uses python pickling (sp?) that implies we need some kind of specification of that as well.

jbms commented 1 year ago

If we were to implement Vlen then I believe that in effect we are extending the V2 spec. That means that the implementation has to be interoperable with all other Zarr V2 implementations. Since Vlen uses python pickling (sp?) that implies we need some kind of specification of that as well.

While some zarr-python codecs do unfortunately use pickling (and their mere existence means zarr-python is not safe to use with untrusted data), fortunately vlenarray does not use pickling:

https://github.com/zarr-developers/numcodecs/blob/a3e05fd36f2cd7a2c903c4d5d9f787961a7a5905/numcodecs/vlen.pyx#L309

I don't believe any implementations beyond zarr-python support vlenarray, so I don't think you would need to worry about any quirks of other existing implementations.

I don't deny that writing a spec is a good idea, and more important for vlenarray than e.g. the gzip or blosc codecs, but there is kind of a continuum along which we have:

joshmoore commented 1 year ago

To @WardF's and @DennisHeimbigner's valid points, I don't think I can really comment on whether the cost/benefit ratio is sufficient for netcdf-c to implement. I do know, though, that v2 lives in an odd state where much was left unsaid in favor of using zarr-python as the de facto specification, which has led to a good deal of data in the wild that won't be openable by other implementations.

My suggestion would be that we work on capturing (a) an explanation of this situation that we can all live with (most likely using terms like "optional feature" or "extensions"), and (b) at least what we can definitely say about the data that is in the wild.

JamiePringle commented 1 year ago

As the one who kicked this discussion off, let me provide the user's perspective. As stated above, I want to share my Zarr data with R users. The "easiest" way to do so would seem to be to use the netCDF package in R. The same would apply to any other language which supports netCDF but not Zarr (Fortran? Matlab? Julia?).

Unfortunately, if I create a Zarr output with the stock recipes in the Zarr tutorial, this does not work. By default, the Zarr created uses the Blosc compressor -- this must be explicitly turned off at the creation of the dataArray if the Zarr output is to be strictly compatible with the standard. It must be turned off for ncdump and other netCDF tools to work, at least for now.
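
For reference, turning the default compressor off looks like this in zarr-python (a minimal sketch; the path and shapes are made up):

```python
import zarr

# Disable the default Blosc compressor so chunks are stored uncompressed.
z = zarr.open("example.zarr", mode="w", shape=(100, 100),
              chunks=(10, 10), dtype="<f8", compressor=None)
z[:] = 0.0
```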

It is surprising that the default implementation of zarr in its most standard use case does not conform to the standard. Clearly, the default compression is good for the vast majority of use cases. This suggests that the standard might be the place for the change.

Regardless, it would be good for the community if it were easier for non-Python users to access all of the goodness that is Zarr. Even the Julia implementation (https://github.com/JuliaIO/Zarr.jl) does not seem to include the compressors and filters that are used by default in the Python version.

Thank you all, Jamie

jbms commented 1 year ago

I'm not sure what you mean as far as blosc not conforming to the standard -- none of the codecs are specified in the zarr v2 spec.

I'm a bit surprised that netcdf-c doesn't support blosc, since that is indeed the most common zarr codec, being the default one used by zarr-python. I think most other zarr implementations do support blosc.

d-v-b commented 1 year ago

none of the codecs are specified in the zarr v2 spec.

This is important. It is not a deficiency of the zarr spec that the Blosc compressor isn't very widespread, and so we probably shouldn't look to change the spec to solve that problem.

@JamiePringle is there any reason why a more common compressor (e.g., gzip) wouldn't work for your purposes?
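
For instance, switching compressors looks like this (a sketch assuming zarr-python with numcodecs' GZip codec; the store path is illustrative):

```python
import numcodecs
import zarr

# Swap the default Blosc compressor for gzip, which more implementations support.
z = zarr.open("gzipped.zarr", mode="w", shape=(1000,), dtype="<i4",
              compressor=numcodecs.GZip(level=5))
z[:] = 0
```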

(edit: and I can see a very good argument here for not using blosc as the default compressor in zarr-python)

DennisHeimbigner commented 1 year ago

... vlenarray does not use pickling:

Are you sure? The documentation says it is relevant and I assume that it is used to serialize the elements of the array.

jbms commented 1 year ago

... vlenarray does not use pickling:

Are you sure? The documentation says it is relevant and I assume that it is used to serialize the elements of the array.

I've never actually used that codec myself, but from looking at the implementation it appears to just be a very simple encoding of the elements, no pickling.

I think the "see also numcodecs.pickles.Pickle" in the docstring is just because a pickle representation could be used as an alternative.

DennisHeimbigner commented 1 year ago

By default, the Zarr created uses the Blosc compressor -- this must be explicitly turned off at the creation of the dataArray if the Zarr output is to be strictly compatible with the standard. It must be turned off for ncdump and other netCDF tools to work, at least for now.

Not sure I understand this. If you are using the netcdf-c library and you have c-blosc installed and filters enabled in netcdf-c, then nczarr will properly read/write blosc-compressed zarr files. Can you elaborate on what you meant by this comment?

dopplershift commented 1 year ago

The docs on ragged arrays specifically mention using numcodecs.VLenArray, which of course isn't defined in the spec. Hence back to "why specifications are one honkin' great idea"...

jbms commented 1 year ago

Here is the format used by vlen-array, for reference:

JSON metadata: {"id": "vlen-array", "dtype": <numpy-typestr>}

The <numpy-typestr> might be e.g. "<i4" to mean little endian int32.

The encoded chunk begins with num_items_in_chunk, stored as a 4-byte little-endian unsigned integer. Then, for i ranging from 0 up to num_items_in_chunk-1:

- the byte length of item i, as a 4-byte little-endian unsigned integer;
- the elements of item i, encoded contiguously using the specified dtype.

Thus the total encoded size is 4 + num_items_in_chunk * 4 + total_number_of_inner_elements * size_of_inner_element.
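
To make that layout concrete, here is a minimal decoding sketch in Python (an illustration based on the description above; the function name is hypothetical):

```python
import struct
import numpy as np

def decode_vlen_array(buf: bytes, dtype: str = "<i4") -> list:
    """Decode a vlen-array chunk per the layout above: a 4-byte little-endian
    item count, then for each item a 4-byte little-endian byte length
    followed by that many bytes of raw elements in the given dtype."""
    (num_items,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    itemsize = np.dtype(dtype).itemsize
    items = []
    for _ in range(num_items):
        (nbytes,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(np.frombuffer(buf, dtype=dtype,
                                   count=nbytes // itemsize, offset=offset))
        offset += nbytes
    return items
```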

JamiePringle commented 1 year ago

@DennisHeimbigner apologies; yes, if netCDF is compiled correctly, it can handle blosc. For the rest of you: the most common distributions of netcdf-c (i.e. Conda, MacPorts) were not configured to enable netcdf-c to read blosc, but it can be enabled with an appropriate installation. See the following discussion: https://github.com/Unidata/netcdf-c/issues/2484 . The issues presented in that thread are relevant to the discussion above. I think @WardF is working on this with the distributions.

@d-v-b: indeed, I will try gzip next for my non-ragged arrays and see if stock netCDF as provided by the usual providers (Conda/MacPorts/Ubuntu) can handle it. That is on my list for next week -- I was starting to do it before I discovered that, even for uncompressed data, ragged arrays in Zarr could not be read by netCDF.

Ultimately, though, it would be best if a researcher who produced Zarr with the defaults discussed in the tutorial, or with the convenience functions like save() and load() with their defaults, produced data that could be read by netCDF.

I will try zip soon.

again, thank you all, Jamie

meggart commented 1 year ago

Since Zarr.jl was mentioned here: the current status is that we support Blosc and Zlib as compressors, and at least I have not stumbled upon a dataset in the wild where this was not sufficient, but of course we could extend this as needed. Regarding filters (in v2 terms): we have implemented vlen-arrays, and there is a dangling PR for vlen-utf8 which I didn't manage to wrap up so far.

I agree that the situation regarding the vlen arrays was not ideal because of the missing spec; I basically did a mix of translating Python (actually Cython) code and trial-and-error inspecting bytes in an example file I created with zarr-python. In general, though, this was not too complicated, and one can implement both encoding and decoding together in less than 30 LOC: https://github.com/JuliaIO/Zarr.jl/blob/a57ccb2c6bdd25f6dc413d6bed8bd71c5ed36e68/src/Filters.jl#L28-L52

So I do agree that documenting a spec for vlen arrays would have helped, but we should not make this a v2 discussion; rather, let's move on and make it a proper spec extension for v3. Feel free to have a look at the Julia code linked above; it reads a bit easier (at least for me) than the Python implementation.