zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International
87 stars 28 forks source link

V3: node_type considered harmful #269

Open DennisHeimbigner opened 11 months ago

DennisHeimbigner commented 11 months ago

I believe that putting the node type (group or array) into the zarr.info object is not a good idea. In Zarr v2, the node type was encoded in the object name: .zgroup or .zarray. Hence it is possible to identify all the groups and all the arrays without having to read any object. With Zarr V3, you can only identify the groups and arrays by reading each occurrence of the zarr.info object. This is potentially a significant performance hit. I would propose a change to something like "zgroup.json" and "zarray.json".

rabernat commented 11 months ago

Dennis, it's great to finally have your feedback. This issue was discussed extensively nearly a year ago, see https://github.com/zarr-developers/zarr-specs/issues/177 There was a strong rationale for doing it this way (avoiding a race condition with two concurrent writers trying to create a group and array of the same name). But there were also many (including myself), who favored your position.

You're very active today with many great suggestions. Unfortunately, the window for major changes to the V3 spec has passed. On May 8, @WardF voted yes on ZEP-1 on behalf of Unidata: https://github.com/zarr-developers/zarr-specs/issues/227#issuecomment-1536541942.

Now that you are actually trying to implement Zarr V3, you are finding many problems with the spec. I think this is very natural and to be expected. I feel we got the sequencing wrong by first writing this massive spec without really trying to implement it. But if we stick to the process we have defined for ourselves, the comment period is over.

I'd welcome thoughts on how to move forward in a productive way.

jbms commented 11 months ago

A key point that came up in the discussions was that typically when listing the contents of a zarr hierarchy, you will most likely want more information than just whether it is a group or an array, such as the shapes and data types of any arrays, and would need to read the metadata anyway.

DennisHeimbigner commented 11 months ago

"When listing the contents of a zarr hierachy..." Agreed, But one use case we have is lazy loading of arrays. In that case, we need to locate where the arrays are located and then read the zarr.info only when the array is accessed.

DennisHeimbigner commented 11 months ago

I am in the process of implementing V3, and unfortunately, the only way I have of really understanding the spec and its consequences is through an implementation. Just the way I work. As for being too late, surely there is room for changes. I see a number of PRs (even referenced in the V3 spec) proposing significant changes: e.g. moving attributes to a separate object.

jbms commented 11 months ago

"When listing the contents of a zarr hierachy..." Agreed, But one use case we have is lazy loading of arrays. In that case, we need to locate where the arrays are located and then read the zarr.info only when the array is accessed.

Can you say a bit more what "lazy loading" means? Is the issue that your API provides a list of all of the arrays within a group (but no other information)?

DennisHeimbigner commented 11 months ago

Yes. Also with groups. We also do lazy attribute loading which is why I support storing attributes as a separate object.

jbms commented 11 months ago

Could you add a new API function that returns the list of all members, sub-groups and arrays, and recommend that users call that instead for greater efficiency when using zarr v3?

DennisHeimbigner commented 11 months ago

Not sure I follow. There seems -- to the client -- not a lot of difference in calling two functions, one for groups, one for variables versus calling a single function. Where does your proposed efficiency come from with respect to V3?

jbms commented 11 months ago

If there is one API function that lists both arrays and groups, and just returns a flat list of names without indicating which are arrays and which are groups, then you don't need to read the zarr.json files to handle calls to that API function.

DennisHeimbigner commented 11 months ago

I see. Our API does differentiate. And this is important because it is possible to walk the group tree without having to read any objects (at least in NCZarr).

rabernat commented 11 months ago

As for being too late, surely there is room for changes.

Yes, I am very optimistic that we can incorporate your suggestions and input. FWIW, I think it would be better to evolve the spec incrementally towards a "final" version as we work on implementing it, rather than deciding on the full spec up front and approving it through a big vote. The way STAC has developed their spec seems to be more effective.

And this is important because it is possible to walk the group tree without having to read any objects

I see this as a very important goal. The algorithm for walking a hierarchy is considerably more complex if you can't identify whether a node is a group or array based on the node name. As some point, the decision was made that the cost of reading the metadata files was not significant. However, I'd like to see some quantitative analysis to confirm this for realistic datasets.