v3 implementation experiment (zarrita)

alimanfoo commented 4 years ago

Hi all, using this issue to capture some notes from experience of a very quick-and-dirty implementation of the v3 core protocol spec, zarrita.

I thought I would try a fresh Python implementation to test some the ideas in the v3 protocol spec. I did this from a blank slate to give myself freedom to change the API and internal architecture if it made implementation easier, i.e., so I didn't have to worry about compatibility with the current Zarr Python implementation. Also it made it easier to focus just on the feature set in the v3 spec, which is smaller than the feature set in the v2 spec.

Please note that this is just a technical spike to test feasibility of spec implementation. I am not proposing to further develop or maintain zarrita as a separate package, and I am not proposing any API changes for the main Python package. Also this does not diminish in any way the value of the Python implementation work in https://github.com/zarr-developers/zarr-python/pull/568, that PR and the main Zarr package are where the development efforts should be focused. This was merely an experiment and lightweight vehicle for testing ideas.

Zarrita is a single Python module with just over 1000 lines of code. There are a collection of doctests in the repo README which I've used to verify the main functionalities are working. I have put no effort into documentation or error handling, and there will be lots of edge cases that aren't handled properly.

It implements the full feature set in the v3 spec (I think). An exception is that only C contiguous chunk memory layout is currently supported, although F contiguous should be simple to add.

In terms of compressors, only gzip or no compression is supported. Support for any numcodecs compressor could be added easily, but we only have draft codec specs for gzip at the moment, so I limited to what we have specs for.

Storage can be local file system or any remote file system supported via fsspec. E.g., S3, GCS, ABS should all work if the relevant packages are installed, although I only tested with GCS only so far.

Most of the new implementation work was around managing metadata, which is the main difference between v2 and v3 protocols. I needed new functions for creating hierarchies, groups and arrays, and for accessing and browsing nodes in the hierarchy. However, all these were straightforward to implement, and are efficient in terms of the number of underlying storage calls required. I.e., it's quite responsive to browse the hierarchy via cloud storage, which was a pain point with v2.

Because v3 has the notion of hierarchy as an entity, in addition to the root node (which can be either a group or array), it was most natural to create a Hierarchy class, in addition to Group and Array classes. This is the main place where the API and architecture diverges from the current Python Zarr package. Also, the concept if implicit groups is new, so I had to think a little how to handle those.

Apart from that, all of the logic for reading and writing data in arrays could be lifted over from the current Python Zarr package, with just some edits to cut out the extra features that are not needed for a vanilla v3 implementation.

There are several advanced features that are supported in the main Zarr Python package that I haven't tried to implement. E.g., I haven't implemented resizing arrays, or any kind of support for locking, or any of the advanced indexing methods.

Hopefully this is a useful focal point for some discussion. Comments welcome.

jakirkham commented 4 years ago

cc @Carreau @joshmoore @ryan-williams

alimanfoo commented 4 years ago

Btw @martindurant I used fsspec here but I didn't use the FSMap class but rolled my own mapper class for now just because to get this working I needed a method on the store class to do a directory-style listing of keys, which I implemented here. Let me know if you think sensible to raise an issue over the fsspec repo to discuss further.

martindurant commented 4 years ago

Note that this is also the case in my stalled PR ( https://github.com/zarr-developers/zarr-python/pull/546/files?file-filters%5B%5D=.py#diff-31d15042dbeedbf2942ace2ad4b9b2e2R1046 ). That implementation perhaps tried to do too much by combining n5-nested keys and consolidated metadata into one implementation, but I didn't think it was that complicated. I don't know how different the needs of v3 are (haven't followed your link yet, as I'm supposed to be on time off), but the version in the PR might be a good template.

alimanfoo commented 4 years ago

Thanks @martindurant. Just to note that for v3 I need to be able to distinguish between "files" and "directories" in the result. (I.e., I need to know which results are actual keys, versus which results are key sub-prefixes.)

P.S., don't read this message, you're supposed to be on time off :-)

martindurant commented 4 years ago

Your implementation still looks rather similar to mine, of course :) Was it not easier to subclass FSMap directly?

I'm not sure that adding methods to FSMap is reasonable, because it gets away from the "mappyness", while a simple wrapper class like this one is easy to make. On the other hand, zarr must be the most frequent user of FSMap.

Carreau commented 4 years ago

Thanks, I think it would still be great to push a bit zarrita and see if we can test it against the main zarr implementation to to see if we get differences, it may help to spot bugs.

As I do in my PR you have the same check everywhere that when path === '/' then there is s different logic to get the key, as there is no slash between root and .whatever I'm really wondering if this is meaning we should avoid this special case in the spec, either saying that the root metadata is at root/.whatever

We talked about this before here, and doing so seem to complicate listing dir as you pointed out. SO i"m not sure.

joshmoore commented 4 years ago

In generally, :100: for the quick-and-dirty implementation for initial evaluation. It makes the "skeleton" of the new spec very apparent.

Note that this is also the case in my stalled PR

It's not stalled due to me is it?! I've been using it locally. :+1:

martindurant commented 4 years ago

It's not stalled due to me is it

mostly I simply haven't had the chance to fill it out; but all help appreciated, since I still don't know if/how to merge the nested and consolidated features.

alimanfoo commented 4 years ago

As I do in my PR you have the same check everywhere that when path === '/' then there is s different logic to get the key, as there is no slash between root and .whatever I'm really wondering if this is meaning we should avoid this special case in the spec, either saying that the root metadata is at root/.whatever

We talked about this before here, and doing so seem to complicate listing dir as you pointed out. SO i"m not sure.

Yeah, I think you end up having to do some special casing either way.

Only way I can think of to avoid special casing would be to say that the root path is the empty string ('') rather than '/'. Then you could always construct the metadata key by doing meta/root{path}.group.

But the notion of using '/' as the root path is pretty ingrained in APIs, so was reluctant to drop that.

alimanfoo commented 4 years ago

Your implementation still looks rather similar to mine, of course :) Was it not easier to subclass FSMap directly?

Probably, was just hacking it together and did the first thing that came to me.

I'm not sure that adding methods to FSMap is reasonable, because it gets away from the "mappyness", while a simple wrapper class like this one is easy to make. On the other hand, zarr must be the most frequent user of FSMap.

FWIW I think it would be reasonable for FSMap (or any zarr store class) to offer methods for querying keys hierarchically based on use of "/" as the path delimiter. I don't think it takes away from the mappyness, there is no obligation to implement any kind of hierarchical storage internally, the data can still be a flat mapping. It's just about querying keys, analogous to the way that cloud object stores allow listing objects providing a prefix and a delimiter.

martindurant commented 4 years ago

I think it would be reasonable for FSMap (or any zarr store class)

I was talking about the implementation that actually lives in fsspec - which pretends to be a MutibleMapping and nothing more. As a zarr storage backend, it totally makes sense. Perhaps it makes sense anyway.

zarr-developers / zarr-specs

v3 implementation experiment (zarrita) #84