zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License
1.46k stars 274 forks source link

[v3] Refactor metadata classes #1623

Open normanrz opened 8 months ago

normanrz commented 8 months ago

In the current V3 draft version, all metadata classes are closely modeled after the json representation. For example, the regular chunk grid is represented as this json object:

{
  "name": "regular",
  "configuration": { "chunk_shape": [2, 5] }
}

In Python, these objects are implemented as 2 classes

@frozen
class RegularChunkGridConfigurationMetadata:
    chunk_shape: ChunkCoords

@frozen
class RegularChunkGridMetadata:
    configuration: RegularChunkGridConfigurationMetadata
    name: Literal["regular"] = "regular"

ChunkGridMetadata = Union[RegularChunkGridMetadata]

This is an artifact of using cattrs for (de)serialization. However, we were thinking of dropping the attrs dependency. That would open the way for a more ergonomic class design that serializes to the same json structure.

class ChunkGrid(ABC):
    pass

class RegularChunkGrid(ChunkGrid):
    chunk_shape: ChunkCoords
jhamman commented 8 months ago

@normanrz - I like this suggestion. Perhaps you could expand on what the serialization/deserialization flow might look like. In the proposal to basically unpack the JSON object by key in a sequential pattern like this:

zarr_json = json.loads(...)

if zarr_json["node_type"] == "array":
    ...

    if zarr_json["chunk_grid"]["name"] == "regular":
        chunk_grid = RegularChunkGrid(chunk_shape=zarr_json["chunk_grid"]["configuration"]["chunk_shape"])

array_meta = ArrayMetadata(..., chunk_grid=chunk_grid, ...)
normanrz commented 8 months ago

I think it could work via delegation from the parent classes:

class ChunkGrid:
    @classmethod
    def from_dict(cls, val: Dict[str, Any]) -> ChunkGrid:
        assert isinstance(val, dict)
        name = val["name"]
        if name == "regular":
            return RegularChunkGrid.from_dict(val)
        else:
            raise NotImplementedError

class RegularChunkGrid(ChunkGrid):
    @classmethod
    def from_dict(cls, val: Dict[str, Any]) -> RegularChunkGrid:
        assert isinstance(val, dict)
        assert val["name"] == "regular"
        assert _validate_int_list(val["configuration"]["chunk_shape"])
        return cls(chunk_shape=tuple(val["configuration"]["chunk_shape"]))

...

zarr_json = json.loads(...)

if zarr_json["node_type"] == "array":
    ...

    chunk_grid = ChunkGrid.from_dict(zarr_json["chunk_grid"])

array_meta = ArrayMetadata(..., chunk_grid=chunk_grid, ...)

With cattrs we had nice input validation including reasonably understandable error messages. We need to decide if we want to add that now or later.

In the future this could be extended so that ChunkGrid.from_dict uses an extensible registry for different chunk grid variants.

normanrz commented 5 months ago

@d-v-b you refactored this quite a bit. Would you agree to consider this done?

d-v-b commented 4 months ago

I think we still have work to do on this, but we can work on it after the alpha release