zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

[v3] Refactoring our serialization handling #2144

Open TomAugspurger opened 3 months ago

TomAugspurger commented 3 months ago

Zarr version

v3

Numcodecs version

na

Python Version

na

Operating System

na

Installation

na

Description

While working on consolidated metadata, I bumped into some awkwardness in our serialization logic. It ended up being fine, but I wonder if we could do a bit better by implementing some dedicated logic for serialization and deserialization of the Zarr metadata objects.

For now, I'll assume that we aren't interested in using a 3rd-party library like msgspec, pydantic, or cattrs for this. My current implementation is at https://github.com/TomAugspurger/zarr-python/blob/feature/serde/src/zarr/_serialization.py and it's pretty complicated, but I'll be able to clean some things up (both in that file, and in the various to_dict / from_dict methods on our objects currently). I personally have a pretty high bar for libraries taking on dependencies, so I think this is worth it, but others may disagree.

I wrote up https://tomaugspurger.net/posts/serializing-dataclasses/ with a little introduction based on what I've learned so far. The short version is:

I'm not sure yet, but this might need to have some effect on the __init__ / __post_init__ of our dataclasses. In general, I'd like to move some of the parse_<thing> functions that are found in a lot of our dataclasses to the boundary of the program, and have the __init__ of the dataclasses just be the default generated by @dataclass. This will mean duplicating some logic on the external-facing methods like zarr.open, zarr.create, Group.create, but IMO that's worth it.
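
To make that concrete, here's a rough sketch of the idea (hypothetical names, not the actual zarr-python classes): the dataclass keeps the generated __init__, and any parsing / coercion happens once, at the user-facing boundary.

from dataclasses import dataclass

@dataclass(frozen=True)
class ArrayMetadata:
    shape: tuple[int, ...]  # assumed to already be the right type
    dtype: str

def create(shape, dtype) -> ArrayMetadata:
    # boundary function, standing in for zarr.open / zarr.create / Group.create
    return ArrayMetadata(shape=tuple(int(s) for s in shape), dtype=str(dtype))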

I still have some work to do (parsing nested objects like lists and tuples properly), but wanted to share https://tomaugspurger.net/posts/serializing-dataclasses/ sooner rather than later.

Steps to reproduce

n/a

Additional output

No response

d-v-b commented 3 months ago

I am very interested in improving our serialization logic, so thanks for working on this.

I bumped into some awkwardness of our serialization logic.

Could you explain exactly what's painful about serialization today? I have my own list of complaints, but I would be interested in hearing yours.

For now, I'll assume that we aren't interested in using a 3rd-party library like msgspec, pydantic, or cattrs for this.

That's a good summary of how we have approached this so far. The libraries you list all support a lot of use cases, but for zarr-python we merely need to round-trip a finite set of JSON objects into very simple python classes. That being said, my views on this have evolved a bit, and personally I am more open to using a dependency here if it were lightweight and had a big positive effect on the codebase. "lightweight" rules out pydantic, but msgspec or cattrs could be viable. And if we do keep our home-grown solution, I think we should aim, asymptotically, for our type validator routines and metadata models to be usable for modelling zarr metadata in pydantic, cattrs, etc.

In general, I'd like to move some of the parse_<thing> functions that are found in a lot of our dataclasses to the boundary of the program, and have the __init__ of the dataclasses just be the default generated by @dataclass. This will mean duplicating some logic on the external-facing methods like zarr.open, zarr.create, Group.create, but IMO that's worth it.

I would be curious to see how this looks. Personally I like having validation routines for class X in X.__init__ because it ensures that you can't accidentally create invalid instances of X. We should ensure that validation occurs at the boundary of the Zarr metadata model, but in my mind that boundary could be in many different places in a program.

sneakers-the-rat commented 2 months ago

related to: https://github.com/zarr-developers/zarr-python/issues/2081

specifically: https://github.com/zarr-developers/zarr-python/issues/2081#issuecomment-2287358746

it would be great if this could get consolidated so it could also be hooked into by external objects.

so imo these two approaches:

  1. Use a similar system for serialization (json.dumps(obj, default=custom_converter_function)), but centralize all the serialization logic in one spot.
  2. Use type hints (at runtime) to know what object to deserialize into. This gets complicated, but not too bad.

would be simplified by having a __zarr_serialize__(self) hook such that the object has its own serialization method.

So where the single default function approach requires all the logic to be centralized, having the serialization hook would instead effectively work like:

import json

def serialize(obj):
    # try plain JSON first; fall back to the object's own hook
    try:
        return json.dumps(obj)
    except TypeError as e:
        if hasattr(obj, '__zarr_serialize__'):
            return json.dumps(obj.__zarr_serialize__())
        else:
            raise e

I see the @encode_value.register decorator doing something like that, but being defined in place like that makes it a bit more challenging to handle types not anticipated by the base implementation.

Then if you also added annotations to the serialization method like

return {
    "module": obj.__module__,
    "type": obj.__class__.__name__,
    "value": obj.__zarr_serialize__()
}

(except less janky) you wouldn't need to do a big if/else switch statement on types in deserialization
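
for illustration, deserialization with that envelope could then look something like this (a sketch; __zarr_deserialize__ is a hypothetical companion hook, not something that exists today):

import importlib

def deserialize(payload: dict):
    # look the target class up from the recorded module/type names
    cls = getattr(importlib.import_module(payload["module"]), payload["type"])
    # hand the raw value back to the class's own (hypothetical) hook
    return cls.__zarr_deserialize__(payload["value"])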

d-v-b commented 2 months ago

I see the @encode_value.register decorator doing something like that, but being defined in place like that makes it a bit more challenging to handle types not anticipated by the base implementation.

At least for now, the ser/deserialization logic in question is explicitly scoped to the small, finite set of JSON types enumerated by the zarr specs, stuff like the contents of zarr.json or .zarray, and if I had to guess, we will always aim to convert those types to the most boring python classes possible, e.g. frozen dataclasses. But this might be a discussion point :)

sneakers-the-rat commented 2 months ago

I don't mean to derail the conversation, so maybe we can limit this discussion to #2081, but the point I was trying to make is that you can get both by using serialization hooks :) - the boring types get a consistent serialization interface, but if someone wanted to extend it (downstream, not in base zarr), they would have a clear means of doing so. Multiple birds with one stone: clean up serialization/deserialization for zarr internals and support generic object serialization (which I think is being deprecated in v3, right?) using a system that can be reused across zarr, rather than having separate serde implementations for different parts of the library (if I'm understanding correctly, this is just for metadata objects? not sure if it's different from the rest of serialization)

but anyway just chiming in because it was related to what we were talking about elsewhere, don't want to muddy the waters

d-v-b commented 2 months ago

(if I'm understanding correctly, this is just for metadata objects? not sure if it's different from the rest of serialization)

Yeah, the current model is that the zarr v3 and v2 specs define a small handful of JSON objects that schematize groups or arrays. Besides the user-defined attributes (which don't get inspected at all, save for checking that they're JSON serializable), this set of JSON objects is closed, and there is no general anticipation that users would extend this set (extending this set of objects would amount to extending the zarr spec itself). With that constraint, the appeal of generic object serialization is pretty limited. The design that Tom has problems with does use a serialization hook: it's called to_dict() and it has some problems that we need to fix :)

At the same time, there's a completely separate serialization path for the contents of arrays (chunks). This path is defined by the particular configuration of compressors / codecs specified in the metadata for the array, and chunk serialization generally produces an opaque bytestream at the end (e.g., because chunks get compressed).
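
For contrast, a rough sketch of that chunk path (using numcodecs directly here purely to illustrate; the real pipeline is driven by the codec configuration in the array metadata):

import numpy as np
import numcodecs

chunk = np.arange(100, dtype="float64")
codec = numcodecs.Zlib(level=1)  # stand-in for whatever the metadata specifies
encoded = codec.encode(chunk)    # opaque bytes
roundtripped = np.frombuffer(codec.decode(encoded), dtype="float64")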

TomAugspurger commented 2 months ago

Could you explain exactly whats painful about serialization today?

Just going off memory, I think the first thing I ran into was failing to implement or call to_dict() and / or from_dict(). So on the serialization side I probably ended up with values that weren't JSON serializable, and on the deserialization side I ended up with plain dicts instead of dataclass instances. This was in the context of consolidated_metadata, so the new thing to (de)serialize was a dict[str, Metadata]. Because we know that Metadata instances should already be serializable, it'd be kind of nice to just not have to think about that.

Personally I like having validation routines for class X in X.__init__ because it ensures that you can't accidentally create invalid instances of X

Does using a static type checker like mypy solve your concerns, when instances are being created directly? This doesn't really help if the program creating an ArrayMetadata isn't type checked, but at least internally to zarr-python I think the static type checking will catch a large class of errors.

And we do have the option of keeping a __post_init__ for validating values, like parse_typesize validating that the value is positive.


Thanks for the feedback @sneakers-the-rat. For this issue, I'm mostly concerned with round-tripping things that are in zarr-python, not necessarily a system that would be exposed to users or developers building on zarr-python. That said, fill_value is in a tricky spot. https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value mentions that

Extensions to the spec that define new data types must also define the JSON fill value representation.

complex maybe offers an example here. Complex numbers aren't natively JSON serializable, so the spec says to serialize them as a [float, float]. In zarr-python land, we want that to be deserialized to complex, so we do need a spot to declare that.
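
A minimal sketch of that round trip (hypothetical helper names, and ignoring the special string encodings the spec allows for non-finite floats):

def encode_complex_fill_value(value: complex) -> list[float]:
    # spec representation: [real part, imaginary part]
    return [value.real, value.imag]

def decode_complex_fill_value(data: list[float]) -> complex:
    real, imag = data
    return complex(real, imag)

assert decode_complex_fill_value(encode_complex_fill_value(1 + 2j)) == 1 + 2j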

If I had to guess, at some point zarr-python will need to grow some kind of "registration" system for (de)serializing essentially arbitrary objects. But that's a bit murky to me.

d-v-b commented 2 months ago

Does using a static type checker like mypy solve your concerns, when instances are being created directly? This doesn't really help if the program creating an ArrayMetadata isn't type checked, but at least internally to zarr-python I think the static type checking will catch a large class of errors.

And we do have the option of keeping a __post_init__ for validating values, like parse_typesize validating that the value is positive.

Maybe we don't even disagree :) I didn't intend to fixate on __init__ in particular. I don't really care where the "validate on construction" logic lives -- __init__ and __post_init__ are both fine with me, provided the basic metadata classes cannot be directly constructed in an invalid state. If redundant validation is a performance concern then we could also consider backdoors to create objects with pre-validated data, but I'm not sure how to do that.
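
Something like the following might be one direction for such a backdoor (just a sketch, not thought through): a classmethod that skips __init__ / __post_init__ via object.__new__, so nothing gets re-validated.

import dataclasses

@dataclasses.dataclass(frozen=True)
class Metadata:
    shape: tuple[int, ...]

    def __post_init__(self):
        if not all(x > 0 for x in self.shape):
            raise ValueError("shape entries must be positive")

    @classmethod
    def _from_validated(cls, shape: tuple[int, ...]) -> "Metadata":
        # hypothetical backdoor: construct without running __init__ / __post_init__
        obj = object.__new__(cls)
        object.__setattr__(obj, "shape", shape)  # works even though the class is frozen
        return obj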

And the type checker is nice, but I'm specifically interested in runtime: we should ensure that people can import zarr metadata classes, create a schema for a zarr array, and save it to storage, without worrying that their metadata might be invalid. If we have to have object validation somewhere, then validation on construction is, I think, the only way for this to be a smooth user experience.

Also, not all the zarr metadata invariants can be caught by the type system, e.g. positive integers. So we need runtime validation; the question is just where to put it, I guess.

d-v-b commented 2 months ago

to your points about the flaws in the current serialization:

Just going off memory, I think the first thing I ran into was failing to implement or call to_dict() and / or from_dict(). So on the serialization side I probably ended up with values that weren't JSON serializable, and on the deserialization side I ended up with plain dicts instead of dataclass instances. This was in the context of consolidated_metadata, so the new thing to (de)serialize was a dict[str, Metadata]. Because we know that Metadata instances should already be serializable, it'd be kind of nice to just not have to think about that.

If I recall correctly, the motivation for keeping python types around in the output of to_dict was that to_dict is returning a python dict, not raw JSON, and so for anyone who might work with that dict in python, a value like a numpy dtype is a bit more useful and intentional than a string serialization of a dtype (which looks just like any other string). I do like the clear boundary between stored JSON types and in-memory python types, because it means that all the type conversions happen in one place, at ser / deserialization time, and the JSON serialization logic only needs to know about simple python types and is thus reusable. But we are definitely in a position to re-think this decision, if it's causing problems. The pydantic solution is to define two different modes for model serialization, one that preserves python types and one that does not. I don't have any problem with this.
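
For reference, a minimal illustration of those two pydantic modes (not a proposal to adopt pydantic, just showing the shape of the API):

from datetime import date
from pydantic import BaseModel

class Example(BaseModel):
    when: date

m = Example(when=date(2024, 1, 1))
m.model_dump()             # {'when': datetime.date(2024, 1, 1)} -- python types preserved
m.model_dump(mode="json")  # {'when': '2024-01-01'} -- JSON-compatible values only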

TomAugspurger commented 2 months ago

The "validation on construction" question is interesting. There are (I think) two components:

  1. Type checking (e.g. ensuring that ArrayMetadata.shape is a tuple (or perhaps even a tuple of integers))
  2. Value checking (e.g. ensuring that all the values in ArrayMetadata.shape are non-negative integers)

And there are roughly two places where we could ensure / validate that:

  1. In the constructor (__init__ / __post_init__)
  2. In the deserialization logic

Concretely, the constructor option might look something like:

import dataclasses

@dataclasses.dataclass
class Metadata:
    shape: tuple[int, ...]

    def __post_init__(self):
        if not isinstance(self.shape, tuple):
            raise TypeError  # or we could cast
        if not all(x > 0 for x in self.shape):
            raise TypeError(...)

Whether or not to validate that is an open question.

FWIW, my preference is probably to

  1. Simplify the __init__ / __post_init__ methods as much as possible by
    • removing all parsing / coercion like parse_shapelike, parse_dtype, etc., and requiring / assuming that users are providing the right types (internally, we'd use mypy to verify that)
    • keeping the value validation logic (ensuring that all the elements in the shape tuple are positive, for example)
  2. Rely on typed deserialization to ensure that we get the right types when building one of our dataclasses from data at runtime.

So I think that my proposal would allow for something like metadata = ArrayV3Metadata(shape=[1, 2, 3], ...) which (incorrectly, but maybe harmlessly?) passes a list for shape instead of a tuple. But it would error at runtime if you did ArrayV3Metadata(shape=[-1, 2, 3]), and would happen to error if you did ArrayV3Metadata(shape=1) because 1 isn't iterable, which we rely on for the value check.
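
Spelled out, the proposal would behave roughly like this (simplified, hypothetical class; the real ArrayV3Metadata has many more fields):

import dataclasses

@dataclasses.dataclass
class ArrayV3Metadata:
    shape: tuple[int, ...]

    def __post_init__(self):
        # value check only; type checking is left to mypy
        if not all(x > 0 for x in self.shape):
            raise ValueError("all shape entries must be positive")

ArrayV3Metadata(shape=(1, 2, 3))  # ok
ArrayV3Metadata(shape=[1, 2, 3])  # wrong type, but accepted at runtime; mypy would flag it

# these would raise:
# ArrayV3Metadata(shape=(-1, 2, 3))  # fails the value check
# ArrayV3Metadata(shape=1)           # TypeError: 'int' object is not iterable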

d-v-b commented 2 months ago

So I think that my proposal would allow for something like metadata = ArrayV3Metadata(shape=[1, 2, 3], ...) which (incorrectly, but maybe harmlessly?) passes a list for shape instead of a tuple.

This is allowed today. ArrayV3Metadata.__init__ takes shape annotated as Iterable[int] (and we could even consider widening this type to Iterable[int] | int). __init__ invokes parse_shapelike, which returns tuple[int, ...], with the additional guarantee that the values are non-negative.

I definitely feel like ArrayV3Metadata.__init__ is pretty ugly, and would benefit from the introduction of some abstractions to simplify things.

removing all parsing / coercion like parse_shapelike, parse_dtype, etc., and requiring / assuming that users are providing the right types (internally, we'd use mypy to verify that)

This sounds like a bad user experience? Why should we allow the construction of invalid objects? I could see an argument for using something like a TypedDict to model the serialized form of ArrayV3Metadata, and this object would be totally (and obviously) free of runtime guarantees, but to me the entire point of modeling structured data with a class is that we can use methods to catch errors at class instantiation time. Maybe some code or a PR will make things more clear to me.