zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io

Failure to encode `object` types when used with `zarr.full` #2081

Open sneakers-the-rat opened 1 month ago

sneakers-the-rat commented 1 month ago

Zarr version

v2.18.2

Numcodecs version

v0.13.0

Python Version

3.11

Operating System

Mac

Installation

pip :)

Description

When using an object (specifically a pydantic model) as the fill_value in full, the metadata encoding step fails to encode (pickle) the model. It is instead passed unencoded to the JSON codec, which chokes.

Steps to reproduce

from pydantic import BaseModel
import zarr
from numcodecs import Pickle

class MyModel(BaseModel):
    x: int

array = zarr.full(
    shape=(1,2,3),
    fill_value=MyModel(x=1),
    dtype=object,
    object_codec=Pickle()
)

Additional output

Failure happens in encode_array_metadata where it tries to call json_dumps on

{
  'zarr_format': 2, 
  'shape': (10, 10, 10), 
  'chunks': (10, 10, 10), 
  'dtype': '|O', 
  'compressor': {'id': 'blosc', 'cname': 'lz4', 'clevel': 5, 'shuffle': 1, 'blocksize': 0}, 
  'fill_value': MyModel(x=1), 
  'order': 'C', 
  'filters': [{'id': 'pickle', 'protocol': 5}]
}

which, of course, fails :(

(sorry, some of the values in the meta dict differ from the example; I'm running this from my tests atm, but it can be reproduced just by running the example above)

d-v-b commented 1 month ago

Thanks for raising this issue. I think this reveals some fundamental problems with the "object" dtype. Basically, because .zarray is JSON, the fill_value attribute must also be JSON. For numeric types this is fine (it's not perfect -- JSON numbers can't represent some types of NaN), but for other types we need to define a JSON encoding for the fill_value, and this requires spec changes -- see this section for the JSON encoding of fixed-length byte strings and structured dtypes.

In principle we could alter the zarr v2 spec to include language describing a JSON encoding for fill_value in "object" dtype arrays, e.g. base64-encoded output of pickle.dumps(), but this is a very python-specific change for a file format with implementations in multiple languages. Also, we are mostly working with the zarr v3 spec these days which does not have support for "object" dtypes, because they are so problematic from a multi-language storage perspective.
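Concretely, that encoding would be something like this (purely illustrative -- neither helper exists in zarr or numcodecs today):

import base64
import pickle

def encode_object_fill_value(fill_value) -> str:
    # hypothetical helper: pickle the object and base64-wrap it so it fits in JSON
    return base64.b64encode(pickle.dumps(fill_value)).decode("ascii")

def decode_object_fill_value(encoded: str):
    # hypothetical inverse, applied when reading .zarray back
    return pickle.loads(base64.b64decode(encoded.encode("ascii")))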

What's the goal of setting the fill value to be a pydantic model? Maybe there's another way to achieve what you want.

sneakers-the-rat commented 1 month ago

good to know!

actually i would rather not dump the pickled object, but would much rather be able to provide a hook to serialize it to JSON!

context: i'm translating neurodata without borders to linkml, and the models use another tool I wrote, numpydantic, to do shape and dtype specifications with arbitrary array backends. the case here comes up because NWB has a lot of inter-object references as arrays, so for example with this VectorData class, the array will often be an array of other model objects to index, sort of like this:

from numpydantic import NDArray, Shape
from pydantic import BaseModel
from nwb_linkml.models import Unit

class UnitTable(BaseModel):
    units: NDArray[Shape["* n_units"], Unit]

and so that NDArray class allows 1-D arrays of Unit passed as numpy arrays, dask arrays, zarr arrays, and so on. here's the zarr interface, super simple

The interface system, as well as my code generator give me pretty good control over the objects and models that are created, and what would be ideal for me is to have some kind of hook that i can add to my models for serialization/deserialization. The NDArray interface can then inject that hook method into models that are passed through it.

So where pydantic has __get_pydantic_core_schema__ that allows one to customize the validation and serialization process, if i also had a __zarr_serialization__ method that allows me to control how the object gets serialized/deserialized that would be amazing. Something like, for the sake of illustration...

import zarr
from dataclasses import dataclass
from typing import Optional
from pydantic import BaseModel
from numcodecs.abc import Codec

@dataclass
class ZarrSerialization:
    data: dict[str, str | float | int]
    """whatever representation of the object is JSON-able"""
    source_object: str
    """module.object_name"""
    metadata: dict[str, str | float | int]
    """Any other json-able stuff"""
    array: Optional[zarr.Array] = None
    """if this object can be directly converted into a zarr array..."""

class MyClass(BaseModel):

    def __zarr_serialization__(
        self, codec: Codec, ctx: Optional["zarr.SerializationContext"] = None
    ) -> ZarrSerialization:
        # return something zarr knows how to make
        # (zarr.SerializationContext is made up here, just for illustration)
        return ZarrSerialization(
            data=self.model_dump(),
            source_object=".".join([self.__module__, type(self).__name__]),
            metadata={'whatever': 'else'},
        )

    @classmethod
    def _from_zarr(cls, serialization: ZarrSerialization) -> "MyClass":
        # rehydrate the model from serialization
        ...

just as a super rough example. So maybe I take the codec that is requested during serialization, i give enough information to re-create the object (or, from a multi-language perspective, I could also specify that this came from a Python object, so other languages would know they weren't supposed to try and handle it, you know what's needed there better than me), plus any other information that would be useful. Then I take that object back when loading the array (either from another fixed-name method, or i can give that during the serialization). Many of these objects have arrays nested within them, so if i could hook into the zarr serialization process generally I could return the model fields that are arrays as arrays, and then store the object metadata around that - i think the .zarr json could support that!

so then like yaml there is a 'safe load' that just returns the JSON object, and an 'unsafe load' which tries to rehydrate/cast objects. that may cut down on the complexity of supporting arbitrary objects - "we only support objects that have specifically implemented our serialization protocol"

the reason why it would be good to have arbitrary control over what gets serialized (rather than always just a pure dict of the object contents) is that in eg. NWB i'm sharing object references in my instantiated models to imitate the HDF5 object references, and so when serializing i would want to only save the instantiated model in one place and in other places save a reference to it.

Y'all know more about what's good for the format than I do obviously, and understood that arrays of objects are intrinsically awkward, but what i imagine happening with dropping support for objects (the numcodecs system is nice!) is that people will just store things as long opaque strings which is also not great for cross platform use. I would be more than happy to implement this if you're interested, because then zarr becomes sort of like a magic format to me and i can just transparently use it as a backing store for this and other data, and numpydantic can sort of behave as an "ORM-like" interface to zarr stores.

lmk!

d-v-b commented 4 weeks ago

actually i would rather not dump the pickled object, but would much rather be able to provide a hook to serialize it to JSON!

If the fill value is JSON, then maybe it's simpler to think of the zarr array as having a JSON dtype? I don't think this is very ergonomic in zarr today, because zarr is designed more for numeric types. But at least using JSON gets you around overfitting to python data structures.
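For example, something like this might get you most of the way there by keeping both the fill value and the chunk contents JSON-native (untested sketch; rehydrating models is left to the application):

import zarr
from numcodecs import JSON
from pydantic import BaseModel

class MyModel(BaseModel):
    x: int

# sketch: the fill value is a plain dict (which .zarray can serialize),
# and chunks are encoded with numcodecs' JSON codec instead of Pickle
array = zarr.full(
    shape=(1, 2, 3),
    fill_value=MyModel(x=1).model_dump(),
    dtype=object,
    object_codec=JSON(),
)

# reading back gives dicts; turning them into models is up to the caller
model = MyModel.model_validate(array[0, 0, 0])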

That being said, I'm not sure I fully understand the plan to serialize an array model inside an array (correct me if this is not an accurate characterization).

sneakers-the-rat commented 1 week ago

sorry, issue fell out of my notifs-

tl;dr: it would be nice to have a hook into zarr's serialization methods (via the extension points in zarr 3) for storing objects natively, rather than storing objects as serialized blobs, which was the initial idea.

So my overall goal is to be able to patch into zarr as a backend for data models that include arrays, and sometimes those arrays include things like references to other arrays (or more generically, objects that require custom serialization). This is for neurodata without borders, if it helps with context at all, since i think y'all have overlap with that dev team.

The existing behavior of being able to provide your own serialization codec works, but it's a little awkward: i need to implement the serialization behavior in the thing that contains the special type, rather than having a hook that allows the special type to provide its own serialization. That's one option, but imperfect, because it's basically ignorant of the zarr storage model and the object ends up stored as a variable-length string. That's the OP of the issue.
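for reference, the codec route i mean is roughly this (rough sketch; PydanticJSON is my own made-up codec, not something numcodecs ships):

import json

import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ensure_bytes, ndarray_copy
from numcodecs.registry import register_codec
from pydantic import BaseModel


class PydanticJSON(Codec):
    """Hypothetical object codec: dump pydantic models to JSON, one blob per chunk."""

    codec_id = "pydantic_json"

    def __init__(self, model: str = ""):
        # dotted path of the model class, so a reader knows what to rehydrate
        self.model = model

    def encode(self, buf):
        arr = np.asarray(buf, dtype=object).ravel()
        items = [o.model_dump() if isinstance(o, BaseModel) else o for o in arr]
        return json.dumps(items).encode("utf-8")

    def decode(self, buf, out=None):
        # comes back as plain dicts; rehydrating into model instances is on the caller
        items = json.loads(ensure_bytes(buf).decode("utf-8"))
        return ndarray_copy(np.array(items, dtype=object), out)


register_codec(PydanticJSON)

i.e. the whole chunk ends up as one opaque JSON blob, and the type information lives in the codec config rather than in the hierarchy, which is the awkwardness i mean.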

What i would really like to be able to do is to directly patch into the zarr serialization format itself and serialize the object as zarr rather than serialize the object in zarr - especially if y'all are dropping support for objects. that's the later comment.

So take for example this relatively simple data model, where I have some model Trajectory to store the locations of an object through time, and Flock that stores a simultaneously measured collection of those objects. (not a very good example, but the same idea holds with n-dimensional data, or if eg. one had measured some value like environmental temperature at each position x,y,z over time):

from typing import Any

from pydantic import BaseModel
from numpydantic import NDArray, Shape

class Trajectory(BaseModel):
    some_field: str = "whatever"
    latitude: NDArray[Shape["*"], float]
    longitude: NDArray[Shape["*"], float]
    time: NDArray[Shape["*"], float]

class Flock(BaseModel):
    other_field: int = 5
    trajectories: NDArray[Any, Trajectory]

So Trajectory is easy to do with zarr, that might look something like...

import zarr
from zarr.store import LocalStore

t_store = LocalStore('trajectory_1.zarr', mode='w')
trajectory = zarr.group(store=t_store, attributes={'some_field': 'whatever'})
latitude = trajectory.create_array('latitude', shape=(10, 1), fill_value=0)
longitude = trajectory.create_array('longitude', shape=(10, 1), fill_value=0)
time = trajectory.create_array('time', shape=(10, 1), fill_value=0)

and that same thing would work with numpydantic, which wraps zarr

import numpy as np

trajectory = Trajectory(
    latitude=zarr.zeros(shape=(10, 1)),
    longitude=("trajectory_1.zarr", "longitude"),
    time=np.arange(10),
)

But it would also be nice to be able to provide a serialization hook so that for these models I can tell zarr how they map onto zarr's group structure. For Flock, ideally i don't want to pickle/jsonize each Trajectory, but keep it in zarr's format. So if there were some hook like __serialize_zarr__ to tell zarr how a given object should be represented in its core spec (similar to pydantic's schema system), I might be able to do something like this

flock = Flock(
    trajectories = [t1, t2, t3]
)
zarr.save('my_data.zarr', flock)

and have that come out as a group flock with subgroups 0, 1, 2, etc. that each contain the group that would have been created for trajectory above. In the case that trajectories is n-dimensional, then I could manage encoding those coords myself, or hook into the chunk_key_encoding extension point (more on that below).
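To make that concrete, the hooks on those two models might look something like this (pure illustration: none of these methods exist yet, and i'm reusing the same create_array/attrs calls as the snippets above, which may not match the final v3 API):

import zarr
from pydantic import BaseModel

class Trajectory(BaseModel):
    # ... fields as above ...

    def __serialize_zarr__(self, parent: zarr.Group, key: str) -> zarr.Group:
        # hypothetical hook: write myself as a subgroup of `parent` named `key`
        group = parent.create_group(key)
        group.attrs['some_field'] = self.some_field
        for name in ('latitude', 'longitude', 'time'):
            arr = group.create_array(name, shape=getattr(self, name).shape, fill_value=0)
            arr[:] = getattr(self, name)
        return group

class Flock(BaseModel):
    # ... fields as above ...

    def __serialize_zarr__(self, parent: zarr.Group, key: str) -> zarr.Group:
        group = parent.create_group(key)
        group.attrs['other_field'] = self.other_field
        for i, t in enumerate(self.trajectories):
            # delegate to each trajectory's own hook, named by index
            t.__serialize_zarr__(group, str(i))
        return group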

This would be very useful for using zarr to model more complex data standards and formats that have things that zarr doesn't support like references, etc. - provide a serialization method that zarr understands so it becomes a transparent backend to the object which acts like an ORM model (ish).

so re:

That being said, I'm not sure I fully understand the plan to serialize an array model inside an array (correct me if this is not an accurate characterization).

i want to make a model s.t. something behaves like an array of models, but doesn't necessarily get stored as an array of serialized blobs.

This seems like it might fit in with zarr 3's extension points - eg. if there was a hook where I could specify that something should be stored with a custom data_type that is a reference to another group, or a custom chunk_key_encoding so that i can map array positions directly to keys in the store.

Sorry that this issue drifted focus, i can split off into a separate one if we want to keep this just as a bug report for the specific problem in the OP

d-v-b commented 1 week ago

I think I'm seeing two challenges in this issue (feel free to correct me if this summary is bad). The first is how to map pydantic models onto zarr hierarchies, and the second is how to serialize references to zarr arrays / groups.

Regarding mapping pydantic models to zarr hierarchies, you say:

But it would also be nice to be able to provide a serialization hook so that for these models I can tell zarr how they map onto zarr's group structure.

My approach to this has been to explicitly model zarr's hierarchical group structure in pydantic, and then serialization from zarr-the-model to zarr-the-format is relatively simple. Modelling zarr hierarchies explicitly comes at a cost -- I can't serialize an arbitrary pydantic model to a zarr hierarchy, but that's a potentially unbounded problem.

Generally speaking, if you have some data structure X that isn't shaped like a hierarchy of nodes, where each node has an attributes field, and is either a container for uniquely named subnodes (like a zarr group) or an object with all the properties of a zarr array (like a zarr array), then serializing X to a zarr hierarchy will require making a lot of decisions, many of which might depend on the particular structure of X. I don't see how changes internal to zarr-python can support this kind of thing.
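Roughly, the shape of that explicit modelling is something like this (minimal sketch, not the actual code I use):

from __future__ import annotations

from typing import Any, Union

from pydantic import BaseModel


class ArraySpec(BaseModel):
    """A zarr array node, modelled explicitly."""
    attributes: dict[str, Any] = {}
    shape: tuple[int, ...]
    dtype: str
    chunks: tuple[int, ...]


class GroupSpec(BaseModel):
    """A zarr group node: attributes plus uniquely named array / group members."""
    attributes: dict[str, Any] = {}
    members: dict[str, Union[ArraySpec, GroupSpec]] = {}

Serializing a GroupSpec to actual zarr metadata is then mostly mechanical; the hard part is deciding how an arbitrary model maps onto that shape in the first place.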

Regarding serializing references to arrays: Zarr has no formal support for this, so you would basically need to create your own serialization scheme that can map the references to the types that zarr does support: JSON and numerical values. It seems like the former is a bit easier than the latter. There's some prior art in how hdf5 models virtual datasets, and there might be users of Zarr today who make use of references to arrays and groups, but I don't have a lot of experience with this.
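The simplest version of such a scheme is probably a JSON convention in the attributes that your application, not zarr, interprets (sketch only):

import zarr

# hypothetical application-level convention: a reference to another node is just JSON
flock = zarr.open_group("my_data.zarr")
flock.attrs["trajectory_ref"] = {"kind": "zarr_node_ref", "path": "/0"}

# the reading side resolves the reference itself; zarr treats it as ordinary attributes
ref = flock.attrs["trajectory_ref"]
target = zarr.open_group("my_data.zarr", path=ref["path"])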

sneakers-the-rat commented 1 week ago

Modelling zarr hierarchies explicitly comes at a cost -- I can't serialize an arbitrary pydantic model to a zarr hierarchy, but that's a potentially unbounded problem. [...] then serializing X to a zarr hierarchy will require making a lot of decisions, many of which might depend on the particular structure of X. I don't see how changes internal to zarr-python can support this kind of thing.

Exactly - that's why you provide a __zarr_serialize__ hook to allow the object to declare how it is to be serialized and make those decisions, rather than trying to make a generic 'serialize all pydantic models' method in the library. Then the only thing that zarr-python needs to do is call that hook during serialization.

Ideally the method signature would look something like this:

def __zarr_serialize__(self, context) -> T:

where context contains the information about what group/array/etc. the object is being serialized in (if not the already-serialized json for the parent) so it can either modify it if needed (ie. if some metadata needs to be added to the parent) or just return itself as T (placeholder for json-able type, not sure if you already have a type for that).
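and just to make the context argument concrete, i'm imagining something like this (purely hypothetical, all names made up):

from dataclasses import dataclass, field
from typing import Any

import zarr

# placeholder for T: whatever JSON-able type zarr already uses internally
JSONable = dict[str, Any] | list[Any] | str | int | float | bool | None


@dataclass
class SerializeContext:
    """Hypothetical context handed to __zarr_serialize__."""
    parent: zarr.Group  # the node the object is being written into
    key: str  # the name it is being written under
    parent_metadata: dict[str, Any] = field(default_factory=dict)  # mutable, so the hook can add to it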

Regarding serializing references to arrays: Zarr has no formal support for this, so you would basically need to create your own serialization scheme that can map the references to the types that zarr does support:

yes this would be one of the purposes of providing a serialization hook, to be able to serialize things that aren't currently supported like references in such a way that the downstream application can understand how to deserialize without needing to overcomplicate the base zarr library

d-v-b commented 1 week ago

What class(es) in zarr-python would implement __zarr_serialize__?

sneakers-the-rat commented 1 week ago

potentially none, if you didn't want to use it internally. It would be something called during the various methods like https://github.com/zarr-developers/zarr-python/blob/60b4f57943419d05d831de227ce58ea2fa1997d1/src/zarr/core/array.py#L501: where something can't be cast to an NDArray-like thing, zarr would call the __zarr_serialize__ method and get back whatever is expected there. Otherwise it would be on the Array and Group classes, and it seems like the thing that would be returned is GroupMetadata or ArrayMetadata. So maybe another idea would be to have two separate methods like __zarr_array__ or __zarr_group__ for an object to declare whether it should be treated like an array or a group, if those things are handled separately.

edit: or maybe another way would be to pass the store and metadata/kwargs into the serialization method and return an instantiated Array or Group if we wanted the hook to be totally opaque, then it just substitutes for _create* methods.

I haven't read the v3 spec or implementation yet, but if this was something y'all might be interested in i could do a more thorough proposal that includes potential implementations - at this point i'm just pitching an idea that amounts to "i would really like to be able to hook into the zarr serialization process so that I can encode models that contain arrays natively," but again would love to help implement it if there is interest