move towards grid objects

keewis commented 8 months ago

[x] towards #35, closes #34

Before I do too much work on this, here's a draft of what I would imagine this concept to look like.

The objects are tentatively named *Info, once we have a better name I'm open to changing to that. Note that I decided against a common base class, because the only shared methods are from_dict and to_dict. For typing (once we really have that), I imagine we'd use a protocol instead. However, if we find good reason to go the other way I'd be open for that as well.

benbovy commented 7 months ago

Nice! (Sorry for the wait I missed this PR)

*Info sounds good to me.

Do we want to (eventually) publicly expose those "*Info" classes or use it only for internal purpose? I'm leaning towards the former for better documentation of the specification of each grid system and for better / safer handling of this metadata by users (i.e., allow passing those objects directly via public API). If we expose them we might need to move the conversion / validation logic from .from_dict into .__post_init__ (although the workaround for setting attributes of frozen dataclasses in post_init is rather ugly).

Regarding base class vs. protocol, I find the latter a bit overkill here (we don't expect dealing with heterogeneous "*Info" objects implemented in loosely coupled libraries). Using a base class seems cleaner to me and allows defining attributes that should be common to all DGGS specifications (e.g., "resolution").

I think a good reference is pyproj.crs.CRS.

keewis commented 7 months ago

Base class instead of protocol is fine with me, of course. I'm also leaning towards exposing the grid objects, as that would be the natural place to put grid generation methods (see #35).

However, I don't think we should move the conversion and validation logic to the constructor: DGGSInfo.from_dict({...}) is not that much more to write (use dict(level=..., ...) if keyword arguments are important), and it pushes the user towards the "standard" terms. This would also avoid the use of object.__setattr__ in __post_init__, which as you say is quite ugly.

This is for a separate PR (if at all), but I wonder if we should allow setting a "terminology dialect". For example, OGC uses zones / zonal_ids instead of cells / cell_ids. If we do this, I think the grid objects would be the place to make the translations.

benbovy commented 7 months ago

If we do this, I think the grid objects would be the place to make the translations.

It depends whether the terminology used is allowed to vary from one grid to another or if we want to use the same terminology for all kinds of DGGS. For the latter a global option may also work I guess.

However, I don't think we should move the conversion and validation logic to the constructor

IIUC there is nothing that would prevent a user creating a new instance with the constructor, right? This would be quite unsafe even if we strongly discourage it in the docs! Or is there a good way to raise an error when the constructor is called directly by the user? If that is too hacky / unsafe we could still implement regular classes and read-only properties..

keewis commented 7 months ago

IIUC there is nothing that would prevent a user creating a new instance with the constructor, right? This would be quite unsafe even if we strongly discourage it in the docs!

Yes, but I don't see how that would be an issue? Unless you're saying that people may want to pass values using positional arguments, but in that case we can just move to keyword-only.

keewis commented 7 months ago

For the latter a global option may also work I guess.

True. I guess what I wanted to enable was that you might work with datasets using different conventions in the same session, and want to be able to control the output for each of them. This is not something I have a use-case for currently, so I think it's fine to postpone until there is one (and/or discuss in an issue first).

benbovy commented 7 months ago

For example:

@dataclass(frozen=True)
class MyDGGSInfo(DGGSInfo):
    resolution: int

    @classmethod
    def from_dict(cls, resolution):
        if resolution < 0:
            raise ValueError("resolution must be positive")
        return cls(resolution=resolution)

grid_info = MyDGGSInfo.from_dict(resolution=-1)   # ValueError (nice!)

grid_info = MyDGGSInfo(resolution=-1)     # no error!
grid_info.resolution                      # negative!

Or is there something that I'm missing?

keewis commented 7 months ago

ah, right, indeed I didn't consider that. I agree, this kind of value verification is something that should definitely live in __post_init__. However, for that we wouldn't even need the frozen workaround.

keewis commented 3 months ago

@benbovy, this should now have progressed far enough for reviews.

I'm not really happy with the current test strategy, but I couldn't find a better, more consistent way.

keewis commented 2 months ago

merging to unblock several other things. We can revisit the other points in new PRs.

benbovy commented 2 months ago

Thanks @keewis great work!

xarray-contrib / xdggs

move towards grid objects #39