xarray-contrib / xvec

Vector data cubes for Xarray
https://xvec.readthedocs.io
MIT License

Zarr serialisation #48

Open martinfleis opened 1 year ago

martinfleis commented 1 year ago

Splitting this from #26 for a more focused discussion.

During a discussion with @rabernat earlier today, we talked about the possibility of serialising vector data cubes into the Zarr format, so I want to share a few notes and links.

Zarr is flexible enough to support this in many ways, and a few options have been discussed so far.

It may be worth ensuring whatever we prototype is done in sync with the GeoZarr efforts.

edzer commented 1 year ago

I think your first proposal makes it harder to read & write Zarr, as the de/encoder needs to read Zarr AND GeoParquet; we had this discussion before when the idea came up to use WKB. I don't think the GDAL Multidimensional Array API can do that; can xarray do it?

Given the geoarrow proposal you point to, there seem to be strong reasons not to interleave ([x y x y ... ]). I think that if you use separate arrays for x and y, your second and third proposals will be nearly identical: I think geoarrow lists the offset indices of each geometry, while CF lists their lengths.

The CF convention arrays are not ragged but contiguous in memory/on disk: only when you cut them into the polygons or linestrings do they (obviously) become ragged of some sort. CF seems the most natural to me, because those who write Zarr are aware of that spec.
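Concretely, the two layouts can be sketched like this (illustrative arrays, not taken from either spec), showing how both store the same contiguous coordinates and differ only in the per-geometry bookkeeping:

```python
import numpy as np

# three linestrings stored contiguously: flat x/y arrays, nothing ragged on disk
x = np.array([0.0, 1.0, 2.0, 5.0, 6.0, 9.0, 9.5, 10.0])
y = np.array([0.0, 0.5, 1.0, 5.0, 6.0, 0.0, 0.5, 1.0])

# CF-style contiguous ragged storage records each geometry's length (node_count)
node_count = np.array([3, 2, 3])

# GeoArrow-style storage records offsets into the flat arrays instead
offsets = np.concatenate([[0], np.cumsum(node_count)])

# slicing out geometry i is then x[offsets[i]:offsets[i + 1]]
second = x[offsets[1]:offsets[2]]
```

Either form converts to the other with a cumulative sum or a difference, which is why the two proposals end up nearly identical.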

rabernat commented 1 year ago

A good way to approach this would be to mirror Xarray's existing encoding / decoding mechanism. The vector indexes could exist in two states: a decoded, in-memory state (an xvec GeometryIndex holding shapely geometries) and an encoded, serialisable state (plain coordinate and offset arrays plus attributes).

It would be nice if xarray allowed plugins to define custom coders. Until that happens, the coders could be implemented following the existing coding API and called manually from xvec, e.g.

ds = xr.open_dataset(netcdf_or_zarr_file)
ds = ds.xvec.decode() # -> create xvec GeometryIndex from x, y, offset arrays
ds = ds.xvec.encode() # -> back to raw form
ds.to_zarr(new_file)
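For context, xarray's internal coding layer is organised around coder objects with paired transforms; a hypothetical xvec coder following that shape could look like the sketch below (`GeometryCoder` is an invented name, and `VariableCoder` is xarray-internal API that may change between versions):

```python
from xarray.coding.variables import VariableCoder


class GeometryCoder(VariableCoder):  # hypothetical coder, not existing xvec API
    def encode(self, variable, name=None):
        # would turn a geometry-backed variable into raw arrays + attrs
        ...

    def decode(self, variable, name=None):
        # would rebuild shapely geometries from the raw arrays
        ...
```

Until xarray exposes a plugin hook for coders, xvec would have to call such a coder manually, exactly as in the `ds.xvec.decode()` / `ds.xvec.encode()` sketch above.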
edzer commented 1 year ago

Yes we can, but should we? I wonder what the motivation is for pulling in non-native en/decoders: the payload is really the data cube cells; the vector geometries are associated with (typically) a single dimension, meaning orders of magnitude less data than the array. Is this bikeshedding?

rabernat commented 1 year ago

I'm not sure what you mean by "non native". Could you elaborate? I'm proposing to mirror xarray's existing CF encoding / decoding mechanisms.

edzer commented 1 year ago

Maybe I'm overthinking things.

I'm proposing to mirror xarray's existing CF encoding / decoding mechanisms.

Then we're on the same page!

martinfleis commented 1 year ago

Agree with both of you.

the de/encoder needs to read Zarr AND GeoParquet

Yes, we just mentioned that it is theoretically possible but quickly moved on to GeoArrow and CF conventions as preferable solutions.

when you would use separate arrays for x and y, your second and third proposal will be nearly identical

This would also be my preference. The NetCDF serialisation to match {stars} is still on the roadmap in parallel to Zarr, and having both use either the same or a very similar representation of geometries would allow us to reuse big parts of the machinery of one in the other.

A good way to approach this would be to mirror Xarray's existing encoding / decoding mechanism.

+1

jsignell commented 1 year ago

I was taking a look at this at the SciPy sprints today and tried to extend the shapely_to_cf function in cf-xarray. I ran into a bit of trouble handling MultiLineStrings (I'll try to put up a PR tomorrow).

I was eventually pointed to https://github.com/twhiteaker/CFGeom! I'm not sure what the maintenance situation is on that, but it could either be revived or used as a reference for how to convert back and forth between CF and shapely. And they already have a to_netcdf :rocket:

martinfleis commented 1 year ago

We could probably use it to ensure that the CF representation of geometries follows the spec, but given its age, the implementation itself would likely look very different now that we have shapely 2. I assume their code may still work, but it will not be as efficient as we can be these days.

We could also generate reference NetCDF files using {stars}, if that is useful, and ensure they are equal to what CFGeom generates.

jsignell commented 1 year ago

Ok yeah that makes sense. I'll open a PR with what I've got. Not sure when I'll get a chance to come back to it.

martinfleis commented 1 year ago

I have tried to go down the simple path and encode geometries using GeoArrow-like encoding to store in Zarr, and it seems to work nicely (plus it was approved by @MSanKeys963). Below is a draft of the encode/decode functions; Zarr IO on the encoded Dataset works flawlessly.

def decode(ds):
    """Decode Dataset with Xvec indexes into GeoArrow arrays

    Parameters
    ----------
    ds : xarray.Dataset
        Dataset with Xvec-backed coordinates

    Returns
    -------
    xarray.Dataset
        Dataset with geometries represented as GeoArrow-like arrays
    """
    decoded = ds.copy()
    decoded = decoded.assign_attrs(geometry={})
    to_assign = {}  # accumulate across all geometry coordinates
    for geom_coord in ds.xvec.geom_coords:
        geom_type, coords, offsets = shapely.to_ragged_array(ds[geom_coord])
        decoded.attrs["geometry"][geom_coord] = {
            "geometry_type": geom_type.name,
            "crs": ds[geom_coord].crs.to_json_dict(),
        }
        to_assign[geom_coord] = numpy.arange(ds[geom_coord].shape[0])
        to_assign[f"{geom_coord}_coords_x"] = coords[:, 0]
        to_assign[f"{geom_coord}_coords_y"] = coords[:, 1]
        if coords.shape[1] == 3:
            to_assign[f"{geom_coord}_coords_z"] = coords[:, 2]
        for i, offsets_arr in enumerate(offsets):
            to_assign[f"{geom_coord}_offsets_{i}"] = offsets_arr

    return decoded.assign(to_assign)

def encode(ds):
    """Encode Dataset indexed by geometries represented as GeoArrow arrays into Xvec indexes 

    Parameters
    ----------
    ds : xarray.Dataset
        Dataset with geometries represented as GeoArrow-like arrays

    Returns
    -------
    xarray.Dataset
        Dataset with Xvec-backed coordinates
    """
    encoded = ds.copy()
    for geom_coord in encoded.attrs["geometry"].keys():
        offsets = [c for c in encoded.coords if f"{geom_coord}_offsets_" in c]

        geoms = shapely.from_ragged_array(
            getattr(
                shapely.GeometryType,
                encoded.attrs["geometry"][geom_coord]["geometry_type"],
            ),
            # note: a z coordinate, if present, is not restored in this draft
            coords=numpy.vstack(
                [encoded[f"{geom_coord}_coords_x"], encoded[f"{geom_coord}_coords_y"]]
            ).T,
            offsets=tuple(encoded[c] for c in offsets),
        )
        encoded = encoded.drop_vars(
            [
                f"{geom_coord}_coords_x",
                f"{geom_coord}_coords_y",
            ]
            + offsets
        )
        encoded[geom_coord] = geoms
        encoded = encoded.xvec.set_geom_indexes(
            geom_coord, crs=encoded.attrs["geometry"][geom_coord]["crs"]
        )
    del encoded.attrs["geometry"]
    return encoded

Using the example from our docs:

import geopandas as gpd
import xarray as xr
import xvec
import shapely
import numpy

from geodatasets import get_path

counties = gpd.read_file(get_path("geoda.natregimes"))
cube = xr.Dataset(
    data_vars=dict(
        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
        divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
        age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
    ),
    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object POLYGON ((-95.34258270263672 48.54670333862...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)
arrow_like = decode(cube)
arrow_like
<xarray.Dataset>
Dimensions:           (county: 3085, year: 4, county_coords_x: 80563,
                       county_coords_y: 80563, county_offsets_0: 3173,
                       county_offsets_1: 3160, county_offsets_2: 3086)
Coordinates:
  * county            (county) int64 0 1 2 3 4 5 ... 3080 3081 3082 3083 3084
  * year              (year) int64 1960 1970 1980 1990
  * county_coords_x   (county_coords_x) float64 -95.34 -95.34 ... -111.3 -111.4
  * county_coords_y   (county_coords_y) float64 48.55 48.72 ... 44.73 44.75
  * county_offsets_0  (county_offsets_0) int64 0 24 74 142 ... 80457 80489 80563
  * county_offsets_1  (county_offsets_1) int64 0 1 2 3 4 ... 3169 3170 3171 3172
  * county_offsets_2  (county_offsets_2) int64 0 1 2 3 4 ... 3156 3157 3158 3159
Data variables:
    population        (county, year) int64 4304 3987 3764 ... 43766 55800 65077
    unemployment      (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce           (county, year) float64 1.859 2.62 3.747 ... 4.782 7.415
    age               (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Attributes:
    geometry:  {'county': {'geometry_type': 'MULTIPOLYGON', 'crs': {'$schema'...

This form can be saved using arrow_like.to_zarr("xvec.zarr") and read back via xr.open_zarr("xvec.zarr"). Then you can convert it back.

encode(arrow_like)
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object MULTIPOLYGON (((-95.34258270263672 48.54670...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

The only change during the roundtrip is the cast of Polygon to MultiPolygon, since GeoArrow does not support mixing single and multi-part geometry types: everything is cast to the multi-type if there is at least one multi-part geometry.
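The cast can be seen directly with shapely's ragged-array helpers (a minimal sketch, assuming shapely >= 2.0):

```python
import shapely

square = shapely.Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
multi = shapely.MultiPolygon([square])

# a mix of Polygon and MultiPolygon is promoted to MULTIPOLYGON
geom_type, coords, offsets = shapely.to_ragged_array([square, multi])

# after the roundtrip, the plain Polygon comes back as a MultiPolygon
back = shapely.from_ragged_array(geom_type, coords, offsets)
```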

It does deviate from CF conventions but uses readily available tooling in shapely. And the format is derived from GeoArrow, so there is a spec to follow.

If you find this plausible, I will convert it to a PR.

martinfleis commented 1 year ago

Another option would be to encode geometries as WKB (well-known binary) instead of a GeoArrow-like ragged array. That allows us to keep the dimensionality and have the decoded object look like the original one.

Using the cube from above:

cube["county"] = shapely.to_wkb(cube["county"])
cube["county"].attrs["crs"] = cube["county"].attrs["crs"].to_json()
cube.to_zarr("wkb.zarr")
cube
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x18\...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
from_wkb = xr.open_zarr("wkb.zarr")
crs = from_wkb["county"].attrs["crs"]
from_wkb["county"] = shapely.from_wkb(from_wkb["county"])
from_wkb = from_wkb.xvec.set_geom_indexes("county", crs=crs)

Does anyone have a reason why we should not use WKB here? The one reason I can think of is the need to have a WKB geometry reader, but if you want to create geometries you probably already have that (it is part of GEOS).

benbovy commented 1 year ago

Anyone has a reason why we should not use WKB here?

I understand why this might not be ideal / optimal but I don't see any reason not to support it. It is convenient and it "scales well" (i.e., the number of variables, dimensions & attributes doesn't explode) in the case of a dataset with several n-d geometry arrays (data variables), assuming that this case is plausible.

Looking at the geoparquet spec, WKB support doesn't seem to be going away anytime soon and will likely co-exist with other encoding options (geoarrow) in the near future. Unless it represents a big burden to maintain, IMHO it would make sense to support those options here too (and in the geozarr-spec if it includes vector data) in addition to the CF option (via cf-xarray).

It would be nice if xarray allowed plugins to define custom coders. Until that happens, the coders could be implemented following the existing coding API and called manually from xvec

Or Xvec could provide an I/O backend, which Xarray already allows as a plugin. One thing I'm not sure about is whether it could be built on top of Xarray's built-in zarr backend (I mean reusing the implementation of raw zarr I/O in a new backend, not extending the API of xarray's zarr backend).
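A rough sketch of what such a backend could look like; the class name is invented, and the geometry-decoding step is left as a comment since the exact xvec API doesn't exist yet. The raw zarr I/O is simply delegated to xarray's own engine:

```python
import xarray as xr
from xarray.backends import BackendEntrypoint


class XvecZarrBackend(BackendEntrypoint):
    """Hypothetical backend (name invented for illustration)."""

    description = "Open Zarr stores and rebuild xvec GeometryIndex coordinates"

    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # delegate raw zarr I/O to xarray's built-in engine...
        ds = xr.open_dataset(
            filename_or_obj, engine="zarr", drop_variables=drop_variables, **kwargs
        )
        # ...then rebuild GeometryIndex coordinates from the encoded arrays,
        # e.g. with the encode()/decode() helpers sketched earlier in this thread
        return ds

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".zarr")
```

Such a class would be registered under the `xarray.backends` entry-point group so that `xr.open_dataset(..., engine="xvec")` could pick it up.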

dcherian commented 3 months ago

#69 builds on the hard work of @jsignell, @aulemahal, and me to allow encoding of vector cubes into a CF-compliant form that can be serialised to netCDF, Zarr, or really any other format that Xarray can write.

The core geometry encoding logic lives in cf-xarray (docs) and the CRS encoding logic lives here in xvec. It isn't very configurable at the moment but seems to work, and even includes support for multiple geometry arrays!

dcherian commented 3 months ago

Here is an interesting detail.

CF requires this ordering:

Nodes for polygon exterior rings must be put in anticlockwise order (viewed from above) and polygon interior rings in clockwise order.

but AFAICT shapely supports either order for the exterior ring, and that's the order we end up writing to disk.

So the files may be non-compliant if the input geometries are non-compliant.

martinfleis commented 3 months ago

GEOS does not care, but if we want to ensure the ordering, we can check it via shapely.is_ccw and normalize it if needed, somewhere within cf_xarray's conversion function.
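A sketch of such a normalisation using shapely's `orient` helper, which rewinds rings to the requested orientation without changing the geometry:

```python
import shapely
from shapely.geometry.polygon import orient

# a polygon with a clockwise exterior ring: valid for GEOS, non-compliant for CF
cw = shapely.Polygon([(0, 0), (0, 1), (1, 1), (1, 0)])
assert not shapely.is_ccw(cw.exterior)

# orient(..., sign=1.0) makes exterior rings CCW and interior rings CW
fixed = orient(cw, sign=1.0)
```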

edzer commented 3 months ago

That is some progress at least; the clockwise/anticlockwise mess exists because shapely (and GEOS, and simple features) only works with two-dimensional flat spaces (projected data): there, every valid polygon has an unambiguous inside, regardless of node ordering. When considered on the sphere or ellipsoid, every polygon divides the area in two parts, and what is considered inside the polygon has to be determined from node ordering.

Nodes for polygon exterior rings must be put in anticlockwise order (viewed from above) and polygon interior rings in clockwise order.

This requirement still doesn't work on the sphere: suppose the polygon is the equator or a meridian: where is "above"? The logic used in s2geometry is to state that when traversing the nodes, the area to the left is "inside". That works on bounded spaces like spheres.

Another issue CF would need to clarify is how edges are defined for geodetic coordinates: are they great circles (like s2geometry does) or straight Cartesian lines (like the GeoJSON RFC)?

dcherian commented 1 month ago

GEOS does not care

Nice. I think the current approach is probably fine; it ensures roundtripping of in-memory objects. The CF checker will complain, and anyone who wants fully compliant files can manually reorder.

suppose the polygon is the equator or a meridian: where is "above"?

hah nice.