martinfleis opened 1 year ago
I think your first proposal makes it harder to read & write Zarr, as the de/encoder needs to read Zarr AND GeoParquet; we had this discussion before when the idea came up to use WKB. I don't think the GDAL Multidimensional Array API can do that; can xarray do it?
Given the geoarrow proposal you point to, there seem to be strong reasons not to interleave (`[x y x y ...]`). I think that if you use separate arrays for x and y, your second and third proposals will be nearly identical: geoarrow lists the offset indices of each geometry, CF their lengths.
The CF convention arrays are not ragged but contiguous in memory/on disk: only when you cut them into the polygons or linestrings do they (obviously) become ragged of some sort. CF seems the most natural to me, because those who write Zarr are aware of that spec.
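To make the comparison concrete, here is a minimal sketch (with made-up coordinates) of the two layouts over the same contiguous node array: CF stores per-geometry node counts, GeoArrow stores their prefix sums as offsets.

```python
import numpy as np

# Hypothetical example: two linestrings with 3 and 2 nodes,
# stored in one contiguous coordinate array.
x = np.array([0.0, 1.0, 2.0, 5.0, 6.0])

# CF-style: per-geometry node counts.
node_count = np.array([3, 2])

# GeoArrow-style: offsets into the contiguous array, i.e. a prefix sum.
offsets = np.concatenate([[0], np.cumsum(node_count)])  # [0, 3, 5]

# Either representation slices out the same geometries.
parts_cf = np.split(x, np.cumsum(node_count)[:-1])
parts_arrow = [x[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
assert all((a == b).all() for a, b in zip(parts_cf, parts_arrow))
```

Either form converts to the other in one vectorized call (`cumsum` / `diff`), which is why the second and third proposals end up nearly identical.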
A good way to approach this would be to mirror Xarray's existing encoding / decoding mechanism. The vector indexes could exist in two states:

- decoded: coordinates backed by an xvec `GeometryIndex`
- encoded: `x`, `y`, `offset` arrays plus associated metadata; these are just normal arrays, serializable to Zarr, netCDF, whatever

It would be nice if xarray allowed plugins to define custom coders. Until that happens, the coders could be implemented following the existing coding API and called manually from xvec, e.g.:
```python
ds = xr.open_dataset(netcdf_or_zarr_file)
ds = ds.xvec.decode()  # -> create xvec GeometryIndex from x, y, offset arrays
ds = ds.xvec.encode()  # -> back to raw form
ds.to_zarr(new_file)
```
Yes, we can, but should we? I wonder what the motivation is for pulling in non-native en/decoders: the payload is really the data cube cells, while the vector geometries are only associated with (typically) a single dimension, meaning orders of magnitude smaller data than the array. Is this bikeshedding?
I'm not sure what you mean by "non-native". Could you elaborate? I'm proposing to mirror xarray's existing CF encoding / decoding mechanisms.
Maybe I'm overthinking things.
> I'm proposing to mirror xarray's existing CF encoding / decoding mechanisms.
Then we're on the same page!
Agree with both of you.
> the de/encoder needs to read Zarr AND GeoParquet
Yes, we just mentioned that it is theoretically possible but quickly moved on to GeoArrow and CF conventions as preferable solutions.
> when you would use separate arrays for x and y, your second and third proposal will be nearly identical
This would also be my preference. The NetCDF serialisation to match {stars} is still on the roadmap in parallel to Zarr, and having both use either the same or a very close representation of geometries would allow us to reuse big parts of the machinery from one in the other.
> A good way to approach this would be to mirror Xarray's existing encoding / decoding mechanism.
+1
I was taking a look at this at the SciPy sprints today and tried to extend the `shapely_to_cf` function in cf-xarray. I ran into a bit of trouble with handling multilines (I'll try to put up a PR tomorrow).
I was eventually pointed to https://github.com/twhiteaker/CFGeom! I'm not sure what the maintenance situation is on that, but it could either be revived or used as a reference for how to convert back and forth between CF and shapely. And they already have a `to_netcdf` :rocket:
We could probably use it to ensure that the CF representation of geometries is according to spec, but given its age, the implementation itself will likely look very different now that we have shapely 2. I assume that their code may still work, but it will not be as efficient as we can be these days.
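For reference, these are the shapely 2 primitives such a reimplementation would build on; a minimal sketch with made-up geometries:

```python
import shapely

# Create two linestrings in one vectorized call (shapely 2 API).
geoms = shapely.linestrings([[[0, 0], [1, 1]], [[2, 2], [3, 3]]])

# One call flattens the whole geometry array into a contiguous
# coordinate array plus offsets...
geom_type, coords, offsets = shapely.to_ragged_array(geoms)

# ...and one call rebuilds the geometries from that representation.
roundtripped = shapely.from_ragged_array(geom_type, coords, offsets)
assert shapely.equals(geoms, roundtripped).all()
```

Both conversions are vectorized over the whole geometry array, which is where the efficiency gain over the per-geometry Python loops of CFGeom-era code comes from.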
We could also generate reference NetCDF files using {stars} if that is useful, and ensure they are equal to the ones CFGeom generates.
Ok yeah that makes sense. I'll open a PR with what I've got. Not sure when I'll get a chance to come back to it.
I have tried to go down the simple path and encode geometries using a GeoArrow-like encoding to store in Zarr, and it seems to work nicely (plus it was approved by @MSanKeys963). Below is a draft of the `encode`/`decode` functions; Zarr IO on the encoded Dataset works flawlessly.
```python
def decode(ds):
    """Decode Dataset with Xvec indexes into GeoArrow arrays

    Parameters
    ----------
    ds : xarray.Dataset
        Dataset with Xvec-backed coordinates

    Returns
    -------
    xarray.Dataset
        Dataset with geometries represented as GeoArrow-like arrays
    """
    decoded = ds.copy()
    decoded = decoded.assign_attrs(geometry={})
    for geom_coord in ds.xvec.geom_coords:
        geom_type, coords, offsets = shapely.to_ragged_array(ds[geom_coord])
        decoded.attrs["geometry"][geom_coord] = {
            "geometry_type": geom_type.name,
            "crs": ds[geom_coord].crs.to_json_dict(),
        }
        to_assign = {
            geom_coord: numpy.arange(ds[geom_coord].shape[0]),
            f"{geom_coord}_coords_x": coords[:, 0],
            f"{geom_coord}_coords_y": coords[:, 1],
        }
        if coords.shape[1] == 3:
            to_assign[f"{geom_coord}_coords_z"] = coords[:, 2]
        for i, offsets_arr in enumerate(offsets):
            to_assign[f"{geom_coord}_offsets_{i}"] = offsets_arr
        # assign within the loop so that multiple geometry coordinates all survive
        decoded = decoded.assign(to_assign)
    return decoded


def encode(ds):
    """Encode Dataset indexed by geometries represented as GeoArrow arrays into Xvec indexes

    Parameters
    ----------
    ds : xarray.Dataset
        Dataset with geometries represented as GeoArrow-like arrays

    Returns
    -------
    xarray.Dataset
        Dataset with Xvec-backed coordinates
    """
    encoded = ds.copy()
    for geom_coord in encoded.attrs["geometry"].keys():
        offsets = [c for c in encoded.coords if f"{geom_coord}_offsets_" in c]
        geoms = shapely.from_ragged_array(
            getattr(
                shapely.GeometryType,
                encoded.attrs["geometry"][geom_coord]["geometry_type"],
            ),
            coords=numpy.vstack(
                [encoded[f"{geom_coord}_coords_x"], encoded[f"{geom_coord}_coords_y"]]
            ).T,
            offsets=tuple([encoded[c] for c in offsets]),
        )
        encoded = encoded.drop_vars(
            [
                f"{geom_coord}_coords_x",
                f"{geom_coord}_coords_y",
            ]
            + offsets
        )
        encoded[geom_coord] = geoms
        encoded = encoded.xvec.set_geom_indexes(
            geom_coord, crs=encoded.attrs["geometry"][geom_coord]["crs"]
        )
    del encoded.attrs["geometry"]
    return encoded
```
Using the example from our docs:
```python
import geopandas as gpd
import xarray as xr
import xvec
import shapely
import numpy

from geodatasets import get_path

counties = gpd.read_file(get_path("geoda.natregimes"))

cube = xr.Dataset(
    data_vars=dict(
        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
        divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
        age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
    ),
    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
```

```
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object POLYGON ((-95.34258270263672 48.54670333862...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county        GeometryIndex (crs=EPSG:4326)
```
```python
arrow_like = decode(cube)
arrow_like
```

```
<xarray.Dataset>
Dimensions:           (county: 3085, year: 4, county_coords_x: 80563,
                       county_coords_y: 80563, county_offsets_0: 3173,
                       county_offsets_1: 3160, county_offsets_2: 3086)
Coordinates:
  * county            (county) int64 0 1 2 3 4 5 ... 3080 3081 3082 3083 3084
  * year              (year) int64 1960 1970 1980 1990
  * county_coords_x   (county_coords_x) float64 -95.34 -95.34 ... -111.3 -111.4
  * county_coords_y   (county_coords_y) float64 48.55 48.72 ... 44.73 44.75
  * county_offsets_0  (county_offsets_0) int64 0 24 74 142 ... 80457 80489 80563
  * county_offsets_1  (county_offsets_1) int64 0 1 2 3 4 ... 3169 3170 3171 3172
  * county_offsets_2  (county_offsets_2) int64 0 1 2 3 4 ... 3156 3157 3158 3159
Data variables:
    population        (county, year) int64 4304 3987 3764 ... 43766 55800 65077
    unemployment      (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce           (county, year) float64 1.859 2.62 3.747 ... 4.782 7.415
    age               (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Attributes:
    geometry:  {'county': {'geometry_type': 'MULTIPOLYGON', 'crs': {'$schema'...
```
This form can be saved using `arrow_like.to_zarr("xvec.zarr")` and read back via `xr.open_zarr("xvec.zarr")`. Then you can convert it back:
```python
encode(arrow_like)
```

```
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object MULTIPOLYGON (((-95.34258270263672 48.54670...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county        GeometryIndex (crs=EPSG:4326)
```
The only change during the roundtrip is the casting of Polygon to MultiPolygon, since GeoArrow does not support mixed single- and multi-part geometry types: all geometries are cast to the multi-type if there is at least one multi-type among them.
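The promotion can be demonstrated directly with shapely (made-up geometries):

```python
import shapely

# Mixed single- and multi-part input: the ragged array conversion
# promotes everything to the multi-type.
geoms = [
    shapely.Polygon([(0, 0), (1, 0), (1, 1)]),
    shapely.MultiPolygon([shapely.Polygon([(2, 0), (3, 0), (3, 1)])]),
]
geom_type, coords, offsets = shapely.to_ragged_array(geoms)
back = shapely.from_ragged_array(geom_type, coords, offsets)
# back[0] is now a single-part MultiPolygon covering the same area.
```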
It does deviate from CF conventions but uses readily available tooling in shapely. And the format is derived from GeoArrow, so there is a spec to follow.
If you find this plausible, I will convert it to a PR.
Another option would be to encode geometries as WKB (well-known binary) instead of a GeoArrow-like ragged array. That allows us to keep the dimensionality and have the decoded object look like the original one.
Using the `cube` from above:
```python
cube["county"] = shapely.to_wkb(cube["county"])
cube["county"].attrs["crs"] = cube["county"].attrs["crs"].to_json()
cube.to_zarr("wkb.zarr")
cube
```

```
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x18\...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
```
```python
from_wkb = xr.open_zarr("wkb.zarr")
crs = from_wkb["county"].attrs["crs"]
from_wkb["county"] = shapely.from_wkb(from_wkb["county"])
from_wkb = from_wkb.xvec.set_geom_indexes("county", crs=crs)
```
Does anyone have a reason why we should not use WKB here? The one reason I can think of is the need for a WKB geometry reader, but if you want to create geometries you probably already have that (it is part of GEOS).
> Does anyone have a reason why we should not use WKB here?
I understand why this might not be ideal / optimal, but I don't see any reason not to support it. It is convenient and it "scales well" (i.e., the number of variables, dimensions & attributes doesn't explode) in the case of a dataset with several n-d geometry arrays (data variables), assuming that this case is plausible.
Looking at the GeoParquet spec, WKB support doesn't seem to be going anywhere anytime soon and will likely co-exist with other encoding options (geoarrow) in the near future. Unless it represents a big maintenance burden, IMHO it would make sense to support those options here too (and in the geozarr-spec if it includes vector data) in addition to the CF option (via cf-xarray).
> It would be nice if xarray allowed plugins to define custom coders. Until that happens, the coders could be implemented following the existing coding API and called manually from xvec
Or Xvec could provide an I/O backend, which Xarray already allows as a plugin. One thing I'm not sure about is whether it could be built on top of Xarray's built-in zarr backend (I mean reusing the implementation of raw zarr I/O in a new backend, not extending the API of xarray's zarr backend).
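A rough sketch of what such a backend could look like. `XvecZarrBackend` and the `decode()` accessor method are hypothetical names following the discussion above, and registration via the `xarray.backends` entry-point group is omitted:

```python
import xarray as xr


class XvecZarrBackend(xr.backends.BackendEntrypoint):
    """Hypothetical backend reusing xarray's built-in zarr engine."""

    description = "Zarr with geometry decoding (sketch)"

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # Delegate raw I/O to xarray's own zarr engine...
        ds = xr.open_dataset(
            filename_or_obj, engine="zarr", drop_variables=drop_variables
        )
        # ...then rebuild the GeometryIndex (the hypothetical decode step
        # discussed in this thread).
        return ds.xvec.decode()

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".zarr")
```

This reuses the zarr reader's implementation rather than extending its API, which is what the question above is about.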
The core geometry encoding logic lives in cf-xarray (docs) and the CRS encoding logic lives here in xvec. It isn't very configurable at the moment but seems to work, and even includes support for multiple geometry arrays!
Here is an interesting detail.
> Nodes for polygon exterior rings must be put in anticlockwise order (viewed from above) and polygon interior rings in clockwise order.
but AFAICT shapely accepts either order for the exterior ring, and that's the order we end up writing to disk. So the files may be non-compliant if the input geometries are non-compliant.

GEOS does not care, but if we want to ensure the ordering, we can check it via `shapely.is_ccw` and normalize it if needed, somewhere within cf_xarray's conversion function.
That is some progress at least; the clockwise/anticlockwise mess exists because shapely (and GEOS, and simple features) only work with two-dimensional flat spaces (projected data): there, every valid polygon has an unambiguous inside, regardless of node ordering. When considered on the sphere or ellipsoid, every polygon divides the surface into two parts, and which part is considered the inside of the polygon has to be determined by node ordering.
> Nodes for polygon exterior rings must be put in anticlockwise order (viewed from above) and polygon interior rings in clockwise order.
This requirement still doesn't work on the sphere: suppose the polygon is the equator or a meridian: where is "above"? The logic used in s2geometry is to state that when traversing the nodes, the area to the left is "inside". That works on bounded spaces like spheres.
Another issue CF would need to clarify is how edges are defined for geodetic coordinates: are they great circles (as in s2geometry) or straight Cartesian lines (as in the GeoJSON RFC)?
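To make the difference concrete, a sketch using pyproj's geodesic helper (pyproj is used purely for illustration here): a straight lon/lat edge and a geodesic edge between the same two points pass through very different midpoints, so the two rules enclose different areas.

```python
from pyproj import Geod

# Two points at 60°N, 90° of longitude apart (illustrative values).
lon1, lat1, lon2, lat2 = 0.0, 60.0, 90.0, 60.0

# A straight edge in lon/lat space (GeoJSON-style) keeps its midpoint at 60°N.
straight_mid_lat = (lat1 + lat2) / 2

# A geodesic edge (s2geometry-style) bulges poleward: its midpoint lies
# several degrees further north.
mid_lon, mid_lat = Geod(ellps="WGS84").npts(lon1, lat1, lon2, lat2, 1)[0]
```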
> GEOS does not care
Nice. I think the current approach is probably fine; it ensures roundtripping of in-memory objects. The CF checker will complain, and anyone who wants fully compliant files can manually reorder.
> suppose the polygon is the equator or a meridian: where is "above"?
hah nice.
Splitting this from #26 for a more focused discussion.
During a discussion with @rabernat earlier today, we talked about the possibility to serialise vector data cubes into a Zarr format, so I want to share a few notes and links.
Zarr is flexible enough to support this in many ways. A few that have been discussed so far include interleaving the coordinates as `[x, y, x, y, ...]`; pay also attention to https://github.com/geoarrow/geoarrow/pull/26. It may be worth ensuring whatever we prototype is done in sync with the GeoZarr efforts.