Hi @martinfleis -- not sure if this helps (or if this is duplicative with functionality in this repo) but I developed these two functions to convert GeoTIFF rasters to square vector geometries using xarray (so the flow is GeoTIFF > xarray > pd.DataFrame > gpd.GeoDataFrame > GeoParquet). Below is a reproducible example working with a population raster dataset. I'm sure the process could be improved and generalized further, but figured I'd share in case it's useful.
Nico
# %%
import pygeos
import numpy as np
import pandas as pd
import geopandas as gpd
import rasterio as rio
from rasterio import features
import xarray as xr
import rioxarray as rxr
# %%
def vectorize_pixels(xr_array: xr.DataArray) -> xr.Dataset:
    """Vectorize pixel coordinates in an xarray.DataArray raster.

    Args:
        xr_array: xarray.DataArray of target raster containing dimensions 'y' and 'x'
    Returns:
        xarray.Dataset with corner coordinates of each pixel as data variables
    """
    bounds = xr_array.rio.bounds()
    height = len(xr_array['y'])
    width = len(xr_array['x'])
    transform_constructor = rio.transform.from_bounds(
        west=bounds[0], south=bounds[1], east=bounds[2], north=bounds[3],
        width=width, height=height)
    cols, rows = np.meshgrid(np.arange(width), np.arange(height))
    # corner coordinates of every pixel (upper-left, upper-right, lower-left)
    xs_ul, ys_ul = rio.transform.xy(transform=transform_constructor, rows=rows, cols=cols, offset='ul')
    xs_ur, ys_ur = rio.transform.xy(transform=transform_constructor, rows=rows, cols=cols, offset='ur')
    xs_ll, ys_ll = rio.transform.xy(transform=transform_constructor, rows=rows, cols=cols, offset='ll')
    lons_ul = np.array(xs_ul)
    lats_ul = np.array(ys_ul)
    lats_ll = np.array(ys_ll)
    lons_ur = np.array(xs_ur)
    dims_remove = [x for x in list(xr_array.dims) if x not in set(['x', 'y'])]
    xr_data = xr_array.to_dataset(name='data').drop_dims(drop_dims=dims_remove)
    xr_data['ymax'] = xr.DataArray(data=lats_ul, coords=xr_data.coords, dims=('y', 'x'))
    xr_data['ymin'] = xr.DataArray(data=lats_ll, coords=xr_data.coords, dims=('y', 'x'))
    xr_data['xmax'] = xr.DataArray(data=lons_ur, coords=xr_data.coords, dims=('y', 'x'))
    xr_data['xmin'] = xr.DataArray(data=lons_ul, coords=xr_data.coords, dims=('y', 'x'))
    return xr_data
def shapelify_pixels(pd_data: pd.DataFrame) -> gpd.GeoDataFrame:
    """Convert a pandas.DataFrame containing bounding box columns to a geopandas.GeoDataFrame.

    Args:
        pd_data: pandas.DataFrame with 'xmin', 'ymin', 'xmax' and 'ymax' columns
            (e.g. the output of Dataset.to_dataframe() on the merged raster)
    Returns:
        geopandas.GeoDataFrame with pixels formatted as box geometries
    """
    coords = [
        pygeos.box(xmin, ymin, xmax, ymax)
        for xmin, ymin, xmax, ymax in zip(pd_data['xmin'], pd_data['ymin'], pd_data['xmax'], pd_data['ymax'])
    ]
    data = gpd.GeoDataFrame(pd_data, geometry=pygeos.to_shapely(coords), crs="EPSG:4326")
    return data
# %%
# Download data
# wget -O /Users/nm/Downloads/worldpop/DJI.tif https://data.worldpop.org/GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/DJI/dji_ppp_2020_UNadj_constrained.tif
# Read in geotif data
xr_array = rxr.open_rasterio(filename = '/Users/nm/Downloads/worldpop/DJI.tif')
# %%
# Vectorize geotif
raster_geoms = vectorize_pixels(xr_array = xr_array)
xr_array.name = 'djipop'
raster = xr.merge(objects = [xr_array, raster_geoms], join= 'left')
# %%
# Convert xarray to df
df = raster.to_dataframe()
# Convert df to gdf
gdf = shapelify_pixels(pd_data = df)
# %%
# Plot as a gdf
gdf[gdf['djipop'] > 0].reset_index().plot(column='djipop', cmap='viridis')
I'm interested in following this. Do you have an example of an existing NetCDF file that has the sort of encoded geometries you'd like to produce? That would be a very useful place to start. Presumably Xarray can open that file--it just doesn't decode the data into anything useful for computation.
I think the general pattern we're looking for is to create an xarray accessor that manages the encoding / decoding of the geometry data from this "raw" state (which is serializable) to something useful, which is not. The pint-xarray package provides a template of what this might look like.
# Open the raw data,
ds = xr.open_dataset("encoded.nc")
# convert the geometry data to something useful
ds_decoded = ds.xvec.decode_geometries()
# the inverse operation
ds_encoded = ds_decoded.xvec.encode_geometries()
xr.testing.assert_equal(ds, ds_encoded)
ds_encoded.to_netcdf("round_trip.nc")
To make this more convenient for the user, this could be wrapped in an xarray pluggable backend for I/O.
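For concreteness, a rough sketch of what such a pluggable backend could look like, assuming the (proposed, not yet existing) xvec.decode_geometries accessor method from the snippet above; the class name and the choice of the netcdf4 engine are just illustrative:

import xarray as xr
from xarray.backends import BackendEntrypoint

class VectorGeometryBackend(BackendEntrypoint):  # hypothetical name
    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # open the raw, serializable representation with an existing engine...
        ds = xr.open_dataset(filename_or_obj, drop_variables=drop_variables, engine="netcdf4")
        # ...then decode the geometry variables with the proposed accessor method
        return ds.xvec.decode_geometries()

Registering such a class under the xarray.backends entry-point group would then let users pass a custom engine name to xr.open_dataset directly.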
One problem with our ecosystem is that there is no one central package that manages this sort of decoding for geospatial data. Rioxarray does a bunch, leveraging gdal. Odc-geo does things a little differently. Metpy has its own handling of CRS and other geo stuff. So does CF Xarray.
Makes me wish geo-xarray had not turned into a zombie package. It's really needed.
Anyway, I recommend following the path I sketched above to prototype this functionality.
I think the general pattern we're looking for is to create an xarray accessor that manages the encoding / decoding of the geometry data from this "raw" state
@aulemahal already taught CF-xarray to do this but only for "point geometries". IMO we should extend it to support other things (PR welcome!). Granted it's a helper function right now, but we could easily add .cf.encode(shapely_to_geometry=True) and .cf.decode(geometry_to_shapely=True).
Part of the point of cf-xarray is to have a single place for these kinds of utilities that can be used in many places.
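For reference, a quick sketch of what those existing point-geometry helpers look like today, as a round trip from shapely points to CF variables and back (the toy coordinates are made up):

import shapely
from cf_xarray import cf_to_shapely, shapely_to_cf

points = shapely.points([0, 1, 2], [0, 1, 2])   # toy point geometries
ds_cf = shapely_to_cf(points)        # Dataset of CF geometry variables (node coordinates, geometry container, ...)
roundtripped = cf_to_shapely(ds_cf)  # back to an array of shapely points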
Just for reference, the other day I had a very similar question to @edzer in the stars repo: https://github.com/r-spatial/stars/issues/611. For now, like in Python, in R the only way to save these cubes is in R's pickle (RDS). It would be great to have an interoperable format, perhaps even based on arrow/parquet, rather than NetCDF.
For now, like in Python, in R the only way to save these cubes is in R's pickle - RDS
@e-kotov no, that is not what I said, this only applies to the case where more than one dimension is associated with vector geometries.
For the (most common) use case of a single vector dimension, CDL as described in the CF conventions @martinfleis refers to can be perfectly used, leading to NetCDF and Zarr implementations (they both follow CDL / CF). Special cases such as weird raster cells specified by their vertices - think hexagons, DGGS - can also be covered by that.
If we want to write and read more than one vector geometry dimension we can still use CDL (NetCDF, Zarr) but would have to go beyond CF conventions. Nothing wrong with that, if we have a strong use case we can propose to extend CF.
In R, I implemented stars::write_mdim() and stars::read_mdim() to write & read vector data cubes (with one vector geometry dimension). It uses the GDAL C++ multidimensional array API, as it is already using GDAL for most I/O, but could as well have used the NetCDF library. The encoding of points, multilinestrings and multipolygons in CDL arrays is pretty trivial: you just write out all points (vertices), and then index arrays that point to the start of linestrings or polygons, and for polygons, if holes are present, indexes to the rings that are holes - you get the idea.
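To make that description concrete, here is a small Python sketch (illustrative only, not stars' actual code) that writes two polygons, one of them with a hole, into exactly these kinds of flat arrays:

import numpy as np
import shapely

with_hole = shapely.Polygon([(0, 0), (4, 0), (4, 4), (0, 4)], holes=[[(1, 1), (2, 1), (2, 2), (1, 2)]])
simple = shapely.Polygon([(10, 0), (12, 0), (11, 2)])

rings = []         # (coords, is_hole) per ring, in geometry order
node_count = []    # total number of nodes per geometry
for geom in [with_hole, simple]:
    count = 0
    for ring, is_hole in [(geom.exterior, False)] + [(r, True) for r in geom.interiors]:
        coords = np.asarray(ring.coords)
        rings.append((coords, is_hole))
        count += len(coords)
    node_count.append(count)

x = np.concatenate([c[:, 0] for c, _ in rings])           # all vertices, flattened
y = np.concatenate([c[:, 1] for c, _ in rings])
part_node_count = np.array([len(c) for c, _ in rings])    # nodes per ring (part)
interior_ring = np.array([int(h) for _, h in rings])      # 1 where the ring is a hole
node_count = np.array(node_count)                         # nodes per geometry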
Thanks all!
We can split the issue into two problems:
1. reading and writing cubes with a single vector geometry dimension, which the existing CF geometry conventions already cover, and
2. supporting more than one vector geometry dimension in a file, which goes beyond the current CF conventions.
For the first one, I have created an example NetCDF encoding polygons using this R code (in the rocker/geospatial container).
library(stars)
nc = st_read(system.file("gpkg/nc.gpkg", package="sf"))
location = st_geometry(nc)
mode = c("car", "bike", "foot")
day = 1:100
hour = 0:23
dims = st_dimensions(location = location, mode = mode, day = day, hour = hour)
n = dim(dims)
traffic = array(rpois(prod(n), 10), dim = n) # simulated traffic counts
ds = st_as_stars(list(traffic = traffic), dimensions = dims)
write_mdim(ds, "stars_cube.nc")
The file is attached (had to manually zip it to upload) -> stars_cube.nc.zip
Xarray does read it and it seems that we should be fairly easily able to convert that information into an array of shapely geometries with pyproj.CRS assigned via xvec.
ds_stars = xr.open_dataset("stars_cube.nc")
print(ds_stars)
<xarray.Dataset>
Dimensions: (hour: 24, day: 100, mode: 3, location: 100, node: 2529,
part: 108)
Coordinates:
* location (location) float32 1.0 2.0 3.0 4.0 ... 97.0 98.0 99.0 100.0
* mode (mode) object 'car' 'bike' 'foot'
* day (day) float32 1.0 2.0 3.0 4.0 5.0 ... 97.0 98.0 99.0 100.0
* hour (hour) float32 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0
* node (node) float32 1.0 2.0 3.0 ... 2.528e+03 2.529e+03
x (node) float32 ...
y (node) float32 ...
Dimensions without coordinates: part
Data variables:
traffic (hour, day, mode, location) float32 ...
crs |S1 ...
node_count (location) float32 ...
part_node_count (part) float32 ...
interior_ring (part) float32 ...
geometry float32 ...
Attributes:
Conventions: CF-1.6
where
print(ds_stars.crs)
<xarray.DataArray 'crs' ()>
[1 values with dtype=|S1]
Attributes:
grid_mapping_name: latitude_longitude
long_name: CRS definition
longitude_of_prime_meridian: 0.0
semi_major_axis: 6378206.4
inverse_flattening: 294.978698213898
spatial_ref: GEOGCS["NAD27",DATUM["North_American_Datum_...
crs_wkt: GEOGCS["NAD27",DATUM["North_American_Datum_...
can be interpreted by pyproj, and x, y, node_count, part_node_count and interior_ring can be used to construct shapely geometry. The latter should ideally be implemented in cf_xarray.cf_to_shapely() and included in the decode logic @rabernat posted above.
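For illustration, a rough Python sketch (not cf_xarray's actual implementation) of how those variables could be turned back into shapely polygons:

import numpy as np
from shapely import MultiPolygon, Polygon

def cf_to_polygons(x, y, node_count, part_node_count, interior_ring):
    coords = np.column_stack([x, y])
    pnc = np.asarray(part_node_count, dtype=int)           # nodes per ring (part)
    starts = np.concatenate([[0], np.cumsum(pnc)[:-1]])    # start offset of each ring
    rings = [coords[s:s + n] for s, n in zip(starts, pnc)]

    geoms, part = [], 0
    for n in np.asarray(node_count, dtype=int):             # nodes belonging to one geometry
        parts, used = [], 0
        while used < n:                                      # consume this geometry's rings
            ring = rings[part]
            if interior_ring[part]:
                parts[-1][1].append(ring)                    # hole: attach to the previous exterior
            else:
                parts.append((ring, []))                     # new exterior ring
            used += len(ring)
            part += 1
        geoms.append(MultiPolygon([Polygon(shell, holes) for shell, holes in parts]))
    return geoms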
And then do the same the other way.
For the second one (multiple geometry dimension in a file), I believe that should be discussed separately and probably somewhere else. We can then implement here or in other packages whatever spec a community agrees on.
What about a custom codec (filter) for encoding / decoding the 1-d geometry coordinates of the vector data cubes?
Regarding zarr (python), this can be registered dynamically via numcodecs. For hdf5 / netcdf4 I'm less sure but it looks like custom filters are supported too (e.g., https://github.com/silx-kit/hdf5plugin)?
The geoparquet format currently encodes geometries in the WKB format, which is used by geopandas to read/write parquet and feather files. For a Python/Zarr prototype, I guess we could adapt some logic from geopandas and register a custom numcodecs codec that encodes (decodes) to (from) WKB? Such a codec would nicely fit in Xvec actually. Since it is handled by zarr, there wouldn't be anything specific to do on the Xarray side to convert the raw vector data. Perhaps the same can be done for h5py (netcdf4)?
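For concreteness, a rough sketch (hypothetical, not an existing codec) of what such a WKB-based numcodecs codec could look like; actually wiring it into a Zarr array (object-dtype handling, configuration) would need more care:

import numcodecs
import numpy as np
import shapely
from numcodecs.abc import Codec

class WKBGeometryCodec(Codec):  # name chosen for illustration
    codec_id = "wkb_geometry"

    def encode(self, buf):
        # object array of shapely geometries -> object array of WKB byte strings
        return shapely.to_wkb(np.asarray(buf, dtype=object))

    def decode(self, buf, out=None):
        # WKB byte strings -> object array of shapely geometries
        return shapely.from_wkb(np.asarray(buf, dtype=object))

numcodecs.register_codec(WKBGeometryCodec)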
I'm not sure how best to deal with metadata, though. The geozarr spec seems a nice place to discuss / address this. For vector-specific metadata, it might be worth looking at the geoarrow specs.
I'm not familiar with the CF conventions for geometries, but at a first glance it looks a bit more convoluted (and maybe less flexible?) than the custom codec approach.
More generally, it would be nice if we could leverage the best from both Arrow/Parquet and Zarr (HDF5/NetCDF). Those formats are pretty much de facto standards, or on the way to becoming so. If that's possible, it looks like an option that would be both efficient and reasonably portable for vector data cube I/O.
It would be great to have an interoperable format, perhaps even based on arrow/parquet, rather than NetCDF.
I agree interoperability is a great goal to strive for. I have been assuming that data cubes cannot be serialized to Parquet / Arrow because those formats are fundamentally aimed at dataframes (and not ND-Arrays). Maybe vector data are different however. If your data fit into Parquet, Pandas is probably the right choice, rather than Xarray.
Xarray's internal data model is based on NetCDF. However, we can serialize this data model to many different formats. Something like this.
graph TD;
CF-conventions--> NetCDF-Data-Model;
NetCDF-Data-Model-->NetCDF3-Binary-Format
NetCDF-Data-Model-->HDF5;
NetCDF-Data-Model-->Zarr;
NetCDF-Data-Model-->Arrow?
That said, I'd love for someone to show me how to serialize the NetCDF data model to Arrow / Parquet.
What about a custom codec (filter) for encoding / decoding the 1-d geometry coordinates of the vector data cubes?
This is an interesting idea. Should all of Xarray's CF decoding be re-implemented as custom codecs? I tend to think of codecs as decompression-related. Packing too much logic into the codec concept feels a bit fragile. On the other hand, Xarray's convention encoding / decoding logic is remarkably similar to how codecs work, so maybe this is worth pursuing 🤔 .
I'm not familiar with the CF conventions for geometries, but at a first glance it looks a bit more convoluted (and maybe less flexible?) than the custom codec approach.
One additional point... If an existing standard exists for this problem we should absolutely try to use it, rather than inventing a new standard! I don't need to share the XKCD cartoon which we all know by heart by now.
Edit: I read @edzer's comment more carefully:
If we want to write and read more than one vector geometry dimension we can still use CDL (NetCDF, Zarr) but would have to go beyond CF conventions.
...and now I understand that indeed a new standard is needed. My experience with the CF conventions is that the timescale for something new to be included is about 2 years. I assume we want to move quicker here, so coming up with an ad-hoc extension to those conventions and demonstrating interoperability between R and Python seems like a great place to start.
What about a custom codec (filter) for encoding / decoding the 1-d geometry coordinates of the vector data cubes?
I think the strength of NetCDF and Zarr is that you only need the NetCDF library to read everything, nothing else. Introducing something like WKB in it (assuming you can: with a bit array? But how to deal with the ragged nature?) would mean that someone just using the NetCDF library can no longer interpret the geometries. With the CF geometry conventions, you can. I wouldn't start thinking about giving up this level of interoperability: NetCDF will soon turn 50.
The CF geometry conventions are really, AFAICT, the simplest you can get. I believe that Arrow takes the same approach for speed reasons (right @paleolimbot?). The advantage over WKB is that with explicit indexes you can easily (directly) jump to the last sub-polygon, for instance; with WKB you'd have to go through the full set of rings before finding the last. These CF geometries give up on the difference between LINESTRING and MULTILINESTRING, and between POLYGON and MULTIPOLYGON, so no clean roundtrips of simple features but you could argue that those distinctions are a luxury (and often a hassle) anyway.
@martinfleis great example; for testing it would make sense to also have a polygon with holes, which this dataset doesn't have.
For the second one (multiple geometry dimension in a file), I believe that should be discussed separately and probably somewhere else. We can then implement here or in other packages whatever spec a community agrees on.
I'll see if I can think of an approach how this could be done with CDL / NetCDF building on the CF approach. Maybe adding one (top) level of nesting, indicating the dimension involved, would be enough.
@rabernat I also understand that Parquet/Arrow cater for columnar data, not arrays. You can put an array in a column but you'll lose all its advantages.
I think that the only thing we could borrow from Arrow is the Geo-Arrow geometry encoding but it is incompatible with CF encoding. Arrow uses interleaved coordinate values ([x, y, x, y, ...]). Btw, this discussion may be relevant as well on saving raster as Arrow https://github.com/geoarrow/geoarrow/issues/24
I would also try to stick as much as we can to existing standards. The CF geometry encoding looks straightforward to work with, and if we can find a way of saving multiple dimensions, we can implement it in R and Python as a de facto standard and try to get it into the CF conventions later, if going through CF first would slow things down. We just need to ensure that our extension does not break reading of the same file elsewhere.
I think the strength of NetCDF and Zarr is that you only need the NetCDF library to read everything, nothing else. Introducing something like WKB in it (assuming you can: with a bit array? But how to deal with the ragged nature?) would mean that someone just using the NetCDF library can no longer interpret the geometries. With the CF geometry conventions, you can. I wouldn't start thinking about giving up this level of interoperability: NetCDF will soon turn 50.
Fair enough. You have more experience than me with WKB, CF (geometry), etc. to know what would be the best option.
Arrow uses the interleaved coordinate values ([x, y, x, y, ...],)
Is it settled? There seems to be a long unresolved discussion here: https://github.com/geoarrow/geoarrow/pull/26.
I was also naively wondering how compatible the CF conventions vs. the geo-arrow specs are for encoding geometries, and how concerning poor compatibility would be in terms of performance and/or other aspects? Especially if we expect many back and forth conversions between data cubes <-> data frames in workflows that involve vector data (usually more than for gridded/raster data)?
Arrow uses the interleaved coordinate values ([x, y, x, y, ...],) Is it settled? There seems to be a long unresolved discussion here: https://github.com/geoarrow/geoarrow/pull/26.
Probably not entirely (I guess that is the main reason why GeoParquet 1.0 still uses WKB) but I lost track of that a bit. I think that at least GeoPolars and cuSpatial already use the interleaved option.
I was also naively wondering how compatible the CF conventions vs. the geo-arrow specs are for encoding geometries, and how concerning poor compatibility would be in terms of performance and/or other aspects? Especially if we expect many back and forth conversions between data cubes <-> data frames in workflows that involve vector data (usually more than for gridded/raster data)?
We have three sides here, not two. One is the CF representation, another Geo-Arrow, but we also need to keep in mind that in most cases we need GEOS geometries to work with (that applies to both Python and R). So imho fast conversion between CF and GEOS (shapely) is more important in our ecosystem than CF to Geo-Arrow. And for shapely, having separate x and y arrays is better than an interleaved array that would need to be split into separate arrays anyway (or a 2D array, but that still involves reshuffling).
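A tiny illustration of that last point: shapely's vectorized constructors take separate x and y arrays directly, while an interleaved buffer needs an extra reshaping step first.

import numpy as np
import shapely

x = np.array([0.0, 1.0, 2.0])
y = np.array([10.0, 11.0, 12.0])
pts = shapely.points(x, y)                            # separate arrays: no reshuffling needed

interleaved = np.array([0.0, 10.0, 1.0, 11.0, 2.0, 12.0])
pts_too = shapely.points(interleaved.reshape(-1, 2))  # interleaved: reshape into (n, 2) first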
So imho fast conversion between CF and GEOS (shapely) is more important in our ecosystem than CF to Geo-Arrow.
Yes that's true for shapely. For S2 (s2geography / spherely) I think it is possible to reuse external, in-memory coordinate data. I vaguely remember about @paleolimbot sharing some ideas about using S2 directly with Arrow structures (https://dewey.dunnington.ca/post/2021/prototyping-an-apache-arrow-representation-of-geometry/).
I just wanted to chime in that if interleaved coordinates become a blocking feature for a lot of people toward using GeoArrow, that's a strong reason to drop that requirement.
The CF geometry conventions are really, AFAICT, the simplest you can get. I believe that Arrow takes the same approach for speed reasons (right paleolimbot?)
Other than just reading them now, I'm not all that familiar with the CF conventions for geometry...it looks like the only difference is that it stores "lengths" instead of "cumulative lengths", and as @edzer noted, there is no distinction between multi and non-multi types and no support for mixed types. I think Edzer's point about interoperability is huge: no need for a WKB parser! And nothing does ndarrays like NetCDF. From reading this thread I can't help but side with NetCDF + CF conventions (but I'm very new to this discussion).
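For what it's worth, converting between the two layouts is a one-liner either way, e.g.:

import numpy as np

node_count = np.array([3, 5, 2])                         # CF-style per-geometry lengths
offsets = np.concatenate([[0], np.cumsum(node_count)])   # Arrow-style cumulative offsets -> [0, 3, 8, 10]
lengths_again = np.diff(offsets)                         # and back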
I vaguely remember about paleolimbot sharing some ideas about using S2 directly with Arrow structures
Yes, S2 is very flexible about its data structure (basically every edge is a virtual method call...it makes the bet that many of those edges will never need to be resolved). One could write an S2Geography subclass that lazily reads from the file. That said, I don't think it's a good use of anybody's time to actually do that.
Is it settled? There seems to be a long unresolved discussion
My current prediction is that the GeoArrow specification will allow for both, and there will be a small vendorable library that will make it trivial to translate between WKB, WKT, and both forms of coordinate representations. That small vendorable library ( https://github.com/geoarrow/geoarrow-cpp ) is progressing as fast as my kids' capability for napping/sleeping allows, which is at present very very slow.
Hi All. I think the discussion here is largely on target. We designed the CF geometry spec with simplicity and the 90% level of use case support in mind. @edzer is totally on point with regard to the design theory and accessibility of the encoding here (https://github.com/martinfleis/xvec/issues/26#issuecomment-1422599146), but not quite correct that multi types aren't supported. See the examples in the vignette linked below for details.
A couple resources.
A poster that Tim and I prepared that talks about the use cases and implementation strategy. https://www.unidata.ucar.edu/mailing_lists/archives/netcdf-java/2018/pdfxJp5EmUlti.pdf
A vignette for my R package that implements the core CF geometry spec. The package is built into the {stars} pure-NetCDF reader and also writes the NetCDF geometry format with time series. This vignette and the package test data might be a good place for someone to start. I think the same tests are in Tim's package. https://doi-usgs.github.io/ncdfgeom/articles/geometry.html
Tim’s python reference implementation docs. https://twhiteaker.github.io/CFGeom/tutorial.html
Happy to help clarify things or cross validate tests.
Note that I also worked with a student a few years back to get the core of the cf geometry into gdal (for better or worse) 😅
The format really hasn’t caught hold, but I still think it has a place in the mix. It’s dead simple and pretty easy to read / write from wkb/wkt structures.
Noting some comments about adding further dimensionality. No reason not to. Just call it a custom extension and move on. If it catches on, bring it to CF. I’d be happy to help evolve that part of the spec.
Cheers!
I'm still working on wrapping my head around the specifics of vector data cubes (and in general rather unfamiliar with CF etc), but naively it seems like you might still be able to use Arrow.
In a simple case, where the geometry column maps against the first axis of the data cube, you could store the data cube in a nested fixed size list column (essentially the raster/tensor type). Then as I understand it, the entire column's data buffer would be equivalent to a C-order multidimensional array. You could store the geometries in any format you'd like.
I think that at least GeoPolars and cuSpatial already use the interleaved option.
Right now geopolars uses a separated/struct array because polars doesn't support the FixedSizeList type. I hope to implement that in polars at some point to enable using interleaved arrays.
I played around a bit with cf_xarray and xvec and made the following example work. The sample file I'm using is from the ncdfgeom R package: https://github.com/DOI-USGS/ncdfgeom/tree/main/inst/extdata
Note that this file doesn't actually conform to the CF geometries spec, so it has to be augmented before it can be used.
import xarray as xr
import cf_xarray
import xvec
# to download
# wget https://github.com/DOI-USGS/ncdfgeom/raw/main/inst/extdata/example_huc_eta.nc
ds = xr.open_dataset('example_huc_eta.nc')
# assign missing geometry_container variable
# Note: we omit node_count because these are point data
ds.coords['geometry_container'] = 0
ds.geometry_container.attrs['geometry_type'] = 'point'
ds.geometry_container.attrs['node_coordinates'] = 'lat lon'
# use cf_xarray to convert to shapely geometries
geometries = cf_xarray.cf_to_shapely(ds)
# (because node_count was missing, the dimension name gets auto-assigned to "features")
# now make this a coordinate variable on the original dataset
ds.coords['geometry'] = geometries.rename({"features": "station"})
# Now make this a dimension, which creates a new Pandas Index
# (It doesn't make sense to me to first create the pandas index. Is this efficient?)
ds_swapped = ds.swap_dims({"station": "geometry"})
# now set the geometry index
ds_cube = ds_swapped.xvec.set_geom_indexes("geometry", crs="4326")
This shows how it is plausible to decode CF geometries from any Xarray dataset into an xvec GeometryIndex. Here are some potential next steps.
1. Extend cf_xarray to support geometry types beyond point. Here I wonder if geoarrow could help with a very efficient decoding of the raw array data into shapely geometries. @paleolimbot - does that seem like the right place to plug in geoarrow?
2. Add decode_geometries and encode_geometries methods to facilitate going back and forth between the encoded / decoded representations. (This is just some syntactic sugar around code like what I wrote above, which is essentially decode_geometries.) At this point, we can serialize the geometries into any file format supported by Xarray (mainly NetCDF and Zarr [see #48]) using the same machinery.
Once those two are done, we would want to decide where to create the xvec indexes. It could possibly be done by cf_xarray during the decode_geometries step. Or it could be a separate step. For example
ds = xr.open_dataset("file_with_cf_geometries.nc")
ds = ds.cf.decode_geometries()
ds = ds.xvec.set_geom_indexes()
does that seem like the right place to plug in geoarrow?
I think geoarrow can currently get you a pyarrow.Array from the types of buffers you're describing (with the caveat that it's not on pypi or conda-forge yet):
python -m pip install "https://github.com/geoarrow/geoarrow-c/archive/refs/heads/main.zip#egg=geoarrow&subdirectory=python"
>>> import geoarrow.pyarrow as ga
>>> import numpy as np
>>> ga.linestring().from_geobuffers(None,np.array([0, 2, 5], dtype=np.int32), np.array([1, 2, 3, 4, 5], dtype=np.double), np.array([6, 7, 8, 9, 10], dtype=np.double))
LinestringArray:LinestringType(geoarrow.linestring)[2]
<LINESTRING (1 6, 2 7)>
<LINESTRING (3 8, 4 9, 5 10)>
Recent versions of pandas and pyarrow can get you a pandas.Series backed by a pyarrow.Array or pyarrow.ChunkedArray, or you could convert to geopandas (slow because of dtype=object). I don't know what the constraint is on the end value type for xarray...if you can avoid dtype=object you will probably be a lot happier.
Currently the way this works (sans any arrow) is something like:
1. cf_xarray.cf_to_shapely consumes the raw node coordinate arrays read from the file and produces a numpy array with object dtype. Each object is a shapely.geometry.point.Point. This definitely requires new memory allocation.
2. ds.swap_dims turns that array into a PandasIndex. Its repr looks like PandasIndex(Index([POINT (36.488959 -80.399735), POINT (36.43594 -80.365249)], dtype='object', name='geometry')). Is a copy involved here? 🤷
3. xvec.set_geom_indexes takes this and turns it into an xvec.index.GeometryIndex. Is a copy involved here? 🤷
It's unclear to me how best to plug in Arrow, since Xarray is not fundamentally a dataframe library. Presumably Arrow could help by reducing the amount of memory copies? Or make the cf_to_shapely step faster?
I'm rather new to the Python universe and so I'm not sure what the best way to plug in Arrow is either. It is true that the pyarrow.Array that from_geobuffers() produces does not copy the 1D numpy arrays it is given as input. pyarrow is also a convenience here...geoarrow is implemented in C and produces C structures converted to pyarrow at the Python level.
If the ultimate endpoint always has to be numpy array with dtype=object, implementing from_ragged_array with non-interleaved coordinates will probably be a lot simpler than invoking geoarrow/arrow. If a pandas.Series is acceptable, there may be some optimizations that are possible to avoid copies.
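As a small sketch of that from_ragged_array route for the point case above (the coordinate values are just illustrative, taken from the repr earlier in the thread):

import numpy as np
import shapely
from shapely import GeometryType

lat = np.array([36.488959, 36.43594])     # node coordinate arrays as read from the file
lon = np.array([-80.399735, -80.365249])

coords = np.column_stack([lat, lon])      # (n, 2); no interleaving or offsets needed for points
points = shapely.from_ragged_array(GeometryType.POINT, coords)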
AFAIK there are not many constraints on the Xarray side regarding the types of array supported. I think any type implementing the Python (Numpy) array API could in principle be wrapped in an Xarray variable.
If the goal is to use it with an Xarray index, there is even less constraint: Xarray internally relies on some adapters to reuse Pandas (multi-)index objects within both Xarray variables and indexes so that they share the same data. Likewise, it should also be possible to reuse a pyarrow.Array in an xvec.GeometryIndex and its corresponding Xarray variable. If that's not currently possible we should allow it!
@rabernat I don't think the swap_dims and set_geom_indexes steps do any copy of the underlying shapely objects, but I might be wrong.
I agree that Arrow will probably not bring a lot of added value if at some point (step 4) we need to create new shapely (GEOS) objects that can't reuse the same arrow buffers.
For Geographic indexes built on top of S2Geometry, I think @paleolimbot has shown that it could be possible to use S2 with Arrow without copying the data. So I guess it would make more sense to plug in Arrow in this case. However, there is still a lot of work to put all the pieces together, at least on the Python side.
Here I wonder if geoarrow could help with a very efficient decoding of the raw array data into shapely geometries
I don't think so. We should ensure that the way we encode geometries follows a similar logic to what geoarrow implements but I don't think there's any reason to plug arrow in this process, when the source is NetCDF or Zarr.
Once those two are done, we would want to decide where to create the xvec indexes.
That depends on what cf_xarray folks want to do with their package. From my perspective it makes sense that decoding geometry gives you back an xvec.GeometryIndex rather than a vanilla array. If the additional dependency of xvec, pyproj (and hence PROJ) is an issue, we can potentially depend on it only optionally.
Likewise, it should also be possible to reuse a pyarrow.Array in an xvec.GeometryIndex and its corresponding Xarray variable. If that's not currently possible we should allow it!
Not an expert on pyarrow but I am afraid that shapely's implementation of GEOS bindings may not be compatible with this idea as it assumes a numpy array. But I may be wrong here.
From my perspective it makes sense that decoding geometry gives you back an xvec.GeometryIndex rather than a vanilla array.
Do you construct the index from an array of shapely objects anyway? In which case, it seems preferable to have xvec depend on cf-xarray for decoding to shapely objects.
From my perspective it makes sense that decoding geometry gives you back an xvec.GeometryIndex rather than a vanilla array.
Do you construct the index from an array of shapely objects anyway? In which case, it seems preferable to have xvec depend on cf-xarray for decoding to shapely objects.
Yes, we do expect an array of shapely objects. I am fine depending on cf-xarray and rolling our own encode and decode logic that takes cf_to_shapely and shapely_to_cf and ensures it results in a GeometryIndex.
With FixedShapeTensorArray, there are now methods for converting to/from a multidimensional numpy array, though it's not clear if you can get full xarray metadata to round trip.
I can get an xarray.DataArray to pyarrow with
import pyarrow as pa
import xarray as xr
import numpy as np

def data_array_to_pyarrow(arr: xr.DataArray) -> pa.FixedShapeTensorArray:
    value_type = pa.from_numpy_dtype(arr.dtype)
    shape = arr.shape[1:]
    size = arr.size // arr.shape[0]
    arrow_type = pa.fixed_shape_tensor(value_type, shape, dim_names=arr.dims[1:])
    pa_arr = pa.FixedSizeListArray.from_arrays(np.ravel(arr, order="C"), size)
    return pa.ExtensionArray.from_storage(arrow_type, pa_arr)
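A small usage sketch with a dummy DataArray (assuming a recent pyarrow where the fixed-shape tensor extension type and its to_numpy_ndarray method are available): the values round trip, but coords and dim labels do not.

da = xr.DataArray(np.arange(24.0).reshape(4, 3, 2), dims=("obs", "y", "x"))
tensor_arr = data_array_to_pyarrow(da)
print(tensor_arr.type)                      # fixed-shape tensor type with shape (3, 2)
print(tensor_arr.to_numpy_ndarray().shape)  # (4, 3, 2), matching da.values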
But this seems to omit some of the DataArray's metadata: it doesn't seem possible to round trip the other dimensions' coordinates after the first?
Testing with some example data from the xvec docs:
import geopandas as gpd
import numpy as np
import pandas as pd
import pyarrow as pa
import xarray as xr
import xvec
from geodatasets import get_path
counties = gpd.read_file(get_path("geoda.natregimes"))
pop1960 = xr.DataArray(counties.PO60, coords=[counties.geometry], dims=["county"])
pop1960 = pop1960.xvec.set_geom_indexes("county", crs=counties.crs)
population = xr.DataArray(
    counties[["PO60", "PO70", "PO80", "PO90"]],
    coords=(counties.geometry, [1960, 1970, 1980, 1990]),
    dims=("county", "year"),
).xvec.set_geom_indexes("county", crs=counties.crs)
The population DataArray has county coordinates, which could be saved as a WKB or GeoArrow geometry column in GeoArrow/GeoParquet, but it also has the coordinates in year defined as array([1960, 1970, 1980, 1990]). I'm not sure if/how that could be serialized onto the arrow type. Maybe Arrow could extend the available metadata on the type?
Interesting! If we could save geometries as GeoArrow, could we potentially load this directly to lonboard for visualisation? Assuming we resolve the coordinate issue.
Also, this is limited to DataArray, right? So no flexibility to save a Dataset so far.
Interesting! If we could save geometries as GeoArrow, could we potentially load this directly to lonboard for visualisation? Assuming we resolve the coordinate issue.
Yeah, that's the idea.
Also, this is limited to DataArray, right? So no flexibility to save a Dataset so far.
I'm still pretty new to xarray, so still wrapping my head around some concepts, but you can have multiple FixedShapeTensorArray columns in a single Arrow table. So as long as all DataArrays are aligned on the same axis (i.e. the unit of observation of the table), you can have as many stored in a GeoArrow table as you wish.
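A rough sketch of that idea with toy data (names are illustrative; WKB is used for the geometry column here just to keep the example simple):

import numpy as np
import pyarrow as pa
import shapely

geoms = shapely.points([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])   # stand-in geometries, one per observation
values = np.arange(12.0).reshape(3, 4)                      # 3 observations x 4 years

tensor_type = pa.fixed_shape_tensor(pa.float64(), (4,), dim_names=("year",))
tensor_col = pa.ExtensionArray.from_storage(
    tensor_type, pa.FixedSizeListArray.from_arrays(values.ravel(), 4)
)
table = pa.table({"geometry": pa.array(shapely.to_wkb(geoms)), "population": tensor_col})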
I'm not yet very familiar with arrow, but reading your comment @kylebarron it looks like reusing FixedShapeTensorArray in Xarray (or converting it to/from xarray objects) would better be addressed at the xarray.Variable level (or later the NamedArray level).
An Xarray Variable may wrap all sorts of array-like objects beyond numpy and dask arrays. You can see for example a number of wrapper classes in https://github.com/pydata/xarray/blob/main/xarray/core/indexing.py, e.g., PandasIndexingAdapter that wraps a pandas.Index. Those wrapper classes are internal to Xarray, though. These will probably be heavily refactored while implementing NamedArray, although I don't know if there are plans to provide some extension mechanism similar to Pandas' or PyArrow's ExtensionArray (cc @andersy005).
Xarray DataArray and Dataset are essentially collections of aligned Variable objects (a DataArray can be viewed as a special case of a Dataset with a unique data variable). Both are closer to the pyarrow.Table concept I think.
Mid or long-term, interoperability with Arrow is something that we'll need to address in Xarray itself if we want to keep the same level of interoperability with Pandas.
At the moment, we have a decent set of functionality to work with in-memory vector data cubes. What we do not have is a way of saving them to disk and reading them back again, apart from using pickle, which is not very interoperable.
We should find a way of saving to a file format that allows us interoperability with @edzer's stars in RSpatial. @edzer, you were mentioning that NetCDF is able to handle VDC. Does stars implement that? Can we create a data cube indexed by geometry and save it via GDAL to NetCDF? I tried only very briefly to play with the write_stars function to no avail[^1].
The spec for the NetCDF way of storing geometries is here but as someone who has never worked with the file format, it is not super clear to me how it all works together.
Over in Python, there is some work @dcherian posted on conversion of shapely geometries to CF geometries and back (https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.shapely_to_cf.html) [^2]. There is also https://github.com/twhiteaker/cfgeom which may potentially be used (although it is only a reference implementation that is not maintained).
So it seems that NetCDF should be able to store vector data cubes and we may just need to figure out some wrappers for a convenient IO from xvec-backed xarray objects. But I'd welcome some help or at least a guidance from someone more familiar with CF conventions.
The other option seems to be Zarr as discussed here but that is at this stage only an idea and (Geo)Zarr spec is not ready as of now.
The last option is to convert the cube to a long-form dataframe and save it as a GeoParquet but that kind of breaks the point of having a cube in the first place.
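For completeness, a minimal sketch of that long-form fallback (toy data; the variable names and CRS are just for illustration):

import geopandas as gpd
import numpy as np
import shapely
import xarray as xr

# tiny illustrative cube: two point geometries x three years
geoms = shapely.points([0, 1], [0, 1])
cube = xr.Dataset(
    {"traffic": (("geometry", "year"), np.arange(6.0).reshape(2, 3))},
    coords={"geometry": geoms, "year": [2000, 2001, 2002]},
)

df = cube.to_dataframe().reset_index()                           # one row per (geometry, year) pair
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")
gdf.to_parquet("cube_long_form.parquet")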
[^1]: Getting Error in !all.equal(match(xydims, names(d)), 1:2) : invalid argument type when trying to use write_stars(st, 'test.nc') on the example cube from https://r-spatial.github.io/stars/articles/stars1.html#vector-data-cube-example.
[^2]: Only points are implemented so far but that may change.