Closed: edzer closed this issue 10 months ago
Thanks a lot for these suggestions!
I have a PR for the first case (values for points) in #36. See how it works in the docs rendered from that PR: https://xvec--36.org.readthedocs.build/en/36/sampling.html. Is that what you had in mind?
For the latter case, I am aware of the zonal stats implementations in rasterstats, geocube and xarray-spatial. From what I remember when I dealt with this some time ago, the latter two first create a single DataArray as a rasterised vector layer, which causes trouble when the cell size of the source grid is large compared to the size of the geometries: some geometries simply don't make it into the raster version of the GeoDataFrame. On the other hand, they're much faster :). But the approach doesn't work for overlapping polygons at all.
I think that wrapping rasterstats could be the best shot here; that should give us the closest result to what the openEO docs describe. But I may be missing some newer option.
Nice! For the sake of reproducibility I would not use random numbers without setting a seed. Also, if I copied your example to R, I'd need all 16 digits, not the 5 or so that get printed, so maybe writing out 10 points (or generating with a fixed seed and then rounding to 5 digits) would be more useful. Can we also see the array values (z, u, v) that result from the sampling? Is the eraint_uvz dataset readable somewhere as a NetCDF or similar?
Good point, I have changed the point generation from random to evenly spaced within the bounds; that should be easily reproducible in R.
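For illustration, such evenly spaced points can be generated with a couple of numpy.linspace calls; the bounds used here are an assumption read off the 10 points printed in the R snippet later in the thread, not code from the actual PR:

```python
import numpy as np

# Evenly spaced sample locations within assumed dataset bounds
# (lon from -180 to 179.25, lat from -90 to 90), matching the 10 points
# listed in the R snippet below.
lons = np.linspace(-180, 179.25, 10)
lats = np.linspace(-90, 90, 10)
coords = list(zip(np.round(lons, 3), np.round(lats, 3)))
```

Deterministic generation like this sidesteps the seed/precision issue entirely: anyone can regenerate the exact same coordinates in any language.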
The dataset lives at https://github.com/pydata/xarray-data as a NetCDF. I don't want to print the resulting z, u, v arrays in the documentation, but if you want to check my result against stars, they are below (using the updated points).
>>> sampled_da.z.values
array([[[109808.00940762, 110936.17737136, 114934.79104102,
120956.86193005, 121479.5452527 , 121696.89871361,
117580.98317616, 110710.19877312, 107525.79806812,
106839.23713606],
[ 50368.73796008, 50530.89054203, 53692.86588995,
57192.9466215 , 57491.37637337, 57505.17659311,
55828.44989471, 52747.55083776, 50327.33730086,
49723.57768724],
[ 12371.55793353, 11503.86911738, 13330.67320546,
14657.21932796, 14829.72207471, 14767.62108588,
14824.54699231, 14688.26982238, 13266.84718916,
12714.83839956]],
[[103537.53456327, 105918.07246841, 112255.823384 ,
118288.24443783, 121570.97170848, 121874.57654276,
123275.29884637, 118666.02545321, 116587.36735488,
114827.83933803],
[ 47974.3998352 , 48655.78568486, 52814.82690899,
56378.73365684, 57548.3022798 , 57550.02730727,
57589.70293902, 56025.103026 , 54800.33352408,
53382.3609458 ],
[ 11776.42345724, 10629.28019136, 13304.79779344,
14810.74677257, 15048.80056308, 14790.04644296,
13999.98386285, 14041.38452207, 13910.28243454,
13485.92567753]]])
>>> sampled_da.u.values
array([[[-5.39432071e-01, 7.84308525e+00, 3.04381371e+01,
2.63742675e+01, 5.21824071e+00, 8.12459943e+00,
5.03753176e+01, 1.88127022e+01, 1.44532156e+00,
1.28176025e+00],
[-1.23456765e+00, 4.49951455e+00, 2.28749990e+01,
7.64020631e+00, -5.45256230e+00, -2.75065521e+00,
1.96871261e+01, 1.06566544e+01, 1.39976462e-01,
1.93757821e+00],
[ 1.35095926e+00, 1.21885205e+00, 1.42502852e+01,
-1.46890069e+00, -5.04680442e+00, -4.56241130e+00,
1.28962377e+00, 3.43793872e+00, 1.19526148e+00,
3.22719626e+00]],
[[-1.29747585e+00, 1.66879778e+01, 2.28749990e+01,
3.62492818e+01, 1.12181100e+01, -1.57805156e+01,
1.49250271e+00, 1.37816191e+01, 9.18774797e+00,
3.93233458e-02],
[-2.39050578e+00, 1.04993839e+01, 1.24055023e+01,
1.53433151e+01, -1.85106799e+00, -5.82843878e+00,
1.61674640e+00, 5.99987506e+00, 6.51572228e+00,
-3.60143708e-01],
[ 3.48511987e+00, 5.79699612e+00, 7.64020631e+00,
1.58529230e+00, -6.10995296e+00, 2.81200215e+00,
1.32112937e-01, 6.40096632e-01, 1.46891213e+00,
-1.71419116e-01]]])
>>> sampled_da.v.values
array([[[-0.37502003, 0.89072514, 2.42978335, 0.78130436,
-5.26550769, 8.09386159, -0.79693509, -8.56246567,
6.28101252, -0.02334451],
[ 2.51579095, 0.15631581, 1.03120422, 1.82820797,
-0.46867275, 0.32833101, -0.90635586, -6.92210962,
3.58610774, -1.062603 ],
[ 3.3749113 , 0.32833101, 0.43727397, -0.76539897,
1.32793044, -2.82814789, 1.49230052, -1.12519742,
0.37515737, -0.50020887]],
[[-1.85960676, -3.27347613, -0.34396173, -1.12519742,
-3.00780821, -0.47679569, 0.15631581, 1.3126402 ,
-2.15633297, -1.08601618],
[ 1.92186069, -2.87497425, -0.48444081, -1.54711248,
0.08607627, -0.53891229, -5.34387017, -0.36737491,
-1.14048766, 0.16396093],
[ 3.42173766, -1.71864986, -0.72669555, -2.24998569,
1.80479479, 1.35134362, -5.26550769, -0.53891229,
0.49222326, 0.50799132]]])
I think that wrapping rasterstats could be the best shot here, that should give us the closest result to what openEO docs describe.
Just a follow-up note - rasterstats is able to work with only a single array at a time, which means that if we want to aggregate multiple at the same time (like R, G, B), we would need a loop that would duplicate a lot of work, including rasterization. So it may in the end be best to roll out our own aggregation on top of rasterio's rasterize.
It's encouraging in any case that we get the same values:
library(stars)
# Loading required package: abind
# Loading required package: sf
# Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
c("POINT (-180 -90)", "POINT (-140.083 -70)", "POINT (-100.167 -50)",
"POINT (-60.25 -30)", "POINT (-20.333 -10)", "POINT (19.583 10)",
"POINT (59.5 30)", "POINT (99.417 50)", "POINT (139.333 70)",
"POINT (179.25 90)") |> st_as_sfc(crs = 'OGC:CRS84') -> sfc
r = read_mdim("/tmp/eraint_uvz.nc")
# Warning messages:
# 1: Could not parse expression: ‘`m`**2 `s`**-2’. Returning as a single symbolic unit()
# 2: Could not parse expression: ‘`m` `s`**-1’. Returning as a single symbolic unit()
# 3: Could not parse expression: ‘`m` `s`**-1’. Returning as a single symbolic unit()
e = st_extract(r, sfc)
e$z
# , , 1
# [,1] [,2] [,3]
# [1,] 109808.0 50368.74 12371.56
# [2,] 110936.2 50530.89 11503.87
# [3,] 114934.8 53692.87 13330.67
# [4,] 120956.9 57192.95 14657.22
# [5,] 121479.5 57491.38 14829.72
# [6,] 121696.9 57505.18 14767.62
# [7,] 117581.0 55828.45 14824.55
# [8,] 110710.2 52747.55 14688.27
# [9,] 107525.8 50327.34 13266.85
# [10,] 106839.2 49723.58 12714.84
# , , 2
# [,1] [,2] [,3]
# [1,] 103537.5 47974.40 11776.42
# [2,] 105918.1 48655.79 10629.28
# [3,] 112255.8 52814.83 13304.80
# [4,] 118288.2 56378.73 14810.75
# [5,] 121571.0 57548.30 15048.80
# [6,] 121874.6 57550.03 14790.05
# [7,] 123275.3 57589.70 13999.98
# [8,] 118666.0 56025.10 14041.38
# [9,] 116587.4 54800.33 13910.28
# [10,] 114827.8 53382.36 13485.93
(the warnings indicate that the units are wrongly encoded: they should have been written without the ** in them, since trailing integers are always interpreted as powers)
But it doesn't work for overlapping polys at all.
Another major flaw of the simple features standard: most of the time we work with polygon coverages, i.e. no overlapping polygons, but we have no way of declaring this, or of benefiting from it (GEOS now can, I believe for union, but our simple feature data formats won't tell us they're coverages).
Just a follow-up note - rasterstats is able to work with a single array at a time, which means that if we want to aggregate multiple at the same time (like R, G, B), we would need a loop that would duplicate a lot of work including rasterization.
If you label the polygons 1:n (eh, 0:(n-1)) and rasterize that, then you can use that layer as a grouping index for all data cube layers; no need to repeat the rasterization.
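The rasterize-once, group-by-labels idea can be sketched in plain numpy; the label array below is hand-made for illustration, whereas in practice it would come from a single call to rasterio.features.rasterize:

```python
import numpy as np

# Hypothetical label raster: each cell holds the index (0..n-1) of the
# polygon covering it, produced once, e.g. by rasterio.features.rasterize.
labels = np.array([[0, 0, 1],
                   [0, 1, 1],
                   [2, 2, 1]])

# Several data cube layers (e.g. R, G, B) sharing the same grid.
layers = {
    "r": np.arange(9, dtype=float).reshape(3, 3),
    "g": np.full((3, 3), 2.0),
}

n = labels.max() + 1
flat = labels.ravel()
counts = np.bincount(flat, minlength=n)

# Per-polygon mean for every layer, reusing the single label raster.
means = {name: np.bincount(flat, weights=arr.ravel(), minlength=n) / counts
         for name, arr in layers.items()}
```

The expensive rasterization happens once; each additional layer only costs one bincount pass, which is exactly why this scales to many-band cubes where a per-layer rasterstats loop would not.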
Yup, that is what geocube is doing and it is quite efficient and good enough in many cases. But I had a case where this was not feasible, as some polygons were too small compared to the cells and did not cover any cell centroid. You still want to extract the data for those, though (using all touched cells as an opt-in method). Is this what you use in stars?
There are several paths in stars: the simple pixel=point approach, converting pixels to square polygons (st_interpolate_aw for area-weighted polygon-polygon interpolation), or using exactextractr, which comes with an R package and is blazingly fast.
I think I'll give it a bit of time to think this through. We can maybe discuss this in person when we meet.
For a reference:
I'm not following every detail here, but it's great to see all of this progress! 🚀
One potential suggestion would be to consider leveraging Xarray's groupby feature somehow. It's quite powerful. For more advanced and scalable grouping there is also the flox package. A two-step approach would be to rasterize the geometries into an integer label layer and then group by those labels.
One nice thing about that is that the user can specify the aggregation operation, e.g.
ds.groupby("region").mean("region")
ds.groupby("region").max("region")
# etc
(I'm fuzzy on whether this is always possible.)
Thanks - yes, this is indeed possible if the polygons don't overlap, so every pixel can be assigned to a single geometry. And that is the most common use case, IMO.
With the help of @masawdah, we now have decent zonal_stats functionality for aggregation over a set of polygons or linestrings. There are two implementations: one based on rasterio.features.rasterize that creates a categorical array followed by a group-by (which is quite fast and the default), and another that iterates over individual geometries and uses rasterio.features.geometry_mask with standard aggregation. The latter is way slower even though it runs in parallel, but it should support use cases with overlapping geometries, cases where small polygons are dropped during rasterisation, or cases where you may have memory issues (I think that is the main reason @masawdah implemented it originally).
All of them support the standard aggregations available in xarray (including passing kwargs to things like quantile), custom aggregations via any callable accepted by reduce, or a list of these (with #56).
Once exactextract's Python bindings are released (they already exist!), I'd like to expose that for weighted aggregation. Or we can do it via geopandas if someone wants to make a PR :). But I suppose that the GDAL-based options we have will cover the majority of use cases.
I will add some proper documentation of these in the coming days and cut 0.2.0. @masawdah could you check whether the current implementation works for openEO needs before we do the release?
awesome!!
@martinfleis I did a test with the current implementation and it worked as expected for openEO. Looking forward to using it in the new release :)
@masawdah That is great to hear! I'll try to cut 0.2.0 before leaving for holidays!
What you've done so far is fantastic work! There are two related use cases I have in mind where vector cubes shine, both related to the extraction of information from raster data cubes:
The input is an n-dimensional raster data cube and a set of m geometries.
The output is a vector data cube with n-1 dimensions, where the raster dimensions x & y are replaced by a single vector dimension holding the m geometries.
This was illustrated here, which again points to a blog post by @rabernat, but there the aggregation took place over a single geometry, after which the dimension was dropped and a single time series remained. When we do this for a set of geometries, a vector data cube results. The aggregation method used there ("conservative region aggregation") is nice to have but not a primary priority IMO.
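For intuition, "conservative region aggregation" reduces to an overlap-weighted mean per region; the toy numbers below are made up for illustration, and in practice the overlap fractions are computed from the actual geometry (e.g. by exactextract / exactextractr):

```python
import numpy as np

# Toy area-weighted ("conservative") aggregation for one region: each source
# cell contributes in proportion to the fraction of its area inside the region.
values = np.array([10.0, 20.0, 30.0])   # values of cells intersecting the region
overlap = np.array([1.0, 0.5, 0.25])    # assumed fraction of each cell inside it

weighted_mean = np.sum(values * overlap) / np.sum(overlap)
```

Compared to the plain rasterized group-by, this weighting avoids biasing the result toward cells that only barely intersect the region.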
We will need this to properly implement the openEO process aggregate_spatial in xarray-based openEO backends. @m-mohr @LukeWeidenwalker