xarray-contrib / xvec

Vector data cubes for Xarray
https://xvec.readthedocs.io
MIT License

getting values from raster cubes for vector geometries #35

Closed · edzer closed this issue 10 months ago

edzer commented 1 year ago

Fantastic work you've done so far! There are two related use cases I have in mind where vector cubes shine, both related to the extraction of information from raster data cubes:

The input is an n-dimensional raster data cube with spatial dimensions x & y, along with a set of m vector geometries (points to sample at, or polygons/lines to aggregate over).

The output is a vector data cube with n-1 dimensions, where the raster dimensions x & y are replaced by a single vector dimension holding the m geometries.

This was illustrated here, which in turn points to a blog post by @rabernat, but there the aggregation took place over a single geometry, after which the dimension was dropped and a single time series remained. When we do this for a set of geometries, a vector data cube results. The aggregation method used there ("conservative region aggregation") is nice to have but not a primary priority IMO.

We will need this to properly implement the openEO process aggregate_spatial in xarray-based openEO backends. @m-mohr @LukeWeidenwalker

martinfleis commented 1 year ago

Thanks a lot for these suggestions!

I have the PR for the first case (values for points) in #36. See how it should work in the docs rendered from that PR: https://xvec--36.org.readthedocs.build/en/36/sampling.html. Is that what you had in mind?
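
For reference, the accessor call from that PR, as it later stabilised in xvec, looks roughly like the sketch below (method and argument names follow the current docs and should be treated as assumptions for the state of PR #36):

import shapely
import xarray as xr
import xvec  # noqa: F401 -- registers the .xvec accessor

ds = xr.tutorial.open_dataset("eraint_uvz")
# a few illustrative sample points
points = shapely.points([-180, -60.25, 59.5], [-90, -30, 30])
sampled = ds.xvec.extract_points(points, x_coords="longitude", y_coords="latitude")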

For the latter case, I am aware of the implementations of zonal stats in rasterstats, geocube, and xarray-spatial. From what I remember from when I dealt with this some time ago, the latter two first create a single DataArray as a rasterised vector layer, which causes trouble when the cell size of the source grid is large compared to the size of the geometries: some geometries simply don't make it into the rasterised version of the GeoDataFrame. They are much faster, though :). And rasterisation doesn't work for overlapping polygons at all.

I think that wrapping rasterstats could be the best shot here, as that should give us the closest result to what the openEO docs describe. But I may be missing some newer option.

edzer commented 1 year ago

Nice! For the sake of reproducibility I would not use random numbers without setting a seed. Also, if I copied your example to R, I'd need all 16 digits, not the 5 or so that are printed, so writing out the 10 points (or generating them with a fixed seed, then rounding to 5 digits) may be more useful. Can we also see the array values (z, u, v) that result from the sampling? Is the eraint_uvz dataset readable somewhere as a NetCDF or so?

martinfleis commented 1 year ago

Good point. I have changed the point generation from random to evenly spaced within bounds, which should be easily reproducible in R.

The dataset lives at https://github.com/pydata/xarray-data as a NetCDF. I don't want to print the resulting z, u, v arrays in the documentation, but if you want to check my result against stars, it is below (using the updated points).
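
For reference, sampled_da can be reproduced in plain xarray along these lines (a sketch; xvec's point extraction does something similar under the hood, and the evenly spaced points match the ones listed in the R snippet further down):

import numpy as np
import xarray as xr

# vectorised nearest-neighbour sampling of the tutorial copy of the
# dataset at 10 evenly spaced points within its bounds
ds = xr.tutorial.open_dataset("eraint_uvz")
x = np.linspace(-180, 179.25, 10)  # longitudes
y = np.linspace(-90, 90, 10)       # latitudes
sampled_da = ds.sel(
    longitude=xr.DataArray(x, dims="points"),
    latitude=xr.DataArray(y, dims="points"),
    method="nearest",
)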

>>> sampled_da.z.values
array([[[109808.00940762, 110936.17737136, 114934.79104102,
         120956.86193005, 121479.5452527 , 121696.89871361,
         117580.98317616, 110710.19877312, 107525.79806812,
         106839.23713606],
        [ 50368.73796008,  50530.89054203,  53692.86588995,
          57192.9466215 ,  57491.37637337,  57505.17659311,
          55828.44989471,  52747.55083776,  50327.33730086,
          49723.57768724],
        [ 12371.55793353,  11503.86911738,  13330.67320546,
          14657.21932796,  14829.72207471,  14767.62108588,
          14824.54699231,  14688.26982238,  13266.84718916,
          12714.83839956]],

       [[103537.53456327, 105918.07246841, 112255.823384  ,
         118288.24443783, 121570.97170848, 121874.57654276,
         123275.29884637, 118666.02545321, 116587.36735488,
         114827.83933803],
        [ 47974.3998352 ,  48655.78568486,  52814.82690899,
          56378.73365684,  57548.3022798 ,  57550.02730727,
          57589.70293902,  56025.103026  ,  54800.33352408,
          53382.3609458 ],
        [ 11776.42345724,  10629.28019136,  13304.79779344,
          14810.74677257,  15048.80056308,  14790.04644296,
          13999.98386285,  14041.38452207,  13910.28243454,
          13485.92567753]]])

>>> sampled_da.u.values
array([[[-5.39432071e-01,  7.84308525e+00,  3.04381371e+01,
          2.63742675e+01,  5.21824071e+00,  8.12459943e+00,
          5.03753176e+01,  1.88127022e+01,  1.44532156e+00,
          1.28176025e+00],
        [-1.23456765e+00,  4.49951455e+00,  2.28749990e+01,
          7.64020631e+00, -5.45256230e+00, -2.75065521e+00,
          1.96871261e+01,  1.06566544e+01,  1.39976462e-01,
          1.93757821e+00],
        [ 1.35095926e+00,  1.21885205e+00,  1.42502852e+01,
         -1.46890069e+00, -5.04680442e+00, -4.56241130e+00,
          1.28962377e+00,  3.43793872e+00,  1.19526148e+00,
          3.22719626e+00]],

       [[-1.29747585e+00,  1.66879778e+01,  2.28749990e+01,
          3.62492818e+01,  1.12181100e+01, -1.57805156e+01,
          1.49250271e+00,  1.37816191e+01,  9.18774797e+00,
          3.93233458e-02],
        [-2.39050578e+00,  1.04993839e+01,  1.24055023e+01,
          1.53433151e+01, -1.85106799e+00, -5.82843878e+00,
          1.61674640e+00,  5.99987506e+00,  6.51572228e+00,
         -3.60143708e-01],
        [ 3.48511987e+00,  5.79699612e+00,  7.64020631e+00,
          1.58529230e+00, -6.10995296e+00,  2.81200215e+00,
          1.32112937e-01,  6.40096632e-01,  1.46891213e+00,
         -1.71419116e-01]]])

>>> sampled_da.v.values
array([[[-0.37502003,  0.89072514,  2.42978335,  0.78130436,
         -5.26550769,  8.09386159, -0.79693509, -8.56246567,
          6.28101252, -0.02334451],
        [ 2.51579095,  0.15631581,  1.03120422,  1.82820797,
         -0.46867275,  0.32833101, -0.90635586, -6.92210962,
          3.58610774, -1.062603  ],
        [ 3.3749113 ,  0.32833101,  0.43727397, -0.76539897,
          1.32793044, -2.82814789,  1.49230052, -1.12519742,
          0.37515737, -0.50020887]],

       [[-1.85960676, -3.27347613, -0.34396173, -1.12519742,
         -3.00780821, -0.47679569,  0.15631581,  1.3126402 ,
         -2.15633297, -1.08601618],
        [ 1.92186069, -2.87497425, -0.48444081, -1.54711248,
          0.08607627, -0.53891229, -5.34387017, -0.36737491,
         -1.14048766,  0.16396093],
        [ 3.42173766, -1.71864986, -0.72669555, -2.24998569,
          1.80479479,  1.35134362, -5.26550769, -0.53891229,
          0.49222326,  0.50799132]]])

martinfleis commented 1 year ago

> I think that wrapping rasterstats could be the best shot here, as that should give us the closest result to what the openEO docs describe.

Just a follow-up note: rasterstats works on a single array at a time, which means that if we want to aggregate multiple arrays at once (like R, G, B), we would need a loop that duplicates a lot of work, including the rasterization. So it may in the end be best to roll our own aggregation on top of rasterio's rasterize.

edzer commented 1 year ago

It's encouraging in any case that we get the same values:

library(stars)
# Loading required package: abind
# Loading required package: sf
# Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
c("POINT (-180 -90)", "POINT (-140.083 -70)", "POINT (-100.167 -50)",
       "POINT (-60.25 -30)", "POINT (-20.333 -10)", "POINT (19.583 10)",
       "POINT (59.5 30)", "POINT (99.417 50)", "POINT (139.333 70)",
       "POINT (179.25 90)") |> st_as_sfc(crs = 'OCG:CRS84') -> sfc
r = read_mdim("/tmp/eraint_uvz.nc")
# Warning messages:
# 1: Could not parse expression: ‘`m`**2 `s`**-2’. Returning as a single symbolic unit() 
# 2: Could not parse expression: ‘`m` `s`**-1’. Returning as a single symbolic unit() 
# 3: Could not parse expression: ‘`m` `s`**-1’. Returning as a single symbolic unit() 
e = st_extract(r, sfc)
e$z
# , , 1

#           [,1]     [,2]     [,3]
#  [1,] 109808.0 50368.74 12371.56
#  [2,] 110936.2 50530.89 11503.87
#  [3,] 114934.8 53692.87 13330.67
#  [4,] 120956.9 57192.95 14657.22
#  [5,] 121479.5 57491.38 14829.72
#  [6,] 121696.9 57505.18 14767.62
#  [7,] 117581.0 55828.45 14824.55
#  [8,] 110710.2 52747.55 14688.27
#  [9,] 107525.8 50327.34 13266.85
# [10,] 106839.2 49723.58 12714.84

# , , 2

#           [,1]     [,2]     [,3]
#  [1,] 103537.5 47974.40 11776.42
#  [2,] 105918.1 48655.79 10629.28
#  [3,] 112255.8 52814.83 13304.80
#  [4,] 118288.2 56378.73 14810.75
#  [5,] 121571.0 57548.30 15048.80
#  [6,] 121874.6 57550.03 14790.05
#  [7,] 123275.3 57589.70 13999.98
#  [8,] 118666.0 56025.10 14041.38
#  [9,] 116587.4 54800.33 13910.28
# [10,] 114827.8 53382.36 13485.93

(the warnings indicate that the units are wrongly encoded: they should have been written without the ** in them, since integer exponents following a unit are always interpreted as powers)

edzer commented 1 year ago

> But it doesn't work for overlapping polys at all.

Another major flaw of the simple features standard: most of the time we work with polygon coverages, i.e. no overlapping polygons, but we have no way of declaring this, or of benefiting from it (GEOS now can, I believe for union, but our simple-feature data formats won't tell us they're coverages).

edzer commented 1 year ago

> Just a follow-up note: rasterstats works on a single array at a time, which means that if we want to aggregate multiple arrays at once (like R, G, B), we would need a loop that duplicates a lot of work, including the rasterization.

If you label the polygons 1:n (eh, 0:(n-1)) and rasterize that, then you can use that layer as a grouping index for all data cube layers; no need to repeat the rasterization.
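
In code, this labelling trick could look roughly like the following sketch (assumed names: polygons is a GeoDataFrame, ds the raster cube, transform the affine transform of its grid):

from rasterio import features

# rasterize the polygon indices once...
labels = features.rasterize(
    ((geom, i) for i, geom in enumerate(polygons.geometry)),
    out_shape=(ds.sizes["latitude"], ds.sizes["longitude"]),
    transform=transform,
    fill=-1,        # marks cells covered by no polygon
    dtype="int32",
)
# ...then reuse that single label layer to group every layer of the cube
ds = ds.assign_coords(region=(("latitude", "longitude"), labels))
means = ds.groupby("region").mean()  # drop the -1 "outside" group afterwards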

martinfleis commented 1 year ago

> Just a follow-up note: rasterstats works on a single array at a time, which means that if we want to aggregate multiple arrays at once (like R, G, B), we would need a loop that duplicates a lot of work, including the rasterization.

> If you label the polygons 1:n (eh, 0:(n-1)) and rasterize that, then you can use that layer as a grouping index for all data cube layers; no need to repeat the rasterization.

Yup, that is what geocube does, and it is quite efficient and good enough in many cases. But I had a case where this was not feasible, as some polygons were too small compared to the cells and did not cover any cell centroid. You still want to extract the data for those, though (using all touched cells as an opt-in method). Is this what you use in stars?
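
A per-geometry fallback for such small polygons could look roughly like this sketch (again with assumed names; all_touched=True is the opt-in mentioned above):

import xarray as xr
from rasterio import features

# mask cells touched by a single (possibly tiny) polygon, even when it
# covers no cell centroid
inside = features.geometry_mask(
    [geom],
    out_shape=(ds.sizes["latitude"], ds.sizes["longitude"]),
    transform=transform,
    invert=True,       # True inside the geometry
    all_touched=True,  # include every touched cell, not just centroids
)
mask = xr.DataArray(inside, dims=("latitude", "longitude"))
series = ds.where(mask).mean(("latitude", "longitude"))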

edzer commented 1 year ago

There are several paths in stars: the simple pixel=point approach; converting pixels to square polygons (st_interpolate_aw, for area-weighted polygon-polygon interpolation); or using exactextractr, which comes with an R package and is blazingly fast.

martinfleis commented 1 year ago

I think I'll give it a bit of time to think this through. We can maybe touch on this in person when we meet.

For a reference:

edzer commented 1 year ago

xESMF and xagg were the two options mentioned in @rabernat's blog.

rabernat commented 1 year ago

I'm not following every detail here, but it's great to see all of this progress! 🚀

One potential suggestion would be to consider leveraging Xarray's groupby feature somehow. It's quite powerful. For more advanced and scalable grouping there is also the flox package. A two-step approach would be to:

  1. Turn the vector geometries into categorical arrays somehow. (I'm fuzzy on whether this is always possible.)
  2. Hand things off to groupby / flox.

One nice thing about that is that the user can specify the aggregation operation, e.g.

ds.groupby("region").mean("region")
ds.groupby("region").max("region")
# etc
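
For scale, the flox route could look roughly like the following one-liner (assuming the categorical "region" coordinate from step 1 already exists on the dataset):

import flox.xarray

# same group-then-reduce, but via flox, which scales better on
# dask-backed cubes
result = flox.xarray.xarray_reduce(ds, "region", func="mean")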

edzer commented 1 year ago

> (I'm fuzzy on whether this is always possible.)

Thanks - yes, this is indeed possible if the polygons don't overlap, so every pixel can be assigned to a single geometry. But that is the most common use case, IMO.

martinfleis commented 10 months ago

With the help of @masawdah, we now have decent zonal_stats functionality for aggregation over a set of polygons or linestrings. There are two implementations: one based on rasterio.features.rasterize that creates a categorical array followed by a group-by (which is quite fast and the default), and another that iterates over individual geometries and uses rasterio.features.geometry_mask with a standard aggregation. The latter is much slower, even though it runs in parallel, but it supports use cases with overlapping geometries, cases where small polygons would be dropped during rasterisation, and situations where memory is tight (I think that is the main reason @masawdah implemented it originally).

Both support the standard aggregations available in xarray (including passing kwargs to things like quantile), custom aggregations via any callable accepted by reduce, or a list of these (with #56).
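
A minimal usage sketch (the geometry source gdf and the coordinate names are assumptions; stats and method mirror the description above):

import xvec  # noqa: F401 -- registers the .xvec accessor

agg = ds.xvec.zonal_stats(
    gdf.geometry,        # polygons or linestrings to aggregate over
    x_coords="longitude",
    y_coords="latitude",
    stats="mean",        # a standard xarray aggregation, a callable, or a list
    method="rasterize",  # fast default; "iterate" handles overlaps and tiny polygons
)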

Once exactextract's Python bindings are released (they already exist!), I'd like to expose that for weighted aggregation. Or we can do it via geopandas if someone wants to make a PR :). But I suppose that the GDAL-based options we have will cover the majority of use cases.

I will add some proper documentation of these in the coming days and cut 0.2.0. @masawdah, could you check whether the current implementation works for openEO needs before we do the release?

edzer commented 10 months ago

awesome!!

masawdah commented 10 months ago

> With the help of @masawdah, we now have decent zonal_stats functionality for aggregation over a set of polygons or linestrings. [...] @masawdah, could you check whether the current implementation works for openEO needs before we do the release?

@martinfleis I did a test with the current implementation and it worked as expected for openEO. Looking forward to using it in the new release :)

martinfleis commented 10 months ago

@masawdah That is great to hear! I'll try to cut 0.2.0 before leaving for the holidays!