masawdah closed this pull request 10 months ago
Thanks! I'll have a look later but note that you need to add new deps to https://github.com/xarray-contrib/xvec/tree/main/ci
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (6bed799) 99.06% compared to head (c230edb) 99.20%. Report is 3 commits behind head on main.
I'll go through the notes and work on them.
Hey, I went through the notes. The code now passes pre-commit, and I made dask and the number of jobs optional for cases where the data is big, as follows:
extracted = ds.xvec.zonal_stats(polys, stat="mean", dask = False, n_jobs = -1)
ds1 = ds.chunk(dict(x=4,y=4))
extracted = ds1.xvec.zonal_stats(polys, stat="mean", dask = True, n_jobs = -1)
And I think it's possible to use geodatasets and the xarray example data to get some raster cubes instead of storing the data in the repo. I'll look into that and let you know.
I've tested it using counties fetched from geodatasets and the 'air_temperature' data cube. It works, but the order of the (x, y) dimensions is not consistent with what is implemented now (which follows the data cube structure in openEO), so I had to modify the order manually (see the transpose sketch after the example below). We can add a function to make sure the dimension order is as expected. What do you think?
from geodatasets import get_path
import geopandas as gpd
import xarray as xr
import xvec

counties = gpd.read_file(get_path("geoda.natregimes"))
polys = counties.geometry.values
ds = xr.tutorial.open_dataset("air_temperature")
extracted = ds.xvec.zonal_stats(polys[:2], stat="sum", dask=False, n_jobs=-1)
extracted
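For reference, the manual reorder boils down to a transpose; a minimal sketch, assuming the openEO-style target order is (time, lat, lon):

```python
# Sketch only: enforce the expected axis order before aggregating; the
# target order (time, lat, lon) is an assumption, not part of the PR.
ds = ds.transpose("time", "lat", "lon")
```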
> We can add a function to make sure the dimension order is as expected. What do you think?

In `extract_points`, we have the dimension mapping exposed directly as `x_coords` and `y_coords` (https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.extract_points.html). We could maybe use something like that?
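For illustration, a call along those lines (the sample points below are made up):

```python
import shapely
import xarray as xr
import xvec

ds = xr.tutorial.open_dataset("air_temperature")
points = shapely.points([245.0, 250.0], [40.0, 45.0])  # arbitrary sample points
# The caller names the spatial coordinates explicitly, so nothing about
# the dimension order has to be assumed internally.
extracted = ds.xvec.extract_points(points, x_coords="lon", y_coords="lat")
```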
> In `extract_points`, we have the dimension mapping exposed directly as `x_coords` and `y_coords` (https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.extract_points.html). We could maybe use something like that?
I'll take a look at that. But for now I made (lon, lat) the default dimension names to aggregate over, and added a function that renames any other likely names (kept in a dictionary) to (lon, lat), so the input is consistent with what the function expects. So now it works both for data cubes from openEO and for data cubes from xarray.
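Roughly like this (the alias dictionary below is illustrative, not the exact one in the code):

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
# Illustrative aliases only; the PR keeps its own dictionary of names.
rename_map = {"x": "lon", "y": "lat", "longitude": "lon", "latitude": "lat"}
ds = ds.rename({old: new for old, new in rename_map.items() if old in ds.dims})
```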
> In `extract_points`, we have the dimension mapping exposed directly as `x_coords` and `y_coords` (https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.extract_points.html). We could maybe use something like that?
I think it is safer to do it this way, supporting either the coordinate names or their positional order, as dask does not support dimension names (see https://docs.dask.org/en/stable/generated/dask.array.sum.html).
I've added the dimension mapping like the one used for `extract_points`, with an option to address the dask axes by position:
from geodatasets import get_path
import geopandas as gpd
import xarray as xr
import xvec
import dask

counties = gpd.read_file(get_path("geoda.natregimes"))
polys = counties.geometry.values
ds = xr.tutorial.open_dataset("air_temperature")
extracted = ds.xvec.zonal_stats(polys[:2], x_coords="lon", y_coords="lat", stat="mean", n_jobs=-1)

ds2 = ds.chunk(dict(lon=4, lat=4))
extracted = ds2.xvec.zonal_stats(polys[:2], x_axis=2, y_axis=1, stat="sum", dask=True, n_jobs=-1)
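The name-to-axis translation can be sketched with xarray's `get_axis_num` (illustration only, not the exact internal code):

```python
# Dask reductions expect positional axes, so translate the names first.
x_axis = ds2["air"].get_axis_num("lon")  # -> 2 for (time, lat, lon)
y_axis = ds2["air"].get_axis_num("lat")  # -> 1
summed = ds2["air"].data.sum(axis=(y_axis, x_axis))  # dask.array reduction
```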
Thanks for your review and feedback. I reviewed the suggestions and made the necessary modifications to the code accordingly.
Hi, I've adjusted the code to be more flexible and dimension-agnostic, as you can see below. I've just hard-coded the name of one coordinate as 'data_variables' to store the data variable names temporarily, but we could make it a user input. However, I believe this is not critical, as it's an internal step that doesn't affect the output, unless we intend to return the output as an xarray.DataArray.
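That internal step can be sketched with xarray's `to_array`, which stacks the data variables along a new named dimension (a sketch, not the exact code):

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
# Stack all data variables along a temporary 'data_variables' dimension;
# the variable names survive as coordinate labels along it.
stacked = ds.to_array(dim="data_variables")
# ... aggregation over the spatial dimensions would happen here ...
ds_back = stacked.to_dataset(dim="data_variables")  # restore the variables
```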
Hey, I did a deep dive into this, which resulted in a refactoring of the code. I opted to use more xarray logic in the process than before, which simplifies the code quite a bit.
I also noticed that the implementation was producing wrong results when I compared it to one based on `geocube`, so I fixed that as well. Since I now have a geocube-based implementation, I will make a follow-up PR exposing it as another engine; it is faster, but not suitable for small polygons. Hopefully, we can also expose `exactextract` once the Python bindings are released (a prototype is there).
I have removed chunking, as that is something a user can do themselves if they run into issues; in most of the use cases I tried it was more of a bottleneck than a help.
Plus some other changes you can see in the code.
I plan on adding tests etc. and then merging. I will follow up with the geocube version, a user guide notebook, and a release.
Looking at it in more detail, we can probably use `rasterio.features.rasterize` ourselves without depending on `geocube`. I'll look into that during the follow-up.
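Something along these lines, roughly (a sketch assuming a regular, axis-aligned grid; not the final engine):

```python
import numpy as np
import rasterio.features
from rasterio.transform import from_bounds

def zonal_mean(data, polygons, bounds):
    # Sketch only: burn each polygon onto the raster grid and average
    # the pixels it covers. `bounds` is (west, south, east, north).
    height, width = data.shape
    transform = from_bounds(*bounds, width=width, height=height)
    means = np.empty(len(polygons))
    for i, poly in enumerate(polygons):
        mask = rasterio.features.rasterize(
            [(poly, 1)], out_shape=(height, width), transform=transform
        )
        means[i] = data[mask == 1].mean()
    return means
```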
Sounds great, looking forward to the new release :)
Thanks for providing this valuable package. The idea of this PR is to aggregate a raster data cube over a set of polygon geometries (#35), as this is needed within the openEO process Open-EO/openeo-processes-dask#124. It is based on dask arrays and parallelization, which makes it efficient and fast. It has been tested aggregating polygons over data at the continental level and shows good performance.
I'd like to know your thoughts about that.
Here is an example:
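A minimal version, reconstructed from the snippets shown earlier in this thread (treat the exact signature as illustrative):

```python
from geodatasets import get_path
import geopandas as gpd
import xarray as xr
import xvec

counties = gpd.read_file(get_path("geoda.natregimes"))
polys = counties.geometry.values
ds = xr.tutorial.open_dataset("air_temperature")
extracted = ds.xvec.zonal_stats(polys[:2], x_coords="lon", y_coords="lat", stat="mean", n_jobs=-1)
```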