masawdah closed this pull request 10 months ago
Thanks! I'll have a look later but note that you need to add new deps to https://github.com/xarray-contrib/xvec/tree/main/ci
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (6bed799) 99.06% compared to head (c230edb) 99.20%. Report is 3 commits behind head on main.
I'll go through the notes and work on them.
Hey, I went through the notes. The code now passes pre-commit, and I made dask and the number of jobs optional for cases where the data is big, as follows:
extracted = ds.xvec.zonal_stats(polys, stat="mean", dask = False, n_jobs = -1)
ds1 = ds.chunk(dict(x=4,y=4))
extracted = ds1.xvec.zonal_stats(polys, stat="mean", dask = True, n_jobs = -1)
And I think it's possible to use geodatasets and the xarray example data to get some raster cubes instead of storing the data in the repo. I'll look into that and let you know.
I've tested it using counties fetched from geodatasets and the 'air_temperature' data cube. It works, but the order of the (x, y) dimensions is not consistent with what is implemented now (which follows the data cube structure in openEO), so I had to modify the order manually (see the transpose sketch after the example below). We can add a function to make sure the dimension order is as expected. What do you think?
from geodatasets import get_path
import geopandas as gpd
import xarray as xr
import xvec

counties = gpd.read_file(get_path("geoda.natregimes"))
polys = counties.geometry.values
ds = xr.tutorial.open_dataset("air_temperature")
extracted = ds.xvec.zonal_stats(polys[:2], stat="sum", dask=False, n_jobs=-1)
extracted
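For reference, the manual reorder boils down to a transpose; a minimal sketch, assuming the openEO-style target order is (time, lat, lon):

```python
# Sketch only: enforce the expected axis order before aggregating; the
# target order (time, lat, lon) is an assumption, not part of the PR.
ds = ds.transpose("time", "lat", "lon")
```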
> We can add a function to make sure the dimension order is as expected. What do you think?

In `extract_points`, we have the dimension mapping exposed directly as `x_coords` and `y_coords` (https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.extract_points.html). We could maybe use something like that?
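For illustration, a call along those lines (the sample points below are made up):

```python
import shapely
import xarray as xr
import xvec

ds = xr.tutorial.open_dataset("air_temperature")
points = shapely.points([245.0, 250.0], [40.0, 45.0])  # arbitrary sample points
# The caller names the spatial coordinates explicitly, so nothing about
# the dimension order has to be assumed internally.
extracted = ds.xvec.extract_points(points, x_coords="lon", y_coords="lat")
```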
> In `extract_points`, we have the dimension mapping exposed directly as `x_coords` and `y_coords` (https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.extract_points.html). We could maybe use something like that?
I'll take a look at that. But for now I made (lon, lat) the default dimension names to aggregate over, and added a function that renames any other likely names (kept in a dictionary) to (lon, lat), so the input is consistent with what the function expects. So now it works both for data cubes from openEO and for data cubes from xarray.
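Roughly like this (the alias dictionary below is illustrative, not the exact one in the code):

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
# Illustrative aliases only; the PR keeps its own dictionary of names.
rename_map = {"x": "lon", "y": "lat", "longitude": "lon", "latitude": "lat"}
ds = ds.rename({old: new for old, new in rename_map.items() if old in ds.dims})
```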
> In `extract_points`, we have the dimension mapping exposed directly as `x_coords` and `y_coords` (https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.extract_points.html). We could maybe use something like that?
I think it is safer to do it this way, supporting either the coordinate names or their positional order, as dask does not support dimension names (see https://docs.dask.org/en/stable/generated/dask.array.sum.html).
I've added the dimension mapping like the one used for `extract_points`, with an option to address the dask axes by position:
from geodatasets import get_path
import geopandas as gpd
import xarray as xr
import xvec
import dask

counties = gpd.read_file(get_path("geoda.natregimes"))
polys = counties.geometry.values
ds = xr.tutorial.open_dataset("air_temperature")
extracted = ds.xvec.zonal_stats(polys[:2], x_coords="lon", y_coords="lat", stat="mean", n_jobs=-1)

ds2 = ds.chunk(dict(lon=4, lat=4))
extracted = ds2.xvec.zonal_stats(polys[:2], x_axis=2, y_axis=1, stat="sum", dask=True, n_jobs=-1)
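The name-to-axis translation can be sketched with xarray's `get_axis_num` (illustration only, not the exact internal code):

```python
# Dask reductions expect positional axes, so translate the names first.
x_axis = ds2["air"].get_axis_num("lon")  # -> 2 for (time, lat, lon)
y_axis = ds2["air"].get_axis_num("lat")  # -> 1
summed = ds2["air"].data.sum(axis=(y_axis, x_axis))  # dask.array reduction
```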
Thanks for your review and feedback. I reviewed the suggestions and made the necessary modifications to the code accordingly.
Hi, I've adjusted the code to be more flexible and dimension-agnostic, as you can see below. I've just hard-coded the name of one coordinate as 'data_variables' to store the data variable names temporarily, but we could make it a user input. However, I believe this is not critical, as it's an internal step that doesn't affect the output, unless we intend to return the output as an xarray.DataArray.
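That internal step can be sketched with xarray's `to_array`, which stacks the data variables along a new named dimension (a sketch, not the exact code):

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
# Stack all data variables along a temporary 'data_variables' dimension;
# the variable names survive as coordinate labels along it.
stacked = ds.to_array(dim="data_variables")
# ... aggregation over the spatial dimensions would happen here ...
ds_back = stacked.to_dataset(dim="data_variables")  # restore the variables
```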
Hey, I did a deep dive into this, which resulted in a refactoring of the code. I opted to use more xarray logic in the process than before, which simplifies the code quite a bit.
I also noticed that the implementation was producing wrong results when I compared it to one based on `geocube`, so I fixed that as well. Since I now have a geocube-based implementation, I will make a follow-up PR exposing it as another engine; it is faster, but not suitable for small polygons. Hopefully, we can also expose `exactextract` once the Python bindings are released (a prototype is there).
I have removed chunking, as that is something a user can do themselves if they run into issues; in most of the use cases I tried it was more of a bottleneck than a help.
Plus some other changes you can see in the code.
I plan on adding tests etc. and then merging. I will follow up with the geocube version, a user guide notebook, and a release.
Looking at it in more detail, we can probably use `rasterio.features.rasterize` ourselves without depending on `geocube`. I'll look into that during the follow-up.
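Something along these lines, roughly (a sketch assuming a regular, axis-aligned grid; not the final engine):

```python
import numpy as np
import rasterio.features
from rasterio.transform import from_bounds

def zonal_mean(data, polygons, bounds):
    # Sketch only: burn each polygon onto the raster grid and average
    # the pixels it covers. `bounds` is (west, south, east, north).
    height, width = data.shape
    transform = from_bounds(*bounds, width=width, height=height)
    means = np.empty(len(polygons))
    for i, poly in enumerate(polygons):
        mask = rasterio.features.rasterize(
            [(poly, 1)], out_shape=(height, width), transform=transform
        )
        means[i] = data[mask == 1].mean()
    return means
```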
Sounds great, looking forward to the new release :)
Thanks for providing this valuable package. The idea of this PR is to aggregate a raster data cube over a set of polygon geometries (#35), as this is needed within the openEO process Open-EO/openeo-processes-dask#124. It is based on dask arrays and parallelization, which makes it efficient and fast. It has been tested aggregating polygons over data at the continental level and shows good performance.
I'd like to know your thoughts about that.
Here is an example:
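A minimal version, reconstructed from the snippets shown earlier in this thread (treat the exact signature as illustrative):

```python
from geodatasets import get_path
import geopandas as gpd
import xarray as xr
import xvec

counties = gpd.read_file(get_path("geoda.natregimes"))
polys = counties.geometry.values
ds = xr.tutorial.open_dataset("air_temperature")
extracted = ds.xvec.zonal_stats(polys[:2], x_coords="lon", y_coords="lat", stat="mean", n_jobs=-1)
```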