xarray-contrib / xbatcher

Batch generation from xarray datasets
https://xbatcher.readthedocs.io
Apache License 2.0

Slicing multiple data layers or band channels with different spatial resolutions #93

Open weiji14 opened 2 years ago

weiji14 commented 2 years ago

Is your feature request related to a problem?

To enable chipping/batching datasets with different spatial resolutions, each dataset (either an xarray.DataArray or xarray.Dataset) currently needs to be sliced separately in xbatcher v0.1.0. The key limitation is that xbatcher assumes every xarray.DataArray 'layer' has the same resolution, so that xbatcher.BatchGenerator can use xarray's .isel method to index and slice along the specified dimensions.

https://github.com/xarray-contrib/xbatcher/blob/72ce00f2b5c9af71f102c5e6543cef5a78e15e35/xbatcher/generators.py#L41-L43
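For context, the kind of single-resolution window slicing that BatchGenerator relies on can be sketched like this (`iter_windows` is a hypothetical stand-in for the generator's internals, not xbatcher API):

```python
import numpy as np
import xarray as xr

# A single-resolution 2D DataArray: an 8x8 grid.
da = xr.DataArray(np.arange(64).reshape(8, 8), dims=("y", "x"))

def iter_windows(da, size=4):
    # Cut fixed-size windows along the specified dims with .isel,
    # as BatchGenerator does internally. This only works when every
    # layer shares the same index-to-coordinate mapping (resolution).
    for y0 in range(0, da.sizes["y"], size):
        for x0 in range(0, da.sizes["x"], size):
            yield da.isel(y=slice(y0, y0 + size), x=slice(x0, x0 + size))

batches = list(iter_windows(da))
print(len(batches), batches[0].shape)  # 4 windows of shape (4, 4)
```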

However, this is not always the case; Sentinel-2, for example, provides bands at 10m, 20m, and 60m spatial resolutions.

Describe the solution you'd like

Ideally, there would be:

  1. A way to store multi-resolution datasets in a single datacube-like structure, while having each layer stacked in the same geographical (or n-dimensional) space. This is something datatree might be able to handle, i.e. have each data layer with a different resolution be on a separate node of the datatree:

     Sentinel-2
     |---- 10m bands
     |---- 20m bands
     |---- 60m bands

  2. From the multi-resolution data object, xbatcher would then need to have a way of slicing these multi-resolution datasets. Maybe DataTree.isel could work?

Describe alternatives you've considered

Keep xbatcher to be focused on xarray.DataArray and xarray.Dataset only (and not bring in xarray.DataTree). Users would then need to implement their own way of slicing multi-resolution datasets themselves in an ad-hoc way.

Additional context

There was some discussion before at https://github.com/microsoft/torchgeo/issues/279 about sampling in pixel/image units or coordinate reference system (CRS) units. If working with multi-resolution datasets though, sampling in pixel/images would require some math (e.g. 20 pixels for a 500m resolution grid would be 10 pixels for a 1000m resolution grid). The CRS based indexing method however, would require something like https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray.raster_dataset.RasterDataset.clip_box.
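That pixel-unit arithmetic could be sketched as follows (`window_pixels` is a hypothetical helper, not an existing API):

```python
def window_pixels(window_m: float, resolution_m: float) -> int:
    """Convert a window size in CRS units (metres) to a pixel count.

    The same geographic window spans different pixel counts at
    different resolutions, which is the bookkeeping a pixel-unit
    sampler would need to do for multi-resolution data.
    """
    pixels, remainder = divmod(window_m, resolution_m)
    if remainder:
        raise ValueError("window is not a whole number of pixels")
    return int(pixels)

# A 10,000 m window is 20 pixels at 500 m but only 10 at 1000 m.
print(window_pixels(10_000, 500))   # 20
print(window_pixels(10_000, 1000))  # 10
```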

Other references:

djhoese commented 2 years ago

I just wanted to throw out Satpy's Scene.crop as an example of something very similar to this request. The meat of the logic is here:

https://github.com/pytroll/satpy/blob/a4417c69e77159413360860019c0d4e902694a30/satpy/scene.py#L712-L743

Basically Satpy uses the pyresample library's AreaDefinition and SwathDefinition objects to define the region/bounds and CRS of the data. These objects have the necessary logic to say "I have this lon/lat (or x/y) bounding box, what are the slicing indexes for your Area". We take care to grab the coarsest resolution AreaDefinition as our base reference for what the slices should be and then do the necessary arithmetic to the indexes to make them work for the finer resolutions. This way even if the bounding box/sub-region would result in tighter (more accurate) slices for the finer resolutions alone, we make all the resolutions cover the same geographic area so calculations can be done between the resolutions (i.e. our result is never half a coarse pixel).
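A heavily simplified sketch of that coarsest-reference bookkeeping (`aligned_slices` is a hypothetical stand-in for pyresample's AreaDefinition logic, not Satpy API):

```python
def aligned_slices(coarse_slice: slice, ratio: int) -> slice:
    """Scale a slice computed on the coarsest grid to a finer grid.

    `ratio` is coarse_resolution / fine_resolution (assumed integer).
    Because the fine slice is derived from the coarse one, every
    resolution covers exactly the same geographic area -- the result
    is never half a coarse pixel.
    """
    return slice(coarse_slice.start * ratio, coarse_slice.stop * ratio)

# Coarse grid (e.g. 60 m): rows 3..7. Finer 20 m grid (ratio 3):
coarse = slice(3, 7)
fine = aligned_slices(coarse, 3)
print(fine)  # slice(9, 21): 4 coarse pixels == 12 fine pixels
```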

This logic makes sense for most satellite-based sensors we work with (GOES-R ABI, VIIRS, etc). Other sensors like MODIS don't follow this nice even alignment of the pixels, but that is typically ignored for most of the imagery work we do:

https://www.icare.univ-lille.fr/modis-geolocation/

jhamman commented 2 years ago

This is not an area I have all that much experience in but after carefully reading this issue, I have a straw man proposal to offer. Before I get to my proposal, I'll offer a few bits of perspective that may be useful to ground the discussion here.

  1. As @weiji14 makes clear, the existing batch generator in Xbatcher is really built around the assumption that xobj.isel is sufficient for selecting batches. This works because DataArrays and Datasets do a fairly good job at enforcing alignment. I think there is clearly value in this approach and we should keep it.
  2. Multi-resolution datasets (or more generally, datasets that don't align to a single dataset's coordinate system) should also be supported by Xbatcher when connected to Datatree objects, but we will need some specific/opinionated adapters to make this work.

My thought is to build a new batch generator class that works with DataTree objects but uses one part of the datatree as its reference for batching the others. I think it could be quite similar to the concept @djhoese introduced with Satpy's scene.crop method.

Imagine if you had a class like this:

from typing import Iterator

from datatree import DataTree

class BboxBatchGenerator:
    def __init__(self, dt: DataTree, ref: str, input_dims, ...):
        ...

    def __iter__(self) -> Iterator[DataTree]:
        ...

gen = BboxBatchGenerator(dt, ref='10m')

Under the hood, you may imagine using the existing Xbatcher generator for the reference dataset (e.g. 10m) then using a more complex selection method (e.g. .sel) to extract batches from subsequent datasets (e.g. 20m and 60m).
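A minimal sketch of that under-the-hood idea, using two toy layers rather than a DataTree (all data and coordinates here are illustrative assumptions, not xbatcher API):

```python
import numpy as np
import xarray as xr

# Two hypothetical layers covering the same 80 m extent at different
# resolutions: a 10 m reference (8x8) and a 20 m layer (4x4), with
# cell-centre coordinates in CRS units (metres).
ref = xr.DataArray(
    np.zeros((8, 8)), dims=("y", "x"),
    coords={"y": np.arange(5.0, 85.0, 10.0), "x": np.arange(5.0, 85.0, 10.0)},
)
coarse = xr.DataArray(
    np.zeros((4, 4)), dims=("y", "x"),
    coords={"y": np.arange(10.0, 90.0, 20.0), "x": np.arange(10.0, 90.0, 20.0)},
)

# Batch the reference layer positionally with .isel, then pull the
# matching window out of the coarser layer by coordinate with .sel.
patch = ref.isel(y=slice(0, 4), x=slice(0, 4))
coarse_patch = coarse.sel(
    y=slice(patch.y.min().item(), patch.y.max().item()),
    x=slice(patch.x.min().item(), patch.x.max().item()),
)
print(patch.shape, coarse_patch.shape)  # (4, 4) and (2, 2)
```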

Underlying my proposal here is that I think we should be open to developing additional generators in Xbatcher. Today we have just BatchGenerator but supporting additional data models (e.g. datatree) or more complex batch indexing may necessitate new generators.

weiji14 commented 1 year ago

Just noting that there's a PR on upstreaming datatree into xarray at https://github.com/pydata/xarray/pull/7418. If/when that PR is merged and a new xarray release is made, any code depending on datatree should use from xarray import DataTree instead of from datatree import DataTree.

> Under the hood, you may imagine using the existing Xbatcher generator for the reference dataset (e.g. 10m) then using a more complex selection method (e.g. .sel) to extract batches from subsequent datasets (e.g. 20m and 60m).
>
> Underlying my proposal here is that I think we should be open to developing additional generators in Xbatcher. Today we have just BatchGenerator but supporting additional data models (e.g. datatree) or more complex batch indexing may necessitate new generators.

The BboxBatchGenerator proposal sounds like a good plan. The current BatchGenerator is overloaded with many options, and from a user perspective, it might be cleaner to have a separate, dedicated class to handle multi-resolution DataTree objects.

From a developer/maintainer perspective, should this new BboxBatchGenerator be subclassed from BatchGenerator? Or will this just make generators.py way too complicated? Trying to think ahead on what's ideal if we'll have multiple ???BatchGenerators that each approach chip sampling a little differently.

weiji14 commented 1 year ago

Thanks @maxrjones for pointing me to that satpy <-> datatree integration thread at https://github.com/pytroll/satpy/issues/2352 started by @djhoese. If that design proposal goes ahead, it sounds like we might be able to let satpy handle the multi-resolution data fusion/stacking part into an xarray.DataTree structure (?), and let xbatcher focus on just the batch generation/slicing/cropping part. Or if we can find a way to integrate satpy and xbatcher more closely, @djhoese's suggestion of using Satpy's Scene.crop at https://github.com/xarray-contrib/xbatcher/issues/93#issuecomment-1243025054 might well be something we can wrap in BboxBatchGenerator (at least for the 2D case).

I'll have some time this week to experiment on how xbatcher might work with datatree, will open a PR once I get something working.

ljstrnadiii commented 8 months ago

@weiji14 have you arrived at some functional approach yet? Did datatree help you here?

For my use case, I am willing to resample all datasets to a resolution with a common divisor so that we can implement simpler/faster index logic with .isel instead of .sel (which would be much slower afaik).
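The common-divisor idea could be sketched as follows (`common_resolution` is a hypothetical helper, shown only to illustrate the index arithmetic):

```python
import math

def common_resolution(resolutions):
    """Greatest common divisor of the layer resolutions (in metres).

    Resampling every layer onto this grid means one geographic window
    maps to an integer pixel count on each layer, so batches can be
    cut with fast positional .isel steps instead of label-based .sel.
    """
    return math.gcd(*resolutions)

res = common_resolution([10, 20, 60])
print(res)                                # 10
print([r // res for r in [10, 20, 60]])   # integer step per layer: [1, 2, 6]
```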

weiji14 commented 8 months ago

Since datatree will be upstreamed to xarray (see https://github.com/pydata/xarray/issues/8572), it might be that things will get easier soon, but I don't know when. For your use case though, check out the datatree_slice_generator function I've implemented in the proof-of-concept PR at https://github.com/xarray-contrib/xbatcher/pull/171, it also relies on .isel, and might be good enough for basic use.

ljstrnadiii commented 8 months ago

> Since datatree will be upstreamed to xarray (see https://github.com/pydata/xarray/issues/8572), it might be that things will get easier soon, but I don't know when. For your use case though, check out the datatree_slice_generator function I've implemented in the proof-of-concept PR at https://github.com/xarray-contrib/xbatcher/pull/171, it also relies on .isel, and might be good enough for basic use.

Awesome, I'll take a look! Supporting multiple resolutions in both space and time with this datatree approach is likely the route we'll take, but I've found the for loop/generator approach in xbatcher to be a bit slow; previously I've relied on numpy's as_strided with a map_blocks approach. However, chunks are likely not going to spatially align between different-resolution datasets, so I'll have to think that through and implement a map-slices-on-coords approach.
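For reference, the strided-window idea can be sketched with numpy's sliding_window_view, the safer wrapper around as_strided (a toy single-resolution sketch, not a chunk-aware implementation):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# An 8x8 array; take all 4x4 windows at stride 4 without copying,
# the zero-copy alternative to a Python for-loop chip generator.
arr = np.arange(64).reshape(8, 8)
windows = sliding_window_view(arr, (4, 4))[::4, ::4]
print(windows.shape)  # (2, 2, 4, 4): a 2x2 grid of 4x4 chips
```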