weiji14 opened 2 years ago
I just wanted to throw out Satpy's Scene.crop as an example of something very similar to this request. The meat of the logic is here:
Basically, Satpy uses the pyresample library's AreaDefinition and SwathDefinition objects to define the region/bounds and CRS of the data. These objects have the necessary logic to say "I have this lon/lat (or x/y) bounding box, what are the slicing indexes for your Area?". We take care to grab the coarsest-resolution AreaDefinition as our base reference for what the slices should be, and then do the necessary arithmetic on the indexes to make them work for the finer resolutions. This way, even if the bounding box/sub-region would result in tighter (more accurate) slices for the finer resolutions alone, we make all the resolutions cover the same geographic area so calculations can be done between the resolutions (i.e. our result is never half a coarse pixel).
This logic makes sense for most satellite-based sensors we work with (GOES-R ABI, VIIRS, etc.). Other sensors like MODIS don't follow this nice even alignment of the pixels, but that is typically ignored for most of the imagery work we do.
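To illustrate the index arithmetic being described (a toy example with made-up numbers, not Satpy's actual code):

# Slices are computed once on the coarsest (reference) grid...
coarse_rows = slice(10, 30)   # row slice on a 1000 m reference grid
factor = 2                    # 1000 m -> 500 m
# ...then scaled, rather than recomputed, for each finer grid, so every
# resolution covers exactly the same geographic area.
fine_rows = slice(coarse_rows.start * factor, coarse_rows.stop * factor)
assert fine_rows == slice(20, 60)  # twice the pixels, identical footprint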
This is not an area I have all that much experience in, but after carefully reading this issue, I have a straw-man proposal to offer. Before I get to my proposal, I'll offer a few bits of perspective that may be useful to ground the discussion here.
xobj.isel is sufficient for selecting batches. This works because DataArrays and Datasets do a fairly good job of enforcing alignment. I think there is clearly value in this approach and we should keep it. My thought is to build a new batch generator class that works with DataTree objects but uses one part of the datatree as its reference for batching the others. I think it could be quite similar to the concept @djhoese introduced with Satpy's Scene.crop method.
Imagine if you had a class like this:

from typing import Iterator
from datatree import DataTree

class BboxBatchGenerator:
    def __init__(self, dt: DataTree, ref: str, input_dims, **kwargs):
        ...

    def __iter__(self) -> Iterator[DataTree]:
        ...
gen = BboxBatchGenerator(dt, ref='10m')
Under the hood, you may imagine using the existing Xbatcher generator for the reference dataset (e.g. 10m), then using a more complex selection method (e.g. .sel) to extract batches from the subsequent datasets (e.g. 20m and 60m); a rough sketch is given below.

Underlying my proposal here is that I think we should be open to developing additional generators in Xbatcher. Today we have just BatchGenerator, but supporting additional data models (e.g. datatree) or more complex batch indexing may necessitate new generators.
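A rough sketch of that under-the-hood idea (node names, chip size, and ascending coordinates are all assumptions; this is not an implemented API):

import xbatcher

# Batch indexes come from the reference node only...
ref_gen = xbatcher.BatchGenerator(dt["10m"].to_dataset(), input_dims={"x": 128, "y": 128})
for batch_10m in ref_gen:
    # ...then label-based selection pulls the matching footprint from the
    # coarser nodes (descending y coordinates would need reversed bounds).
    x_lo, x_hi = batch_10m.x.min().item(), batch_10m.x.max().item()
    y_lo, y_hi = batch_10m.y.min().item(), batch_10m.y.max().item()
    batch_20m = dt["20m"].to_dataset().sel(x=slice(x_lo, x_hi), y=slice(y_lo, y_hi))
    batch_60m = dt["60m"].to_dataset().sel(x=slice(x_lo, x_hi), y=slice(y_lo, y_hi))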
Just noting that there's a PR on upstreaming datatree into xarray at https://github.com/pydata/xarray/pull/7418. If/when that PR is merged and a new xarray release is made, any code depending on datatree should use from xarray import DataTree instead of from datatree import DataTree.
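A common compatibility shim for that transition (a generic pattern, not something prescribed in this thread) would be:

# Prefer the upstreamed class, fall back to the standalone package
try:
    from xarray import DataTree
except ImportError:
    from datatree import DataTree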
The BboxBatchGenerator proposal sounds like a good plan. The current BatchGenerator is overloaded with many options, and from a user perspective, it might be cleaner to have a separate, dedicated class to handle multi-resolution DataTree objects.

From a developer/maintainer perspective, should this new BboxBatchGenerator be subclassed from BatchGenerator? Or will this just make generators.py way too complicated? Trying to think ahead on what's ideal if we'll have multiple ???BatchGenerators that each approach chip sampling a little differently.
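To make that trade-off concrete, here's one hypothetical layout (purely illustrative, not an implemented design) where BboxBatchGenerator wraps the existing BatchGenerator by composition instead of subclassing it:

import xbatcher
from datatree import DataTree

class BboxBatchGenerator:
    def __init__(self, dt: DataTree, ref: str, input_dims: dict):
        # Reuse the existing generator for the index logic on the reference
        # node, keeping its many options out of this class's signature
        self._gen = xbatcher.BatchGenerator(dt[ref].to_dataset(), input_dims)
        self._dt = dt

    def __iter__(self):
        for batch in self._gen:
            yield batch  # TODO: also yield the matching slices of other nodes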
Thanks @maxrjones for pointing me to that satpy <-> datatree integration thread at https://github.com/pytroll/satpy/issues/2352 started by @djhoese. If that design proposal goes ahead, it sounds like we might be able to let satpy handle the multi-resolution data fusion/stacking part into an xarray.DataTree structure (?), and let xbatcher focus on just the batch generation/slicing/cropping part. Or, if we can find a way to integrate satpy and xbatcher more closely, @djhoese's suggestion of using Satpy's Scene.crop at https://github.com/xarray-contrib/xbatcher/issues/93#issuecomment-1243025054 might well be something we can wrap around in BboxBatchGenerator (at least for the 2D case).
I'll have some time this week to experiment with how xbatcher might work with datatree, and will open a PR once I get something working.
@weiji14 have you arrived at some functional approach yet? Did datatree help you here?
For my use case, I am willing to resample all datasets to a resolution with a common divisor so that we can implement simpler/faster index logic with .isel instead of .sel (which would be much slower afaik).
Since datatree will be upstreamed to xarray (see https://github.com/pydata/xarray/issues/8572), it might be that things will get easier soon, but I don't know when. For your use case though, check out the datatree_slice_generator function I've implemented in the proof-of-concept PR at https://github.com/xarray-contrib/xbatcher/pull/171; it also relies on .isel, and might be good enough for basic use.
Awesome, I'll take a look! Being able to support multiple resolutions in both space and time with this datatree approach is likely the approach we'll take, but I have found the for-loop/generator approach in xbatcher to be a bit slow; previously I have relied on numpy's as_strided via a map_blocks approach. However, chunks are likely not going to spatially align between the different-resolution datasets, so I'll have to think that through and implement a map-slices-on-coords approach.
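For what it's worth, a minimal numpy-only sketch of that strided-window idea (the dask map_blocks wiring and the coordinate alignment are left out; the window size is arbitrary):

import numpy as np

x = np.arange(16).reshape(4, 4)
# All overlapping 2x2 windows as a zero-copy view, shape (3, 3, 2, 2);
# sliding_window_view is the safer wrapper around as_strided.
windows = np.lib.stride_tricks.sliding_window_view(x, (2, 2))
patches = windows.reshape(-1, 2, 2)  # flatten the window grid into a batch axis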
Is your feature request related to a problem?
To enable chipping/batching datasets with different spatial resolutions, each dataset (either an xarray.DataArray or xarray.Dataset) currently needs to be sliced separately in xbatcher v0.1.0. The key limitation is that xbatcher assumes every xarray.DataArray 'layer' to have the same resolution, and xbatcher.BatchGenerator would use xarray's .isel method to index and slice along the specified dimensions: https://github.com/xarray-contrib/xbatcher/blob/72ce00f2b5c9af71f102c5e6543cef5a78e15e35/xbatcher/generators.py#L41-L43
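For concreteness, a minimal sketch of that single-resolution assumption (variable names and sizes are made up):

import numpy as np
import xarray as xr
import xbatcher

# Both variables share one grid, so a single set of .isel slices works
ds = xr.Dataset(
    {"red": (("y", "x"), np.zeros((100, 100))),
     "nir": (("y", "x"), np.zeros((100, 100)))},
)
for batch in xbatcher.BatchGenerator(ds, input_dims={"x": 10, "y": 10}):
    ...  # each batch is a 10x10 Dataset, sliced via .isel under the hood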
However, this is not always the case, for example:
Describe the solution you'd like
Ideally, there would be:
1. A way to store multi-resolution datasets, which datatree might be able to handle, i.e. have each data layer with a different resolution be on a separate node of the datatree.
2. A way for xbatcher to slice these multi-resolution datasets. Maybe DataTree.isel could work (see the sketch below)?
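As a rough sketch of why a plain DataTree.isel may not be enough (Sentinel-2-like resolutions assumed for illustration): the same pixel indexer would mean a different ground footprint on every node, so per-node scaling of the slices would still be needed.

# Naive: identical pixel slices on every node => mismatched footprints
chip = dt.isel(x=slice(0, 128), y=slice(0, 128))
# Scaled per node: the same ~1280 m footprint on each resolution
chip_10m = dt["10m"].isel(x=slice(0, 128), y=slice(0, 128))  # 128 px * 10 m
chip_20m = dt["20m"].isel(x=slice(0, 64), y=slice(0, 64))    # 64 px * 20 m
chip_60m = dt["60m"].isel(x=slice(0, 21), y=slice(0, 21))    # 21 px * 60 m = 1260 m, not exact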
Describe alternatives you've considered

Keep xbatcher focused on xarray.DataArray and xarray.Dataset only (and not bring in xarray.DataTree). Users would then need to implement their own ad-hoc way of slicing multi-resolution datasets.

Additional context
There was some discussion before at https://github.com/microsoft/torchgeo/issues/279 about sampling in pixel/image units or in coordinate reference system (CRS) units. If working with multi-resolution datasets though, sampling in pixel/image units would require some math (e.g. 20 pixels on a 500 m resolution grid would be 10 pixels on a 1000 m resolution grid). The CRS-based indexing method, however, would require something like https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray.raster_dataset.RasterDataset.clip_box.
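For instance, a hedged sketch of the CRS-based route using rioxarray's clip_box (file names and bounds are placeholders):

import rioxarray  # noqa: F401 -- registers the .rio accessor
import xarray as xr

ds_10m = xr.open_dataset("band04_10m.tif", engine="rasterio")
ds_20m = xr.open_dataset("band05_20m.tif", engine="rasterio")
bbox = dict(minx=300000, miny=5090000, maxx=305120, maxy=5095120)
chip_10m = ds_10m.rio.clip_box(**bbox)  # 512 x 512 pixels at 10 m
chip_20m = ds_20m.rio.clip_box(**bbox)  # 256 x 256 pixels at 20 m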
Other references: