xarray-contrib / xbatcher

Batch generation from xarray datasets
https://xbatcher.readthedocs.io
Apache License 2.0
166 stars 27 forks source link

Separate BatchGenerator into standalone Slicer and Batcher components? #172

Open weiji14 opened 1 year ago

weiji14 commented 1 year ago

What is your issue?

Current state

Currently, xbatcher v0.3.0's BatchGenerator is this all-in-one class/function that does too many things, and there are more features planned. The 400+ lines of code at https://github.com/xarray-contrib/xbatcher/blob/v0.3.0/xbatcher/generators.py is not something easy for people to understand and contribute to without spending a few hours. To make things more maintainable and future proof, we might need a major refactor.

Proposal

Split BatchGenerator into 2 (or more) subcomponents. Specifically:

  1. A Slicer that does the slicing/subsetting/cropping/tiling/chipping from a multi-dimensional xarray object.
  2. A Batcher that groups together the pieces from the Slicer into batches of data.

These are the parameters from the current BatchGenerator that would be handled by each component:

Slicer:

Batcher:

Benefits

Cons

maxrjones commented 1 year ago

Thanks for opening this issue @weiji14! Great idea for a refactor to simplify the code base, promote new contributions, and help solve the web of existing issues!

I think when using concat_input_dims=False, the division between Slicer and Batcher that you suggested makes a lot of sense and would be relatively simple to decouple (at least for those who've spent the time getting familiar with the current implementation).

When using concat_input_dims=True, it's a bit more complicated because batch_dims can impact slicing. Specifically, the input dataset is sliced on the union of input_dims and batch_dims in that case. There are a few options to account for this:

  1. Break backwards compatibility by not ever slicing on batch_dims, even when concat_input_dims==True
  2. batch_dims would need to also be included in Slicer
  3. A third component could handle slicing on batch_dims between the Slicer and Batcher components
  4. Additional slicing would happen in Batcher for this edge case

I expect that option 3 (a separate component for this edge case) would make the most sense. I'll work on this a bit now.

cmdupuis3 commented 1 year ago

I think this setup would mimic what I'm doing now with my rolling/batching scheme outside of xbatcher. The important thing there is that I can explicitly control the batch sizes, even with predicates involved.

I think if we include predicates though, we need to have a map that can "unbatch" the results because the map may not be straightforward, especially if there are overlaps between the result chips. See #43