Open weiji14 opened 1 year ago
Thanks for opening this issue @weiji14! Great idea for a refactor to simplify the code base, promote new contributions, and help solve the web of existing issues!
I think when using `concat_input_dims=False`, the division between `Slicer` and `Batcher` that you suggested makes a lot of sense and would be relatively simple to decouple (at least for those who've spent the time getting familiar with the current implementation).
When using `concat_input_dims=True`, it's a bit more complicated because `batch_dims` can impact slicing. Specifically, the input dataset is sliced on the union of `input_dims` and `batch_dims` in that case. There are a few options to account for this:

1. Always slice on `batch_dims`, even when `concat_input_dims==True` (i.e., `batch_dims` would need to also be included in `Slicer`)
2. Share `batch_dims` between the `Slicer` and `Batcher` components
3. A separate `Batcher` for this edge case

I expect that option 3 (a separate component for this edge case) would make the most sense. I'll work on this a bit now.
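For context on why this case is trickier, here is a toy sketch (purely illustrative dimension names and sizes, not xbatcher's actual implementation) of slicing on the union of `input_dims` and `batch_dims`:

```python
from itertools import product

def make_slices(dim_sizes, slice_sizes):
    """Toy helper: tile each dimension named in slice_sizes into
    consecutive, non-overlapping index slices of the given size."""
    per_dim = []
    for dim, size in slice_sizes.items():
        starts = range(0, dim_sizes[dim] - size + 1, size)
        per_dim.append([(dim, slice(s, s + size)) for s in starts])
    for combo in product(*per_dim):
        yield dict(combo)

# With concat_input_dims=True, the dataset is sliced on the union of
# input_dims and batch_dims (hypothetical dims/sizes for illustration):
input_dims = {"x": 2, "y": 2}
batch_dims = {"time": 1}
union = {**input_dims, **batch_dims}
pieces = list(make_slices({"x": 4, "y": 4, "time": 2}, union))
print(len(pieces))  # 8 = 2 x-slices * 2 y-slices * 2 time-slices
```

The point being that a `Slicer` driven only by `input_dims` would miss the `time` dimension here, so `batch_dims` has to enter the slicing step somehow.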
I think this setup would mimic what I'm doing now with my rolling/batching scheme outside of xbatcher. The important thing there is that I can explicitly control the batch sizes, even with predicates involved.
I think if we include predicates though, we need to have a map that can "unbatch" the results because the map may not be straightforward, especially if there are overlaps between the result chips. See #43
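As a rough sketch of such a map (hypothetical helper names, not anything from xbatcher or #43), each batch can carry the indices of its source pieces so that results can be scattered back afterwards:

```python
def batch_with_index_map(pieces, batch_size):
    """Toy illustration: batch pieces while recording which source
    piece each batch element came from, so results can be
    'unbatched' even when the mapping is not straightforward."""
    batches, index_map = [], []
    for start in range(0, len(pieces), batch_size):
        chunk = pieces[start:start + batch_size]
        batches.append(chunk)
        index_map.append(list(range(start, start + len(chunk))))
    return batches, index_map

def unbatch(results, index_map, n_pieces):
    """Scatter per-batch results back into per-piece order."""
    out = [None] * n_pieces
    for batch_result, idxs in zip(results, index_map):
        for r, i in zip(batch_result, idxs):
            out[i] = r
    return out

pieces = ["a", "b", "c", "d", "e"]
batches, imap = batch_with_index_map(pieces, batch_size=2)
results = [[p.upper() for p in b] for b in batches]
print(unbatch(results, imap, len(pieces)))  # ['A', 'B', 'C', 'D', 'E']
```

With predicates or overlapping chips the index map would no longer be this regular, which is exactly why it needs to be stored explicitly rather than recomputed.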
What is your issue?
Current state
Currently, xbatcher v0.3.0's `BatchGenerator` is an all-in-one class that does too many things, and there are more features planned. The 400+ lines of code at https://github.com/xarray-contrib/xbatcher/blob/v0.3.0/xbatcher/generators.py are not easy for people to understand and contribute to without spending a few hours. To make things more maintainable and future-proof, we might need a major refactor.

Proposal
Split `BatchGenerator` into 2 (or more) subcomponents. Specifically:

1. A `Slicer` that does the slicing/subsetting/cropping/tiling/chipping from a multi-dimensional `xarray` object.
2. A `Batcher` that groups together the pieces from the `Slicer` into batches of data.

These are the parameters from the current `BatchGenerator`
that would be handled by each component:

- `Slicer`: `input_dims`, `input_overlap`
- `Batcher`: `batch_dims`, `concat_input_dims`, `preload_batch`
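To illustrate the proposed separation of concerns, here is a minimal sketch of what the two components might look like (toy classes operating on index slices rather than real `xarray` objects; all names and signatures here are assumptions, not a concrete design proposal):

```python
from itertools import product

class Slicer:
    """Toy sketch of the proposed component: yields dicts of index
    slices over input_dims, honouring an optional per-dimension
    input_overlap."""

    def __init__(self, dim_sizes, input_dims, input_overlap=None):
        self.dim_sizes = dim_sizes        # e.g. {"x": 8, "y": 8}
        self.input_dims = input_dims      # chip size per dimension
        self.input_overlap = input_overlap or {}

    def __iter__(self):
        per_dim = []
        for dim, length in self.input_dims.items():
            step = length - self.input_overlap.get(dim, 0)
            starts = range(0, self.dim_sizes[dim] - length + 1, step)
            per_dim.append([(dim, slice(s, s + length)) for s in starts])
        for combo in product(*per_dim):
            yield dict(combo)

class Batcher:
    """Toy sketch: groups the pieces coming out of a Slicer into
    fixed-size batches (a stand-in for the batch_dims/concat logic)."""

    def __init__(self, slicer, batch_size):
        self.slicer = slicer
        self.batch_size = batch_size

    def __iter__(self):
        batch = []
        for piece in self.slicer:
            batch.append(piece)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # emit the final, possibly partial, batch
            yield batch

# 8x8 grid cut into 4x4 chips -> 4 pieces, batched 3 at a time
slicer = Slicer({"x": 8, "y": 8}, input_dims={"x": 4, "y": 4})
print([len(b) for b in Batcher(slicer, batch_size=3)])  # [3, 1]
```

The appeal of this shape is that each half can be tested, replaced, or extended on its own: the `Batcher` only sees an iterable of pieces and doesn't care how they were cut.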
Benefits

- Decoupling `Slicer` and `Batcher` means issues like #158 and #162 could be tackled on the `Batcher` side, or in a step post-`Batcher`.
- Caching (e.g. for #36 and #127) could happen after the `Slicer` but before the `Batcher`.
- The `Slicer` could run in parallel with the `Batcher`. E.g. with a batch_size of 128, the `Slicer` would load data up to 128 chips, pass it on to the `Batcher` to feed to the ML model, while the next round of data processing happens. This is without loading everything into memory.
- Caching could also happen after the `Batcher` when the batches have been generated already. Sometimes though, people might want to set `batch_size` as a hyperparameter in their ML experimentation, in which case the cache should be done after the `Slicer`.
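The parallel `Slicer`/`Batcher` idea above can be sketched with a bounded queue (a generic producer/consumer pattern, not xbatcher code; all names here are made up):

```python
import threading
import queue

def pipeline(n_pieces, batch_size):
    """Toy illustration of the pipelining benefit: a 'Slicer' thread
    produces pieces into a bounded queue while the consumer builds
    batches, so slicing and batching overlap in time and only a
    bounded number of pieces are ever held in memory."""
    q = queue.Queue(maxsize=batch_size)  # bound memory to ~one batch
    SENTINEL = object()

    def slicer():
        for i in range(n_pieces):
            q.put(f"piece-{i}")  # stand-in for loading one chip
        q.put(SENTINEL)          # signal that slicing is done

    threading.Thread(target=slicer, daemon=True).start()

    batches, batch = [], []
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        batch.append(item)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    return batches

print([len(b) for b in pipeline(10, 4)])  # [4, 4, 2]
```

The `maxsize` on the queue is what keeps this from loading everything into memory: the producer blocks once roughly a batch's worth of chips is in flight.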
Cons