This is based on a brilliant suggestion by @dcherian, which I have used in multiple projects.
I think it would be very nice to document this little 'trick' to efficiently compute a histogram in 'regions' of an array which are encoded in a numbered mask (this is often how e.g. ocean basins are masked in climate models). Instead of masking out one region at a time with something like this:
for num_region in range(4):
da_masked = da.where(da.mask==num_region)
histogram(da_masked, ...)
One can simply add the mask as a second input variable and with appropriate bins, the results will be segmented into the appropriate regions (saving n-1 computations steps; where n is the number of regions).
I wrote up a quick example:
import numpy as np
import xarray as xr
from xhistogram.xarray import histogram
# make a quadrant mask
mask = np.concatenate([
np.concatenate([np.full((2,2), 1), np.full((2,2), 2)]),
np.concatenate([np.full((2,2), 3), np.full((2,2), 4)]),
],axis=1)
# create data that has similar values but with some added noise
data = (np.random.rand(*mask.shape)-0.5)*0.5+mask
da = xr.DataArray(data, dims=['x','y'], name='data', coords={'mask':xr.DataArray(mask, dims=['x','y'], name='mask')})
da
You can see that this successfully segemented the regions and computed the histogram for each (the data is slightly spread around the actual mask number).
I think this really is a brilliant application case for so many science applications, but it is not obvious untils one tries it (at least it wasnt for me until @dcherian suggested it). So I think it would make a great example for the docs?
Id be happy to add this if ppl think it provides value. Maybe @shanicetbailey is interested in collaborating on this, since this might be very helpful in her current project?
This is based on a brilliant suggestion by @dcherian, which I have used in multiple projects.
I think it would be very nice to document this little 'trick' to efficiently compute a histogram in 'regions' of an array which are encoded in a numbered mask (this is often how e.g. ocean basins are masked in climate models). Instead of masking out one region at a time with something like this:
One can simply add the mask as a second input variable and with appropriate bins, the results will be segmented into the appropriate regions (saving n-1 computations steps; where n is the number of regions).
I wrote up a quick example:
You can see that this successfully segemented the regions and computed the histogram for each (the data is slightly spread around the actual mask number).
I think this really is a brilliant application case for so many science applications, but it is not obvious untils one tries it (at least it wasnt for me until @dcherian suggested it). So I think it would make a great example for the docs?
Id be happy to add this if ppl think it provides value. Maybe @shanicetbailey is interested in collaborating on this, since this might be very helpful in her current project?