xgcm / xhistogram

Fast, flexible, label-aware histograms for numpy and xarray
https://xhistogram.readthedocs.io
MIT License
90 stars 20 forks source link

Trouble creating histogram labels for each 2D array in time series #10

Closed rbavery closed 5 years ago

rbavery commented 5 years ago

Hi @rabernat thanks for the suggestion on this stack overflow post: https://stackoverflow.com/questions/57419541/how-to-calculate-histogram-bins-for-each-image-in-an-xarray-dataarray-time-serie. I'm posting here since Stack Overflow's strict character limit cut out some of the code blocks below.

I've tried out the example with what I think are the correct arguments like so

import numpy as np
import xarray as xr
def tseries_histogram_labels(data_arr, rmin, rmax, nbins):
    bin_arr = np.linspace(rmin, rmax, nbins)
    time_len = data_arr.time.shape[0]
    bin_arrs = list(bin_arr * np.ones((time_len,1)))
    return histogram(*data_arr, bins=bin_arrs, dim=['x','y'])
t_series['reflectance'].name = 'reflectance'
tseries_histogram_labels(t_series['reflectance'].sel(band=1), -4, 4, 50)

where t_series.sel(blue=1) is a DataArray with time, x, and y dimensions. Calling *data_arr in the return statement unpacks the DataArray along the time dimension into separate DataArrays, if this is not done there is a length mismatch between the bin array list and the DataArray.

The list of bin edge arrays and the list of Data Arrays match, and I want to take the histogram of each Data Array. I've tried with and without specifying dim=['x', 'y'], but I get the same error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
      7     return histogram(*data_arr, bins=bin_arrs, dim=['x','y'])
      8 t_series['reflectance'].name = 'reflectance'
----> 9 tseries_histogram_labels(t_series['reflectance'].sel(band=1), -4, 4, 50)

 in tseries_histogram_labels(data_arr, rmin, rmax, nbins)
      5     time_len = data_arr.time.shape[0]
      6     bin_arrs = list(bin_arr * np.ones((time_len,1)))
----> 7     return histogram(*data_arr, bins=bin_arrs, dim=['x','y'])
      8 t_series['reflectance'].name = 'reflectance'
      9 tseries_histogram_labels(t_series['reflectance'].sel(band=1), -4, 4, 50)

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/xarray.py in histogram(bins, dim, weights, density, block_size, bin_dim_suffix, bin_edge_suffix, *args)
    127 
    128     h_data = _histogram(*args_data, weights=weights_data, bins=bins, axis=axis,
--> 129                         block_size=block_size)
    130 
    131     # create output dims

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in histogram(bins, axis, weights, density, block_size, *args)
    245     h = _histogram_2d_vectorized(*all_args_reshaped, bins=bins,
    246                                  weights=weights_reshaped,
--> 247                                  density=density, block_size=block_size)
    248 
    249     if h.shape[0] == 1:

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in _histogram_2d_vectorized(bins, weights, density, right, block_size, *args)
    117     # product of the bins gives the joint distribution
    118     if N_inputs > 1:
--> 119         bin_indices = ravel_multi_index(each_bin_indices, hist_shapes)
    120     else:
    121         bin_indices = each_bin_indices[0]

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/duck_array_ops.py in ravel_multi_index(multi_index, dims, order)
     54                                  for n in range(1, len(dims))]
     55     full_index = [of * ix for of, ix in zip(offset_factors, multi_index)]
---> 56     return sum(full_index)
     57 
     58 

TypeError: unsupported operand type(s) for +: 'int' and 'Array'

I'm not sure if this is a bug or if I'm doing something wrong. I tried to create a minimal example that doesn't require a file, but ran into a different issue, https://github.com/xgcm/xhistogram/issues/9

Happy to make this an example of how to use xhistogram with dask if we can sort this out and any feedback is appreciated!

rabernat commented 5 years ago

Thanks for the bug report @rbavery! Can you try to update this example given the suggestions I made in #9 and see if you can get any closer?

I appreciate your patience and willingness to work with us on this new package.

rbavery commented 5 years ago

Thanks @rabernat it works as expected. but should the time coordinates be dropped?

[88]

import numpy as np
import xarray as xr
t_series['reflectance'].name = 'reflectance'
bin_arr = np.linspace(rmin, rmax, nbins)
result = histogram(t_series['reflectance'].sel(band=1), bins=[bin_arr], dim=['x','y'])

[89]

result
<xarray.DataArray 'histogram_reflectance' (time: 44, reflectance_bin: 49)>
dask.array<shape=(44, 49), dtype=int64, chunksize=(1, 49)>
Coordinates:
  * reflectance_bin  (reflectance_bin) float64 -3.918 -3.755 ... 3.755 3.918
Dimensions without coordinates: time

the original xarray has time coordinates

[92]

t_series['reflectance'].sel(band=1)
<xarray.DataArray 'reflectance' (time: 44, y: 1082, x: 1084)>
dask.array<shape=(44, 1082, 1084), dtype=uint16, chunksize=(1, 1082, 1084)>
Coordinates:
    band     int64 1
  * y        (y) float64 9.705e+05 9.705e+05 9.705e+05 ... 9.673e+05 9.672e+05
  * x        (x) float64 4.889e+05 4.889e+05 4.889e+05 ... 4.922e+05 4.922e+05
  * time     (time) datetime64[ns] 2018-10-12 2018-10-16 ... 2019-05-26
Attributes:
    transform:   (3.0, 0.0, 488907.0, 0.0, -3.0, 970494.0)
    crs:         +init=epsg:32630
    res:         (3.0, 3.0)
    is_tiled:    1
    nodatavals:  (1.0, 1.0, 1.0, 1.0)
    scales:      (1.0, 1.0, 1.0, 1.0)
    offsets:     (0.0, 0.0, 0.0, 0.0)
rabernat commented 5 years ago

Thanks @rabernat it works as expected. but should the time coordinates be dropped?

This was a bug that was fixed in #8. We need a new release.