xgcm / xhistogram

Fast, flexible, label-aware histograms for numpy and xarray
https://xhistogram.readthedocs.io
MIT License
90 stars 20 forks source link

Refactor approach for making final bin right-edge inclusive #44

Closed dougiesquire closed 3 years ago

dougiesquire commented 3 years ago

Description

Our current approach for making the last bin right-edge inclusive (to align with np.histogram definitions) is a hack stolen from scikit-learn - we simply add a very small increment to the last bin edge prior to digitizing.

Among other problems, this means that xhistogram currently fails for non-float bins (for example, binning over datetime objects, see #25).

This PR implements an alternative, hopefully better, approach.

Type of change

Testing

dougiesquire commented 3 years ago

You'll see that the test I added is failing. This is because np.digitize can't cast data from dtype('<M8[D]') to dtype('float64') (things are actually failing when checking the monoticity of the data within np.digitize).

There seem to be ways to get around this, but I'm looking for advice on best way to do this robustly. Should we only deal with np.datetime64 or try to include other datetime objects also? @WardBrian perhaps you have some suggestions?

I should point out, np.histogram can deal with datetime64 bins, but it uses searchsorted (which assumes sorted data) rather than digitize and sorts the data as a separate step (using np.sort). So an alternative is to refactor xhistogram to use searchsorted as well (though note there is a dask version of digitize but not searchsorted)...

codecov[bot] commented 3 years ago

Codecov Report

Merging #44 (4fdd716) into master (8a6765a) will increase coverage by 11.90%. The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #44       +/-   ##
===========================================
+ Coverage   84.08%   95.98%   +11.90%     
===========================================
  Files           2        2               
  Lines         245      249        +4     
  Branches       74       71        -3     
===========================================
+ Hits          206      239       +33     
+ Misses         34        7       -27     
+ Partials        5        3        -2     
Impacted Files Coverage Δ
xhistogram/core.py 97.40% <100.00%> (+15.68%) :arrow_up:
xhistogram/xarray.py 91.07% <0.00%> (-0.46%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8a6765a...4fdd716. Read the comment docs.

dougiesquire commented 3 years ago

I think that because of our reliance on digitize, it's going to be difficult for us to generally support a range of bin data types. Instead, we might have to expect the user to cast their data to a type that digitize can in turn cast to float64.

Despite this, I think the approach in this PR for including the right edge of the right bin is still better than the old approach and is maybe still worth merging. Thoughts @rabernat?

rabernat commented 3 years ago

Hi Dougie, sorry it has taken so long for me to review this. I am a bit under water right now. Maybe @TomNicholas could provide a review?

dougiesquire commented 3 years ago

Thanks @rabernat, and all good - feeling pretty under water here too.

This PR was motivated by #25, so the test I originally wrote was to compute the histogram of datetime objects. I've added this test back in, but you'll see it's failing. This is because np.digitize internally checks for monotonicity and this function tries to cast everything to float64. I see two options:

cc @TomNicholas

TomNicholas commented 3 years ago

Someone who understands casting dtypes better than me suggests a robust way for us to handle datetime objects,

I have no idea if that's possible or not, but @spencerkclark has done lots with datetimes and might be able to tell us before we put it in xarray?

Refactor _bincount_2d_vectorized to use np.sort + np.searchsorted instead of np.digitize

I think this would be preferable really, and maybe not too difficult if we can copy np.histogram? I've added it to the to-do list on my xarray PR anyway.

(though note there is a dask version of digitize but not searchsorted)

With the blockwise method of dask-ifying we don't need there to be a dask.array.searchsorted - bincount_2d_vectorized will only ever see a single chunk.

Illviljan commented 3 years ago

FYI, dask has added searchsorted recently https://github.com/dask/dask/pull/7696.

dougiesquire commented 3 years ago

(though note there is a dask version of digitize but not searchsorted)

With the blockwise method of dask-ifying we don't need there to be a dask.array.searchsorted - bincount_2d_vectorized will only ever see a single chunk.

Yup, this comment was from before the blockwise refactor. I agree using searchsorted should be pretty straightforward now.

spencerkclark commented 3 years ago

I have no idea if that's possible or not, but @spencerkclark has done lots with datetimes and might be able to tell us before we put it in xarray?

We have fairly well-tested ways of converting datetimes to integers or floats in xarray, which you could use to work around this issue; however, it's a little unfortunate that it is required in this case. Checking for monotonicity -- and I think the digitize operation in general -- should not need to depend on converting to floats. You could consider posting an issue in the NumPy repo about this, but if you can work around it by using searchsorted instead, that would be much cleaner!

dougiesquire commented 3 years ago

You could consider posting an issue in the NumPy repo about this, but if you can work around it by using searchsorted instead, that would be much cleaner!

Great, thanks for the advice @spencerkclark !

dougiesquire commented 3 years ago

I think this is ready for review. @TomNicholas, would you mind taking a quick look?