xgcm / xhistogram

Fast, flexible, label-aware histograms for numpy and xarray
https://xhistogram.readthedocs.io
MIT License

Bug with density calculation when NaNs are present #51

Closed · dougiesquire closed 3 years ago

dougiesquire commented 3 years ago

There is a bug in the way histograms are normalised to densities that manifests when there are NaNs in the input data:

import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
from xhistogram.xarray import histogram

data = np.random.normal(size=(10, 2))
# Add some NaNs at random positions
N_nans = 6
data.ravel()[np.random.choice(data.size, N_nans, replace=False)] = np.nan
bins = np.linspace(-5, 5, 5)
bin_centers = 0.5 * (bins[:-1] + bins[1:])

# np.histogram -----
h, _ = np.histogram(data[:,0], bins, density=True)
plt.plot(bin_centers, h, label='numpy histogram')

# xhistogram -----
da = xr.DataArray(
    data, dims=['s', 'x'],
    coords=[range(data.shape[0]),
            range(data.shape[1])]).rename('test')
h2 = histogram(da, bins=[bins], dim=['s'], density=True)
plt.plot(bin_centers, h2[0, :], linestyle='--', label='xhistogram')

plt.legend()
plt.xlabel('bins')
plt.ylabel('pdf')
[Figure: plot of the two density estimates; the numpy and xhistogram curves disagree when NaNs are present]

This bug arises when there are dimensions that are not being histogrammed ("bystander" dimensions). Currently we sum over all axes to estimate the area/volume of our histogram and then account for the bystander dimensions in a secondary step. However, this can produce incorrect results when NaNs are present, because there may be a different number of NaNs along each bystander dimension.
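To make the failure mode concrete, here is a minimal NumPy sketch (not xhistogram's internals) contrasting the two normalisation strategies. It assumes a 2D array whose second axis is the bystander dimension, with a different number of NaNs in each slice; `np.histogram` silently drops NaNs, so each slice has a different effective sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(10, 2))
# Different number of NaNs in each bystander slice (column)
data[:3, 0] = np.nan
data[:1, 1] = np.nan

bins = np.linspace(-5, 5, 5)
widths = np.diff(bins)

# Raw counts per bystander slice; NaNs fall in no bin and are dropped
counts = np.stack(
    [np.histogram(data[:, i], bins)[0] for i in range(data.shape[1])]
)

# Buggy strategy: normalise by the count summed over ALL axes,
# implicitly assuming every slice holds the same number of valid samples
buggy = counts / (counts.sum() * widths) * data.shape[1]

# Correct strategy: normalise each slice by its own valid-sample count
correct = counts / (counts.sum(axis=1, keepdims=True) * widths)

# The per-slice normalisation matches np.histogram(..., density=True)
for i in range(data.shape[1]):
    expected, _ = np.histogram(data[:, i], bins, density=True)
    assert np.allclose(correct[i], expected)
```

With no NaNs (or the same number per slice) the two strategies coincide, which is why the bug only manifests for NaN-containing input.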