xiejx5 / baseflow

12 baseflow separation methods with automatic parameter calibration
20 stars 9 forks source link

Multiple baseflow separation methods fails if data contains NaNs #1

Closed wknoben closed 7 months ago

wknoben commented 7 months ago

If any NaNs are present, the returned baseflow estimates contain only NaNs for all methods apart from Fixed and Slide, and KGE values will be undefined for these 10 methods. MWE:

import baseflow
import numpy as np
Q, date = baseflow.load_streamflow(baseflow.example)
Q[1] = np.nan
b, KGEs = baseflow.separation(Q, date, area=276)
print(b)

[(nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan) (nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan) (nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan) ... (nan, nan, 0.92596, 1.39319, nan, nan, nan, nan, nan, nan, nan, nan) (nan, nan, 0.92596, 1.39319, nan, nan, nan, nan, nan, nan, nan, nan) (nan, nan, 1.39319, 1.39319, nan, nan, nan, nan, nan, nan, nan, nan)]

Depending on what is considered intended behaviour, user-friendliness would be improved if either:

  1. A warning is implemented to detect NaNs, and possibly stop baseflow separation execution if so; or,
  2. Perform separation on a temporary interpolated flow vector, and replace any estimated baseflow values with NaN after:
def separation(Q, date=None, area=None, ice=None, method='all', return_kge=True):
    Q = np.array(Q)
    nan_mask = np.isnan(Q)
    indices = np.arange(len(Q))
    Q_interp = np.interp(indices, indices[~nan_mask], Q[~nan_mask])

    [..]

    for m in method:
        if m == 'UKIH':
            b[m] = UKIH(Q_interp, b_LH)
            b[nan_mask] = np.nan

    [..]

Nuances with approach 2 are that the edge values before and after NaNs are not necessarily accurate, and that the KGE function may need further updates to deal with NaNs in either time series.

xiejx5 commented 7 months ago

Thanks a lot for the suggestion. I will solve this issue as soon as possible.

xiejx5 commented 7 months ago

Hi Dr. Knoben,

I have released version 0.0.9, which forces clean_streamflow before separation.

In the previous version, the load_streamflow function was required to be executed first, which implicitly ran clean_streamflow. The new version eliminates the need for load_streamflow and instead accepts a Pandas dataframe as input (the updated ReadMe).

However, current clean_streamflow removes missing values directly, instead of interpolating. Interpolation may not work well with long missing sequences. Given that digital filter would converge to similar values regardless of the initial value, the direct removal of missing values may not substantially impact the separation result.

Thanks again for your suggestion, and I would appreciate your thoughts on the updated version.

Sincerely, Jiaxin

wknoben commented 7 months ago

Hi Jiaxin,

Thanks for looking into this!

It seems to me that both approaches (drop missing values and interpolate them) have the same issue in the sense that the values at the edges (just before and after the missing values) won't be very accurate. So presumably either is as good as things are going to get.

Thanks for letting me contribute further thoughts. My suggestions would be:

  1. Let the user know (somehow) that their data is being cleaned. This can prevent confusion I think.
  2. I expect a common way of using the outputs from your tool is to plot the original hydrograph together with the estimated baseflow part of it. Will this work with the data cleaning approach? For example, if I feed a year of daily data with a 100 NaNs, it would be very convenient if the baseflow time series I get back still has length 365 with the NaNs in the right places, instead of having length 265. Is this currently the case?
xiejx5 commented 7 months ago

Thanks for your prompt response. The new version keeps the NaNs in the right places and returns a dataframe with the same shape as input streamflow. However, to ensure accuracy of recession constant estimates, the clean_streamflow would also drop years with streamflow records less than 120 days, resulting in additional NaNs.

Regarding the first thoughts, I will add a comment to the example code in the next release version. Thanks again for your feedback!