xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0

[Enhancement]: time subsetting/slicing on dataset with non-chronological time coordinate #693

Closed bosup closed 2 days ago

bosup commented 1 week ago

Is your feature request related to a problem?

The time-slicing method, e.g., ds.sel(time=slice("2099-01-01", "2100-12-31")),

doesn't work on datasets whose time coordinates are corrupted or not in chronological order. For example, va_day_NorESM1-M_rcp85_r1i1p1_20960101-21001231.nc (data too big to be uploaded) has corrupted time coordinates: […, 34581.5, 34582.5, 34583.5, 34584.5, 34585.5, 34586.5, 34587.5, 34588.5, 34589.5, 34590.5, 0, 0, 0, …, 0] (the tail of the axis is a long run of zeros).

Another example: ua_day_CCSM4_rcp85_r6i1p1_20060101-20091231.nc (data too big to be uploaded):

<xarray.DataArray 'time' (time: 1460)> Size: 12kB
array([cftime.DatetimeNoLeap(2006, 1, 1, 12, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2006, 1, 2, 12, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2006, 1, 3, 12, 0, 0, 0, has_year_zero=True),
       ...,
       cftime.DatetimeNoLeap(2005, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2005, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2005, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
      dtype=object)

Is it possible that, when using the slice method, it could at least return the timesteps that fall within the specified range, rather than failing with an error?

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

pochedls commented 1 week ago

@bosup – are these datasets on disk in the climate program or at NERSC? If so, could you share their paths?

bosup commented 1 week ago

@bosup – are these datasets on disk in the climate program or at NERSC? If so, could you share their paths?

@pochedls the data is at NERSC, /global/cfs/projectdirs/m4581/ARTMIP_CMIP56/example_data/va_day_NorESM1-M_rcp85_r1i1p1_20960101-21001231.nc

pochedls commented 1 week ago

@bosup – you say above you get an error. Can you share the code to produce the error and the error message itself?

It looks like the last 84 time values in this file are 0. Since the units are days since 2006-01-01 00:00:00, these decode to a constant 2006-01-01 00:00:00 (but should run from 2100-10-09 12:00:00 to 2100-12-31 12:00:00). With that said, you can still slice this data by selecting the data indices you want, e.g.,

ds.isel(time=slice(1750, 1824))

Would this work for you?

This file has an obvious data issue. I'm guessing xCDAT probably can't incorporate code generic enough to detect and solve the time issue, but we could potentially help you develop a function for this type of data issue (if the other datasets you have identified have the same problem).

bosup commented 1 week ago

@bosup – you say above you get an error. Can you share the code to produce the error and the error message itself?

It looks like the last 84 time values in this file are 0. Since the units are days since 2006-01-01 00:00:00, this decodes to a constant 2006-01-01 00:00:00 (but should be 2100-10-9 12:00:00 to 2100-12-31 12:00:00). With that said, you can still slice this data by selecting the data indices you want, e.g.,

ds.isel(time=slice(1750, 1824))

Would this work for you?

This file has an obvious data issue. I'm guessing xCDAT probably can't incorporate code generic enough to detect and solve the time issue, but we could potentially help you develop a function for this type of data issue (if the other datasets you have identified have the same problem).

Thanks @pochedls!
The error comes from the .sel method rather than .isel:

import xcdat as xc
ds = xc.open_dataset("va_day_NorESM1-M_rcp85_r1i1p1_20960101-21001231.nc")
ds.sel(time=slice("2071-01-01","2100-12-31"))

The error message is:

Traceback (most recent call last):
  File "/global/homes/d/user/miniconda3/envs/PMP/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 191, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 234, in pandas._libs.index.IndexEngine._get_loc_duplicates
  File "index.pyx", line 242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer
  File "index.pyx", line 134, in pandas._libs.index._unpack_bool_indexer
KeyError: cftime.DatetimeNoLeap(2071, 1, 1, 0, 0, 0, 0, has_year_zero=True)

lee1043 commented 1 week ago

@bosup I think @pochedls suggested .isel as an alternative to .sel because the time coordinate of this data is not correct -- so the issue is a data issue rather than an xcdat issue. In such a case the time coordinate needs to be regenerated after the data is loaded into memory. The challenge is how to handle that systematically, for which I don't yet have a good idea.

pochedls commented 4 days ago

From our discussion:

import xarray as xr

# Regenerate the expected daily time axis and overwrite the corrupted one
time = xr.coding.cftime_offsets.date_range("2096-01-01", "2100-12-31", freq="D", calendar="noleap")
ds['time'] = time

Useful bit of code:

def diddle(ds):
    # Probe whether label-based time slicing works; a cftime-related
    # KeyError signals a corrupted / non-monotonic time axis.
    try:
        ds.sel(time=slice("2096-01-01", "2096-12-31"))
    except KeyError as e:
        if "cftime" in str(e):
            print("corrupted time axis detected")

lee1043 commented 4 days ago

It looks like xarray.cftime_range wraps xr.coding.cftime_offsets.date_range.

lee1043 commented 2 days ago

Hi @bosup,

I made a quick custom function for your data issue that can detect problems in a daily time axis. I recall you have to loop through many models and want to check the data systematically without losing too much computational efficiency. The function below checks whether the days in the time axis increase incrementally. Can you see if this helps your workflow?

import numpy as np
import cftime

def daily_time_axis_checker(ds, time_key="time"):
    """
    Check if the time axis in an xarray dataset follows a correct daily sequence, considering all CFTime calendars.

    Parameters
    ----------
    ds : xarray.Dataset or xarray.DataArray
        The dataset or data array containing the time axis to be checked.
    time_key : str, optional
        The key corresponding to the time dimension in the dataset (default is 'time').

    Returns
    -------
    bool
        True if the time axis has incrementally increasing days, otherwise False.

    Raises
    ------
    ValueError
        If the time axis does not use CFTime objects.

    Example
    -------
    >>> ds = xr.Dataset({"time": xr.cftime_range("2000-01-01", periods=400, freq="D", calendar="gregorian")})  # dummy data to test
    >>> daily_time_axis_checker(ds, "time")
    True
    """
    # Extract the time values from the dataset
    time_data = ds[time_key].values

    # Check if the time axis is based on CFTime objects
    if not isinstance(time_data[0], cftime.datetime):
        raise ValueError(f"The time axis does not use CFTime objects (uses {type(time_data[0])}).")

    # Walk over consecutive time points and compare day-to-day deltas
    for i in range(1, len(time_data)):
        # Difference between consecutive cftime points (a datetime.timedelta)
        delta = time_data[i] - time_data[i - 1]

        # Check if the difference is exactly 1 day (as a timedelta64 object)
        if np.abs(delta) != np.timedelta64(1, 'D'):
            print('Time axis is not correct!')
            print(f'Error at index {i}: {time_data[i - 1]} to {time_data[i]}')
            return False

    return True

Use this function on your dataset:

daily_time_axis_checker(ds, "time")

Its result is False, because the data has an incorrect time axis:

Time axis is not correct!
Error at index 1741: 2100-10-08 12:00:00 to 2006-01-01 00:00:00
False

Proposed routine for your workflow -- if any data with the same issue is detected, regenerate the time axis and its bounds:

if not daily_time_axis_checker(ds):
    print('regenerating time axis and bounds')
    # regenerate the daily time axis (range currently hard-coded for this file)
    time = xr.cftime_range("2096-01-01", "2100-12-31", freq="D", calendar=ds.time.dt.calendar)
    ds['time'].values[:] = time
    # rebuild time bounds from the corrected axis
    ds = ds.drop_vars(["time_bnds"]).bounds.add_missing_bounds(["T"])

As this is a data issue rather than an xcdat issue, I am closing this issue, but please feel free to reopen it if you still run into related problems.

lee1043 commented 2 days ago

@bosup And if the above code works well for you I am going to implement it to PMP.

bosup commented 1 day ago

@bosup And if the above code works well for you I am going to implement it to PMP.

Thank you @lee1043, this is very useful for detecting data with bad time coordinates! It's probably not an ultimate solution for fully automating a workflow that loops through a list of data files, since the time range has to be specified manually (e.g., "2096-01-01", "2100-12-31"). But as we discussed, it is a data issue. Thanks for the workaround!
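For what it's worth, the manually specified range might be derivable from the filename itself, since CMIP-style filenames end in a YYYYMMDD-YYYYMMDD span before the .nc suffix. A hypothetical helper (dates_from_cmip_filename is not an existing API, just a sketch for the loop-over-files workflow):

```python
import re

def dates_from_cmip_filename(path):
    """Pull the trailing YYYYMMDD-YYYYMMDD span from a CMIP-style filename
    and return the start/end as ISO date strings."""
    m = re.search(r"(\d{8})-(\d{8})\.nc$", path)
    if m is None:
        raise ValueError(f"No date span found in {path!r}")

    def fmt(s):
        # "20960101" -> "2096-01-01"
        return f"{s[:4]}-{s[4:6]}-{s[6:]}"

    return fmt(m.group(1)), fmt(m.group(2))

start, end = dates_from_cmip_filename(
    "va_day_NorESM1-M_rcp85_r1i1p1_20960101-21001231.nc")
print(start, end)  # 2096-01-01 2100-12-31
```

The returned strings could then feed xr.cftime_range directly, instead of hard-coding the range per file.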