Closed bosup closed 2 days ago
@bosup – are these datasets on disk in the climate program or at NERSC? If so, could you share their paths?
@pochedls the data is at NERSC, /global/cfs/projectdirs/m4581/ARTMIP_CMIP56/example_data/va_day_NorESM1-M_rcp85_r1i1p1_20960101-21001231.nc
@bosup – you say above you get an error. Can you share the code to produce the error and the error message itself?
It looks like the last 84 time values in this file are 0. Since the units are `days since 2006-01-01 00:00:00`, this decodes to a constant 2006-01-01 00:00:00 (but should be 2100-10-09 12:00:00 to 2100-12-31 12:00:00). With that said, you can still slice this data by selecting the data indices you want, e.g., `ds.isel(time=slice(1750, 1824))`. Would this work for you?
This file has an obvious data issue. I'm guessing xCDAT probably can't incorporate code generic enough to detect and solve the time issue, but we could potentially help you develop a function for this type of data issue (if the other datasets you have identified have the same problem).
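One generic detection approach can be sketched with synthetic raw (undecoded) time values that mimic this file — the trailing zeros and their position are taken from the discussion above; the exact numbers are otherwise an assumption:

```python
import numpy as np

# Synthetic raw time values mimicking the file: daily midpoint offsets
# that suddenly drop to a constant 0 for the last 84 entries.
time_raw = np.concatenate([np.arange(0.5, 1741.5), np.zeros(84)])

# A strictly increasing daily axis has positive diffs everywhere;
# the first non-positive diff marks where the corruption starts.
bad = np.where(np.diff(time_raw) <= 0)[0]
first_bad = int(bad[0] + 1) if bad.size else None
print(first_bad)  # → 1741
```

Opening the real file with `decode_times=False` and running the same diff check on its `time` variable would flag the zero tail without any model-specific knowledge.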
Thanks @pochedls! The errors came from the `.sel` method rather than `.isel`:
```python
import xcdat as xc

ds = xc.open_dataset("va_day_NorESM1-M_rcp85_r1i1p1_20960101-21001231.nc")
ds.sel(time=slice("2071-01-01", "2100-12-31"))
```
The error message is:

```
Traceback (most recent call last):
  File "/global/homes/d/user/miniconda3/envs/PMP/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 191, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 234, in pandas._libs.index.IndexEngine._get_loc_duplicates
  File "index.pyx", line 242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer
  File "index.pyx", line 134, in pandas._libs.index._unpack_bool_indexer
KeyError: cftime.DatetimeNoLeap(2071, 1, 1, 0, 0, 0, 0, has_year_zero=True)
```
@bosup I think @pochedls suggested `.isel` as an alternative to `.sel` because the time coordinate of this data is not correct -- which makes this a data issue rather than an xcdat issue. I think in such a case the time coordinate needs to be regenerated after the data is loaded into memory. The challenge is how to handle that systematically, and I don't yet have a good idea for that.
From our discussion:
```python
import xarray as xr

time = xr.coding.cftime_offsets.date_range("2096-01-01", "2100-12-31", calendar="noleap")
ds['time'] = time
```
Useful bit of code:
```python
def diddle(ds):
    try:
        ds.sel(time=slice("2096-01-01", "2096-12-31"))
    except KeyError as e:
        if "cftime" in str(e):
            print('blah')
```
It looks like `xarray.cftime_range` wraps up `xr.coding.cftime_offsets.date_range`.
Hi @bosup,
I made a quick custom function for your data issue that can detect problems in a daily time axis. I recall you have to loop through many models and want to check the data systematically without losing too much computational efficiency. The function below checks whether the days in the time axis increase incrementally. Can you see if this helps your workflow?
```python
import numpy as np
import cftime


def daily_time_axis_checker(ds, time_key="time"):
    """
    Check if the time axis in an xarray dataset follows a correct daily
    sequence, considering all CFTime calendars.

    Parameters
    ----------
    ds : xarray.Dataset or xarray.DataArray
        The dataset or data array containing the time axis to be checked.
    time_key : str, optional
        The key corresponding to the time dimension in the dataset
        (default is 'time').

    Returns
    -------
    bool
        True if the time axis has incrementally increasing days, otherwise False.

    Raises
    ------
    ValueError
        If the time axis does not use CFTime objects.

    Example
    -------
    >>> ds = xr.Dataset({"time": xr.cftime_range("2000-01-01", periods=400, freq="D", calendar="gregorian")})  # dummy data to test
    >>> daily_time_axis_checker(ds, "time")
    True
    """
    # Extract the time values from the dataset
    time_data = ds[time_key].values

    # Check if the time axis is based on CFTime objects
    if not isinstance(time_data[0], cftime.datetime):
        raise ValueError(
            f"The time axis does not use CFTime objects (uses {type(time_data[0])})."
        )

    # Walk the axis and compare consecutive time points
    for i in range(1, len(time_data)):
        # The difference between consecutive cftime objects is a timedelta
        delta = time_data[i] - time_data[i - 1]
        # Check if the difference is exactly 1 day
        if np.abs(delta) != np.timedelta64(1, 'D'):
            print('Time axis is not correct!')
            print(f'Error at index {i}: {time_data[i - 1]} to {time_data[i]}')
            return False
    return True
```
Use this function for your dataset:

```python
daily_time_axis_checker(ds, "time")
```

And it returns `False` because the data has an incorrect time axis:

```
Time axis is not correct!
Error at index 1741: 2100-10-08 12:00:00 to 2006-01-01 00:00:00
False
```
Proposed routine for your workflow -- if any data with the same issue is detected, regenerate time and time bounds:

```python
if not daily_time_axis_checker(ds):
    print('regenerate time axis and bounds')
    # regenerate time axis
    time = xr.cftime_range("2096-01-01", "2100-12-31", calendar=ds.time.dt.calendar)
    ds['time'].values[:] = time
    ds = ds.drop_vars(["time_bnds"]).bounds.add_missing_bounds(["T"])
```
As this is a data issue rather than an xcdat issue, I am closing it, but please feel free to reopen if you still have a related issue.
@bosup And if the above code works well for you I am going to implement it to PMP.
Thank you @lee1043! This is very useful for detecting data with bad time coordinates! It's probably not an ultimate solution for an automated workflow that loops through a list of data files, since the time range has to be specified manually (e.g., "2096-01-01", "2100-12-31"). But as we discussed, it is a data issue. Thanks for the workaround!
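One way to avoid specifying the range by hand: CMIP-style daily filenames conventionally end in `_YYYYMMDD-YYYYMMDD.nc`, so the start/end dates could be parsed from the path. A sketch, assuming the files follow that convention (the helper name `time_range_from_filename` is hypothetical):

```python
import re


def time_range_from_filename(path):
    """Extract (start, end) ISO date strings from a CMIP-style filename
    ending in ..._YYYYMMDD-YYYYMMDD.nc (a convention, not guaranteed)."""
    m = re.search(r"_(\d{8})-(\d{8})\.nc$", path)
    if m is None:
        raise ValueError(f"no YYYYMMDD-YYYYMMDD range found in {path!r}")
    # Insert dashes: "20960101" -> "2096-01-01"
    fmt = lambda s: f"{s[:4]}-{s[4:6]}-{s[6:]}"
    return fmt(m.group(1)), fmt(m.group(2))


start, end = time_range_from_filename(
    "va_day_NorESM1-M_rcp85_r1i1p1_20960101-21001231.nc"
)
print(start, end)  # → 2096-01-01 2100-12-31
```

The parsed `start`/`end` could then feed `xr.cftime_range(start, end, calendar=ds.time.dt.calendar)` in the regeneration routine above, making the loop over files fully automatic for files that follow the naming convention.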
Is your feature request related to a problem?
The time slicing method, e.g., `ds.sel(time=slice("2099-01-01", "2100-12-31"))`, doesn't work on a dataset whose time data is corrupted or not in chronological order, e.g., `va_day_NorESM1-M_rcp85_r1i1p1_20960101-21001231.nc` (data too big to be uploaded), with corrupted time coordinates:

```
[…, 34561.5, 34562.5, 34563.5, 34564.5, 34565.5, 34566.5, 34567.5, 34568.5,
 34569.5, 34570.5, 34571.5, 34572.5, 34573.5, 34574.5, 34575.5, 34576.5,
 34577.5, 34578.5, 34579.5, 34580.5, 34581.5, 34582.5, 34583.5, 34584.5,
 34585.5, 34586.5, 34587.5, 34588.5, 34589.5, 34590.5, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0 ]
```
Another example, `ua_day_CCSM4_rcp85_r6i1p1_20060101-20091231.nc` (data too big to be uploaded):

```
<xarray.DataArray 'time' (time: 1460)> Size: 12kB
array([cftime.DatetimeNoLeap(2006, 1, 1, 12, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2006, 1, 2, 12, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2006, 1, 3, 12, 0, 0, 0, has_year_zero=True),
       ...,
       cftime.DatetimeNoLeap(2005, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2005, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2005, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
      dtype=object)
```
Is it possible that, when using the slice method, it could at least return the timesteps that fall within the specified slice range, rather than failing with an error?
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response