tomvothecoder opened this issue 1 year ago
Just opening this issue to have it in our backlog for further consideration.
@tomvothecoder Now we can do this (with no axis specified): ds = ds.bounds.add_missing_bounds()
How about extending add_bounds() with an override boolean parameter with the options below?
override=False -> add missing bounds, but without raising a KeyError if bounds exist
override=True -> override existing bounds for the coordinates of an axis
Just weighing in that I usually lean toward convenience, but in this case I'm not so sure. It could lead users to systematically ignore the bounds that the dataset creator took the time to add. Could this be justified with some analysis (e.g., when we drop bounds, does xcdat generally replace the bounds with the correct values AND in some cases actually produce better bounds)?
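As a hedged sketch of the proposed semantics (using a plain dict as a stand-in for an xarray.Dataset, and a simple midpoint rule rather than xCDAT's actual bounds generation, so all names here are illustrative assumptions):

```python
def generate_bounds(coords):
    # Simple midpoint-based bounds, extrapolated at both ends. This is a
    # stand-in for xCDAT's real bounds generation, not its actual algorithm.
    mids = [(a + b) / 2 for a, b in zip(coords, coords[1:])]
    lower = [coords[0] - (mids[0] - coords[0])] + mids
    upper = mids + [coords[-1] + (coords[-1] - mids[-1])]
    return list(zip(lower, upper))


def add_bounds(dataset, axis, override=False):
    # Proposed flag semantics: keep existing bounds silently when
    # override=False, drop and regenerate them when override=True.
    key = f"{axis}_bnds"
    if key in dataset and not override:
        return dataset  # existing bounds win; no KeyError is raised
    dataset = dict(dataset)
    dataset[key] = generate_bounds(dataset[axis])
    return dataset
```

With override=False the dataset creator's bounds are always preserved; with override=True they are unconditionally regenerated, which is the behavior the rest of this thread debates.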
For users who analyze individual datasets one at a time, override=True might be a good option if they know that the bounds in their datasets are not as "good" as the ones produced by xCDAT.
On the other hand, for package developers, override=True might cause issues because xCDAT will always override the existing bounds of whatever dataset(s) are passed, regardless of whether the existing bounds are already "correct". Some package devs might use override=True without understanding the consequences of this assumption in their APIs. I think this is related to what you're saying by "it could lead users to systematically ignore the bounds that the dataset creator took the time to add." It can also apply to users who loop through many datasets at a time in their analysis work and set override=True.
@pochedls @lee1043 do you guys typically use bounds from the dataset as the default option in most cases (if they exist), or mainly use xCDAT's add bounds/add missing bounds APIs?
My thought was that the dataset bounds should be the higher priority for most cases.
My regular code flow in the PMP is like below. Since I am dealing with multiple models and know that some models may be missing their bounds info, I use add_missing_bounds for all models to fill in any missing bounds.
```python
for model in models:
    # Open the dataset.
    ds = xc.open_dataset(model_files)
    ds = ds.bounds.add_missing_bounds()
    # Then analyze.
    ...
```
@lee1043 – this will add the missing bounds, but this ticket is advocating for dropping the dataset bounds and then having xcdat try to infer them.
Oh, @pochedls thanks for clarifying, I must have confused it with add_missing_bounds! Sorry for the confusion. What I suggested here is about add_missing_bounds, not add_bounds.
I think the same scenarios and questions from the comments above will apply to add_missing_bounds() too.
My understanding is that any existing bounds take priority over xCDAT-generated bounds. If the existing bounds are incorrect for some datasets, that is a data quality issue that should probably be addressed explicitly by the user rather than by systematically dropping and replacing bounds using override=True (especially across many datasets in a for loop).
If package devs use add_missing_bounds(override=True) in their APIs, users might not be aware of this implicit behavior going on behind the scenes, which could be a bit dangerous.
@tomvothecoder agreed, thank you for clarifying!
I was considering the idea of a ds.bounds.drop_bounds() API for convenience and for special cases where bounds are incorrect. It could be nice to quickly drop bounds without having to use ds.bounds.keys to look up the bounds key(s) and then drop them with Xarray. However, it might inadvertently encourage users to drop existing bounds in favor of xCDAT bounds for all of their use cases, similar to what was mentioned above. Users also might call drop_bounds() and then forget to call add_bounds()/add_missing_bounds(), which will result in xCDAT APIs throwing errors about missing bounds.
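For context, the manual route today looks roughly like this. The toy dataset and variable names below are illustrative assumptions, not the real files; the point is the bookkeeping a drop_bounds() convenience API would hide:

```python
import numpy as np
import xarray as xr

# Toy dataset standing in for a real model file, with latitude bounds
# linked to the coordinate via the CF "bounds" attribute.
ds = xr.Dataset(
    coords={"lat": ("lat", np.array([-45.0, 0.0, 45.0]), {"bounds": "lat_bnds"})},
    data_vars={
        "lat_bnds": (
            ("lat", "bnds"),
            np.array([[-67.5, -22.5], [-22.5, 22.5], [22.5, 67.5]]),
        )
    },
)

# Manually look up the bounds key from the CF "bounds" attribute, then
# drop it with plain xarray.
bnds_key = ds["lat"].attrs["bounds"]
ds_no_bnds = ds.drop_vars(bnds_key)
```

After this, xCDAT APIs that require bounds would raise errors until add_bounds()/add_missing_bounds() is called, which is exactly the footgun described above.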
I think we can leave this type of data quality issue in the hands of the users. @pochedls and @lee1043 I'll close this issue if we're all on the same page here. Thanks!
@tomvothecoder I am okay with closing the issue, thank you for the discussion.
I know this is closed, but if there were obvious cases where we should drop a dataset's bounds and re-generate them, it would be useful to document them (in case we wanted to consider an override feature in the future).
@pochedls Sure, we can re-open this issue to gather ideas. If we don't have any substantial reason to have this feature, then we can re-close it.
I think I have had such cases, where I wanted to drop and regenerate bounds systematically for a collective operation across different models because some models' bounds information was not correctly generated. @tomvothecoder I like the idea of drop_bounds(), and as @pochedls said, we should document those cases in the API doc for drop_bounds(). How about that?
I was suggesting documenting the examples where this is an issue (before beginning work on this). I think I've seen this in FGOALS, but I think it is pretty rare. If you are routinely dropping bounds and re-generating them, it would be helpful to demonstrate that you fix more problems than you introduce. If you loop over 50 models, do you reproduce the correct bounds for the vast majority of models that are fine (while fixing the rare model that has bad bounds)?
Below is one example case from CMIP5, the ACCESS1-0 model, whose time bounds are not aligned with the time steps.

```python
input_file = "ts_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc"
ds = xc.open_dataset(input_file)

# Inspect the time coordinate and its bounds.
ds["time"]
ds["time_bnds"]
```
In such a case, maybe something like drop_time_bounds() and add_time_bounds() might be helpful, if they existed.
If I loop over 50 models, I might reproduce the correct bounds for the vast majority of models that are fine while fixing the rare model that has bad bounds. But I feel that would be more efficient than identifying the bad ones each time that happens, which I think is not rare.
My workaround considering such cases is like this:
```python
def reset_time_bounds(ds, freq="month"):
    time_bnds_key = get_time_bounds_key(ds)
    return ds.drop_vars([time_bnds_key]).bounds.add_time_bounds("freq", freq=freq)
```

Note: get_time_bounds_key() is a custom helper (its definition is not shown here).
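A minimal sketch of what such a helper might look like, assuming the dataset follows the CF convention where a coordinate's "bounds" attribute names its bounds variable (the fallback name and signature are assumptions, not the PMP's actual code):

```python
def get_time_bounds_key(ds, time_name="time"):
    # Hypothetical helper: under CF conventions, a coordinate's "bounds"
    # attribute names its bounds variable. Fall back to the common
    # "time_bnds" name if the attribute is missing.
    return ds[time_name].attrs.get("bounds", f"{time_name}_bnds")
```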
Just following up after some time of inactivity in this ticket.
> If I loop over 50 models, I might reproduce the correct bounds for the vast majority of models that are fine while fixing the rare model that has bad bounds. But I feel that would be more efficient than identifying the bad ones each time that happens, which I think is not rare.
@pochedls @lee1043 Is there a simple, programmatic way of identifying which models have bad bounds and only dropping and regenerating bounds for those models? It seems more computationally efficient and "correct" compared to dropping bounds for all models regardless if the bounds are correct or not (even if only a few are bad), then regenerating them.
I'm not sure... I think suspicious bounds could be identified, but in the example @lee1043 gives above – I'm pretty sure they are wrong, but I don't know that they are wrong. I don't really care if people systematically remove/regenerate the bounds as long as it is an option that is off by default (i.e., requires the user to do something to produce this behavior).
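One way suspicious bounds could be flagged programmatically is with simple sanity checks; this is a hedged heuristic sketch (not an xCDAT API): each interval should contain its own coordinate, and adjacent intervals should touch. It would catch the ACCESS1-0-style misalignment but, as noted, cannot prove bounds are correct.

```python
import numpy as np


def bounds_look_suspicious(time, time_bnds):
    # Heuristic sketch: flag bounds whose interval does not contain its
    # own coordinate, or where adjacent intervals do not touch.
    time = np.asarray(time, dtype="float64")
    bnds = np.asarray(time_bnds, dtype="float64")
    lower, upper = bnds[:, 0], bnds[:, 1]
    outside = (time < lower) | (time > upper)  # coordinate outside its bounds
    gaps = ~np.isclose(upper[:-1], lower[1:])  # gaps/overlaps between intervals
    return bool(outside.any() or gaps.any())
```

A loop over many models could then drop and regenerate bounds only for the datasets this flags, rather than for all of them.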
@tomvothecoder I don't know any programmatic way of identifying the issue other than looking at time bounds of each model individually.
I think the simplest, most convenient path is to implement a drop_bounds() API that automatically identifies the bounds to drop (via the CF bounds key) based on the specified axis/axes. That way the user doesn't need to manually identify the bounds to drop, which can be tedious/error-prone when looping over many models (unless the user implements their own custom function).
This API would be used at the discretion of the user, who should be aware that they need to regenerate bounds using add_bounds()/add_missing_bounds() before using other xCDAT APIs.
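A rough sketch of what the proposed API might look like (the function name, signature, and behavior are all assumptions from this discussion, not an existing xCDAT API):

```python
import numpy as np
import xarray as xr


def drop_bounds(ds, coord_names=("time",)):
    # Hypothetical sketch of the proposed API: find each coordinate's
    # bounds variable via its CF "bounds" attribute and drop it. A real
    # implementation might instead accept CF axes (T/X/Y/Z) and also
    # clear the now-dangling "bounds" attribute.
    keys = [
        ds[name].attrs["bounds"]
        for name in coord_names
        if name in ds.coords and "bounds" in ds[name].attrs
    ]
    return ds.drop_vars(keys)
```

Usage would then be drop_bounds(ds) followed by ds.bounds.add_bounds(...) or ds.bounds.add_missing_bounds() to regenerate.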
If both of you think this type of API would be generally useful to the broader user community, I'm happy to add it to the backlog as a lower-priority enhancement.
@tomvothecoder thank you for giving this some thought. That would definitely be useful for the PMP, although I don't have a good sense of how many broader users would need it (or would even be aware of the issue -- in which case, having that capability would be helpful).
This seems like a pretty risk-free feature to me. We could add a regenerate flag if we wanted to regenerate the bounds using xcdat. This also seems like a perfect dev day task. Maybe we should label this as "good first issue" or something?
The logical flow of drop_bounds() with a regenerate type flag makes more sense to me than add_bounds() with an override flag.
I updated the title of this issue and tagged it with "good first issue".
I agree, I like the regenerate idea.
Is your feature request related to a problem?
Per @lee1043 in our 3/15/23 meeting, bounds in the dataset might not always be correct and sometimes users want to override the bounds.
For example, monthly time bounds should be at the beginning and end of the month. If we have a time coordinate of (2000, 1, 1), its bounds should cover that entire month. Some datasets use incorrect time bounds, which can throw off analysis results.
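To make the expectation concrete, a hedged illustration using pandas (the interpretation of "beginning and end of the month" as the start of the month to the start of the next month is an assumption here):

```python
import pandas as pd

# For a monthly time coordinate at 2000-01-01, the expected bounds span
# the full month: from the start of January to the start of February.
t = pd.Timestamp(2000, 1, 1)
lower = t.to_period("M").to_timestamp()        # start of the month
upper = (t.to_period("M") + 1).to_timestamp()  # start of the next month
```

Bounds that are shifted relative to this (e.g., centered on mid-month boundaries, as in the ACCESS1-0 case above) would skew time-weighted averages.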
Describe the solution you'd like
Possible Solutions:
add_bounds() with an override boolean attribute:
override=False -> raises KeyError if bounds exist
override=True -> override existing bounds for the coordinates of an axis

Describe alternatives you've considered
No response
Additional context
Related to #418 and #434