xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0
116 stars · 12 forks

[Enhancement]: Add `ds.bounds.drop_bounds()` API for programmatically dropping bounds via CF `bounds` key #435

Open tomvothecoder opened 1 year ago

tomvothecoder commented 1 year ago

Is your feature request related to a problem?

Per @lee1043 in our 3/15/23 meeting, bounds in the dataset might not always be correct and sometimes users want to override the bounds.

For example, monthly time bounds should be at the beginning and end of the month. If we have a time coordinate (2000, 1, 1):

# Time coordinate with incorrect time bounds
(2000, 1, 1) -> [(1999, 12, 16), (2000, 1, 16)]

# Time coordinate with correct time bounds
(2000, 1, 1) -> [(2000, 1, 1), (2000, 1, 31)]

Some datasets might use the incorrect time bounds which can throw off analysis results.
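For concreteness, the "correct" monthly bounds in the example above (first and last day of the coordinate's month) can be computed with the standard library alone. This is just a minimal sketch of the expected values, not xCDAT's actual bounds-generation code:

```python
import calendar
from datetime import datetime

def monthly_bounds(year, month):
    """Return the correct lower/upper time bounds for a monthly time
    coordinate: the first and last day of that month, matching the
    example above."""
    last_day = calendar.monthrange(year, month)[1]
    return (datetime(year, month, 1), datetime(year, month, last_day))

lower, upper = monthly_bounds(2000, 1)
```

Here `monthly_bounds(2000, 1)` yields the `[(2000, 1, 1), (2000, 1, 31)]` pair shown above, whereas the incorrect dataset bounds straddle the month boundary.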

Describe the solution you'd like

Possible Solutions:

  1. The user drops existing bounds manually and adds xCDAT bounds
ds = xcdat.open_mfdataset("path")

# User needs to know the name of the time bounds to drop. If they are looping over many datasets, this might be tricky because the name can vary. Users can probably write a function with `cf_xarray` to do this.
ds = ds.drop_vars("time_bnds")

# Option 1
ds = ds.bounds.add_bounds(axis="T")
# Option 2
ds = ds.bounds.add_missing_bounds(axis="T")

# With both options, the frequency for the time bounds is inferred (e.g., monthly time bounds are set if xCDAT infers monthly time coordinates), pending PRs #418 and #434
  2. Extend add_bounds() with an override boolean argument.
    • If override=False -> raise a KeyError if bounds already exist
    • If override=True -> override existing bounds for the coordinates of an axis
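The override semantics in option 2 could be sketched as follows. This is purely hypothetical (not the actual xCDAT implementation): a plain dict stands in for an xarray.Dataset, and the `generate` callable stands in for xCDAT's bounds inference:

```python
def add_bounds(dataset, axis, generate, override=False):
    """Hypothetical sketch of add_bounds() with an `override` flag.

    `dataset` is a plain dict mapping axis names to bounds, standing in
    for an xarray.Dataset; `generate` is a callable producing new bounds
    for an axis (standing in for xCDAT's bounds inference).
    """
    if axis in dataset and not override:
        raise KeyError(
            f"Bounds already exist for axis {axis!r}. "
            "Pass override=True to replace them."
        )
    dataset = dict(dataset)          # do not mutate the caller's object
    dataset[axis] = generate(axis)   # replace or add the bounds
    return dataset
```

With `override=False` (the default), existing bounds are left untouched and an error is raised; with `override=True`, they are silently replaced.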

Describe alternatives you've considered

No response

Additional context

Related to #418 and #434

tomvothecoder commented 1 year ago

Just opening this issue to have it in our backlog for further consideration

lee1043 commented 7 months ago

@tomvothecoder Since we can now do this (without specifying an axis): ds = ds.bounds.add_missing_bounds()

How about extending add_bounds() with an override boolean argument, with the options below:

pochedls commented 7 months ago

Just weighing in that I usually lean toward convenience, but in this case I'm not so sure. It could lead users to systematically ignore the bounds that the dataset creator took the time to add. Could this be justified with some analysis (e.g., when we drop bounds, xcdat generally replaces the bounds with the correct values AND in some cases it actually leads to better bounds)?

tomvothecoder commented 7 months ago

> Just weighing in that I usually lean toward convenience, but in this case I'm not so sure. It could lead users to systematically ignore the bounds that the dataset creator took the time to add. Could this be justified with some analysis (e.g., when we drop bounds, xcdat generally replaces the bounds with the correct values AND in some cases it actually leads to better bounds)?

For users who analyze individual datasets at a time, override=True might be a good option if they have the knowledge that the bounds in their datasets are not as "good" as the ones produced by xCDAT.

On the other hand, for package developers, override=True might cause issues because xCDAT will always override existing bounds for whatever dataset(s) are passed, regardless of whether the existing bounds are already "correct". Some package devs might use override=True without understanding the consequences of this assumption in their APIs. I think this is related to what you're saying by "it could lead users to systematically ignore the bounds that the dataset creator took the time to add." It can also apply to users who loop through many datasets at a time in their analysis work and set override=True.

tomvothecoder commented 7 months ago

@pochedls @lee1043 do you guys typically use bounds from the dataset as the default option in most cases (if they exist), or mainly use xCDAT's add_bounds()/add_missing_bounds() APIs?

My thought was that the dataset bounds should be the higher priority for most cases.

lee1043 commented 7 months ago

My regular code flow in the PMP is as below. Since I am dealing with multiple models and know that some models may have missing bounds info, I use add_missing_bounds for all models to fill in any that are missing.

for model in models:
  # open
  ds = xc.open_dataset(model_files)
  ds = ds.bounds.add_missing_bounds()
  # Then analyze
  ....
pochedls commented 7 months ago

@lee1043 – this will add the missing bounds, but this ticket is advocating for dropping the dataset bounds and then having xcdat try to infer them.

lee1043 commented 7 months ago

Oh... @pochedls thanks for clarifying, I must have confused this with add_missing_bounds! Sorry for the confusion. What I suggested here is about add_missing_bounds, not add_bounds.

tomvothecoder commented 7 months ago

> Oh... @pochedls thanks for clarifying, I must have confused this with add_missing_bounds! Sorry for the confusion. What I suggested here is about add_missing_bounds, not add_bounds.

I think the same scenarios and questions from the comments above will apply to add_missing_bounds() too.

My understanding is that any existing bounds take priority over xCDAT generated bounds. If the existing bounds are incorrect for some datasets, it would be a data quality issue that should probably be addressed explicitly by the user rather than systematically dropping and replacing bounds using override=True (especially to many datasets in a for loop).

If package devs use add_missing_bounds(override=True) in their APIs, users might not be aware of this implicit behavior going on behind the scenes which might be a bit dangerous.

lee1043 commented 7 months ago

@tomvothecoder agreed, thank you for clarifying!

tomvothecoder commented 7 months ago

I was considering the idea of a ds.bounds.drop_bounds() API for convenience and for special cases where bounds are incorrect. It could be nice to quickly drop bounds without having to use ds.bounds.keys to look up the bounds key(s) and then drop them with Xarray. However, it might inadvertently encourage users to drop existing bounds in favor of xCDAT bounds for all of their use cases, similar to what was mentioned above. Users might also call drop_bounds() and then forget to call add_bounds()/add_missing_bounds(), which will result in xCDAT APIs throwing errors about missing bounds.

I think we can leave this type of data quality issue in the hands of the users. @pochedls and @lee1043 I'll close this issue if we're all on the same page here. Thanks!

lee1043 commented 7 months ago

@tomvothecoder I am okay with closing the issue, thank you for the discussion.

pochedls commented 7 months ago

I know this is closed, but if there were obvious cases where we should drop a dataset's bounds and re-generate them it would be useful to document them (in case we wanted to consider an override feature in the future).

tomvothecoder commented 7 months ago

@pochedls Sure, we can re-open this issue to gather ideas. If we don't have any substantial reason to have this feature, then we can re-close it.

lee1043 commented 7 months ago

> I know this is closed, but if there were obvious cases where we should drop a dataset's bounds and re-generate them it would be useful to document them (in case we wanted to consider an override feature in the future).

I think I have had such cases where I wanted to drop and regenerate bounds systematically for collective operations across different models, because some models' bounds information was not correctly generated. @tomvothecoder I like the idea of drop_bounds(), and as @pochedls said, we should document those cases in the API doc for drop_bounds(). How about that?

pochedls commented 7 months ago

I was suggesting documenting the examples where this is an issue (before beginning work on this). I think I've seen this in FGOALS, but I think it is pretty rare. If you are routinely dropping bounds and re-generating them, it would be helpful to demonstrate that you fix more problems than you introduce. If you loop over 50 models, do you reproduce the correct bounds for the vast majority of models that are fine (while fixing the rare model that has bad bounds)?

lee1043 commented 7 months ago

Below is one example case from CMIP5, the ACCESS1-0 model, whose time bounds are not aligned with the time steps.

input_file = "ts_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc"
ds = xc.open_dataset(input_file)
ds["time"]
ds["time_bnds"]
[Screenshots showing the dataset's time coordinate and its misaligned time_bnds values]

In such cases, something like drop_time_bounds() and add_time_bounds() might be helpful, if available.

If I loop over 50 models, I might reproduce the correct bounds for the vast majority of models that are fine while fixing the rare model that has bad bounds. But I feel that would be more efficient than identifying the bad ones each time that happens, which I think is not rare.

My workaround considering such cases is like this:

def reset_time_bounds(ds, freq="month"):
    time_bnds_key = get_time_bounds_key(ds)
    return ds.drop_vars([time_bnds_key]).bounds.add_time_bounds("freq", freq=freq)

Note: get_time_bounds_key() is a custom helper function.
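A minimal version of such a helper could follow the CF convention directly: the time coordinate's `bounds` attribute names the bounds variable. The sketch below is hypothetical (not lee1043's actual PMP code) and uses a dict of attribute dicts standing in for `ds.variables`; in practice, cf_xarray offers a similar lookup:

```python
def get_time_bounds_key(attrs_by_var, time_name="time"):
    """Hypothetical stand-in for the custom helper above: look up the
    bounds variable named by the CF `bounds` attribute of the time
    coordinate.

    `attrs_by_var` maps variable names to their attribute dicts, i.e.
    {name: var.attrs for name, var in ds.variables.items()}.
    """
    bounds_key = attrs_by_var.get(time_name, {}).get("bounds")
    if bounds_key is None:
        raise KeyError(f"No CF 'bounds' attribute found on {time_name!r}")
    return bounds_key
```

This avoids hard-coding a name like "time_bnds", which varies across models.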

tomvothecoder commented 1 month ago

Just following up after some time of inactivity in this ticket.

> If I loop over 50 models, I might reproduce the correct bounds for the vast majority of models that are fine while fixing the rare model that has bad bounds. But I feel that would be more efficient than identifying the bad ones each time that happens, which I think is not rare.

@pochedls @lee1043 Is there a simple, programmatic way of identifying which models have bad bounds and only dropping and regenerating bounds for those models? It seems more computationally efficient and "correct" compared to dropping bounds for all models regardless of whether the bounds are correct (even if only a few are bad), and then regenerating them.

pochedls commented 1 month ago

I'm not sure... I think suspicious bounds could be identified, but in the example @lee1043 gives above, I'm pretty sure they are wrong, but I don't know that they are wrong. I don't really care if people systematically remove/regenerate the bounds as long as it is an option that is off by default (i.e., it requires the user to do something to produce this behavior).
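One partial heuristic along these lines is to flag time steps whose coordinate falls outside its own bounds. This is only a sketch of a weak check, not a reliable detector: bounds that are shifted but still contain the coordinate (like the monthly example at the top of this issue) would pass it:

```python
from datetime import datetime

def flag_misaligned_bounds(times, bounds):
    """Return the indices of time steps whose coordinate value falls
    outside its own [lower, upper] bounds. A weak heuristic: it only
    catches gross misalignment, not bounds that are shifted but still
    contain the coordinate."""
    return [
        i for i, (t, (lower, upper)) in enumerate(zip(times, bounds))
        if not (lower <= t <= upper)
    ]
```

A containment check like this could at least narrow down which models deserve a manual look before dropping and regenerating their bounds.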

lee1043 commented 1 month ago

> Just following up after some time of inactivity in this ticket.
>
> > If I loop over 50 models, I might reproduce the correct bounds for the vast majority of models that are fine while fixing the rare model that has bad bounds. But I feel that would be more efficient than identifying the bad ones each time that happens, which I think is not rare.
>
> @pochedls @lee1043 Is there a simple, programmatic way of identifying which models have bad bounds and only dropping and regenerating bounds for those models? It seems more computationally efficient and "correct" compared to dropping bounds for all models regardless if the bounds are correct or not (even if only a few are bad), then regenerating them.

@tomvothecoder I don't know any programmatic way of identifying the issue other than looking at time bounds of each model individually.

tomvothecoder commented 1 month ago

I think the simplest, most convenient path is to implement a drop_bounds() API that automatically identifies the bounds to drop (via the CF bounds key) based on the specified axis/axes. That way the user doesn't need to manually identify the bounds to drop, which can be tedious and error-prone when looping over many models (unless the user implements their own custom function).

This API would be used at the discretion of the user, who should be aware that they need to regenerate bounds using add_bounds()/add_missing_bounds() before using other xCDAT APIs.
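As a rough illustration of the proposed behavior (hypothetical, since drop_bounds() is not an implemented xCDAT API), the lookup could walk the CF `bounds` attribute of each requested coordinate and collect the variables to drop. A dict of attribute dicts stands in for `ds.variables` here:

```python
def find_bounds_to_drop(attrs_by_var, coord_names):
    """Hypothetical sketch of the lookup behind a drop_bounds() API:
    collect the bounds variables named by each coordinate's CF `bounds`
    attribute. The caller would then pass the result to ds.drop_vars()
    and later regenerate bounds with add_bounds()/add_missing_bounds().

    `attrs_by_var` maps variable names to their attribute dicts.
    """
    to_drop = []
    for coord in coord_names:
        key = attrs_by_var.get(coord, {}).get("bounds")
        # Only drop bounds variables that actually exist in the dataset.
        if key is not None and key in attrs_by_var:
            to_drop.append(key)
    return to_drop
```

Driving the lookup off the CF `bounds` attribute is what frees users from hard-coding names like "time_bnds" that vary across models.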

tomvothecoder commented 1 month ago

If both of you think this type of API would be generally useful to the broader user community, I'm happy to add it to the backlog as a lower-priority enhancement.

lee1043 commented 1 month ago

@tomvothecoder thank you for giving this some thought. That would definitely be useful for the PMP, although I don't have a good sense of how much the broader user community would need it (or even be aware of the issue -- in which case, having that capability would be helpful).

pochedls commented 1 month ago

This seems like a pretty risk-free feature to me. We could add a regenerate flag if we wanted to regenerate the bounds using xcdat. This also seems like a perfect dev day task. Maybe we should label this as "good first issue" or something?

tomvothecoder commented 1 month ago

> This seems like a pretty risk-free feature to me. We could add a regenerate flag if we wanted to regenerate the bounds using xcdat. This also seems like a perfect dev day task. Maybe we should label this as "good first issue" or something?

The logical flow of drop_bounds() with a regenerate type flag makes more sense to me than add_bounds() with an override flag.

I updated the title of this issue and tagged it with "good first issue".

lee1043 commented 1 month ago

I agree, and I like the regenerate idea.