[Feature]: skipna parameter for averager

lee1043 commented 11 months ago

Is your feature request related to a problem?

xarray's mean function has a skipna=None parameter that lets the user decide whether to skip NaN values when averaging (so the average is computed from the non-NaN values only) or to return NaN whenever any of the input values is NaN.

https://docs.xarray.dev/en/stable/generated/xarray.DataArray.mean.html
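
For reference, a minimal sketch of the xarray behavior described above, using a made-up four-element array (the values are illustrative only):

```python
import numpy as np
import xarray as xr

# Illustrative data: one missing value out of four.
da = xr.DataArray([1.0, 2.0, np.nan, 4.0], dims="time")

# skipna=True ignores the NaN and averages the remaining values: (1 + 2 + 4) / 3.
mean_skip = da.mean(skipna=True)

# skipna=False lets the NaN propagate, so the result is NaN.
mean_noskip = da.mean(skipna=False)

# skipna=None (the default) skips NaN for float dtypes.
mean_default = da.mean()
```

The request here is for the spatial averager to accept the same switch.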

Describe the solution you'd like

Pass the skipna keyword through to this call: https://github.com/xCDAT/xcdat/blob/623814821a748bd2e2acc52971b359550c31913b/xcdat/spatial.py#L737

The same applies to the temporal averaging functions wherever .mean is used.

Describe alternatives you've considered

No response

Additional context

It would be even more helpful if users could set some criteria. For example, letting the user decide the fraction of NaN values.

Let's say I have 10 values, 2 of which are NaN. In that case I want the average computed with skipna=True. But with 3 NaN values, I want the average to be NaN.

This would help the obs4MIPs process when handling time-varying NaN values caused by missed observation points.
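
One way the criterion could look, as a rough sketch only. The function name `mean_with_nan_limit` and the `max_nan_fraction` parameter are hypothetical and not part of xCDAT or xarray, and counting NaNs by number of points (rather than by weight) is an assumption:

```python
import numpy as np

def mean_with_nan_limit(values, weights, max_nan_fraction=0.2):
    """Weighted mean that skips NaNs unless too many values are missing.

    Hypothetical helper: returns NaN when the fraction of NaN values
    exceeds max_nan_fraction, otherwise averages the non-NaN values.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    valid = ~np.isnan(values)
    n_nan = values.size - int(valid.sum())

    # Too many missing points (or nothing left to average): give up.
    if n_nan / values.size > max_nan_fraction or not valid.any():
        return float("nan")

    # Renormalize by the weights of the valid points only.
    return float(np.sum(values[valid] * weights[valid]) / np.sum(weights[valid]))

w = np.ones(10)
two_nans = [1, 2, 3, 4, 5, 6, 7, 8, np.nan, np.nan]
three_nans = [1, 2, 3, 4, 5, 6, 7, np.nan, np.nan, np.nan]

result_two = mean_with_nan_limit(two_nans, w)      # averaged over the 8 valid values
result_three = mean_with_nan_limit(three_nans, w)  # NaN, since 3/10 exceeds 0.2
```

A weight-based fraction (share of total weight that is missing) would be a natural alternative for area-weighted spatial averages.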

tomvothecoder commented 11 months ago

Thanks for this feature suggestion, Jiwoo. I agree, we should have a skipna flag to replicate what Xarray offers.

Additional context

It would be even more helpful if users could set some criteria. For example, letting the user decide the fraction of NaN values.

Let's say I have 10 values, 2 of which are NaN. In that case I want the average computed with skipna=True. But with 3 NaN values, I want the average to be NaN.

This would help the obs4MIPs process when handling time-varying NaN values caused by missed observation points.

This sounds similar to the weight_threshold feature mentioned here https://github.com/xCDAT/xcdat/issues/531.

Can you provide some pseudo-code? Better yet, a prototype Python implementation would be great.

tomvothecoder commented 9 months ago

I think an alternative solution to skipna is for the user to drop NaN values before calculating the average. @pochedls any thoughts on this specific enhancement?

pochedls commented 9 months ago

I'm wondering if this would work. If we were dealing with time series:

ds.time = ["2010-01-01", "2010-02-01", "2010-03-01", "2010-04-01"]
ds.ts = [1, 2, np.nan, 4]

I think dropping the NaN would also drop the time point, which would create problems for a lot of applications. If I instead had a [lat, lon] matrix:

ts = [[1, 2, 3],
      [4, 5, 6],
      [7, np.nan, 9]]

I'm not sure how this would work. What would the ts matrix shape be – it would no longer be a [lat, lon] grid?

Or am I thinking about this the wrong way?
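
To make the concern concrete, here is a small sketch using xarray's dropna on the two examples above. The coordinate-free [lat, lon] grid and the variable names are illustrative assumptions:

```python
import numpy as np
import xarray as xr

# Time series case: dropping the NaN also removes its time step.
ts_time = xr.DataArray(
    [1.0, 2.0, np.nan, 4.0],
    dims="time",
    coords={"time": ["2010-01-01", "2010-02-01", "2010-03-01", "2010-04-01"]},
)
dropped_time = ts_time.dropna(dim="time")  # "2010-03-01" is no longer present

# Grid case: dropna works along one dimension, so a single NaN cell
# removes the entire latitude row that contains it.
ts_grid = xr.DataArray(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, np.nan, 9.0]],
    dims=("lat", "lon"),
)
dropped_rows = ts_grid.dropna(dim="lat")  # shape goes from (3, 3) to (2, 3)

# Skipping NaN inside the reduction leaves the grid intact instead.
grid_mean = ts_grid.mean(skipna=True)
```

So dropping values changes the coordinates or discards valid neighbors, whereas skipna only affects which cells enter the average.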

lee1043 commented 6 months ago

@tomvothecoder @pochedls sorry that I haven't fully followed this, but I'm wondering if there is any chance to follow up on this, as Celine reached out about the same issue -- she wants to compute a spatial average over data that includes NaN values.

pochedls commented 6 months ago

@lee1043 – I don't think I can work on this soon. This could be an easy PR (or "dev day" issue) depending on the complexity of the implementation. There might also be workarounds using get_weights (and then computing the mean yourself).
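
A rough sketch of what "computing the mean yourself" might look like. To stay self-contained it does not call xCDAT: the cos(latitude) weights below are only a stand-in for whatever get_weights returns, and the grid and values are made up:

```python
import numpy as np
import xarray as xr

# Illustrative 3x3 grid with one missing cell.
ts = xr.DataArray(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, np.nan, 9.0]],
    dims=("lat", "lon"),
    coords={"lat": [-45.0, 0.0, 45.0], "lon": [0.0, 120.0, 240.0]},
)

# Stand-in weights (assumption): proportional to cos(latitude).
weights = np.cos(np.deg2rad(ts.lat))

# Zero out the weight of every NaN cell, then renormalize by the
# weights of the valid cells only.
valid = ts.notnull()
manual_mean = (ts * weights).where(valid).sum() / weights.where(valid).sum()

# xarray's own weighted mean skips NaN for float data by default,
# which gives the same number and serves as a cross-check.
xr_mean = ts.weighted(weights).mean(dim=("lat", "lon"))
```

With real data, the weights array from get_weights would take the place of the cos(latitude) stand-in.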

tomvothecoder commented 6 months ago

@tomvothecoder @pochedls sorry that I haven't fully followed this, but I'm wondering if there is any chance to follow up on this, as Celine reached out about the same issue -- she wants to compute a spatial average over data that includes NaN values.

If you or somebody else can provide pseudo-code or a prototype Python implementation, it would help speed up the implementation whenever @pochedls or I (or somebody else) have time. My dev time for new xCDAT features will be limited for the next few months because of conferences and other priorities.
