xarray-contrib / xskillscore

Metrics for verifying forecasts
https://xskillscore.readthedocs.io/en/stable/
Apache License 2.0
222 stars 38 forks source link

how to apply crps over multiple dims? #93

Closed raybellwaves closed 4 years ago

raybellwaves commented 4 years ago

in the read me we don't specify the dim and it returns the shame same as obs

import xarray as xr
import numpy as np
import xskillscore as xs

### Probabilistic
obs3 = xr.DataArray(
    np.random.rand(4, 5),
    coords=[np.arange(4), np.arange(5)],
    dims=["lat", "lon"]
)
fct3 = xr.DataArray(
    np.random.rand(3, 4, 5),
    coords=[np.arange(3), np.arange(4), np.arange(5)],
    dims=["member", "lat", "lon"],
)

crps_ensemble = xs.crps_ensemble(obs3, fct3, dim='metric')

In other metrics you can specify the dim as

xs.pearson_r(obs3, fct3.isel(member=0), dim='lat')
# Shape lon

To get crps of shape lon do I just specify what goes into the metric?

xs.crps_ensemble(obs3.isel(lat=0), fct3.isel(lat=0))

I know dim in this metric refers to the ensemble dimension dim='member' https://github.com/raybellwaves/xskillscore/blob/master/xskillscore/core/probabilistic.py#L92

I know there has been discussion on this before https://github.com/TheClimateCorporation/properscoring/issues/6

aaronspring commented 4 years ago

do you want to apply crps on a dim other than member?

raybellwaves commented 4 years ago

I agree I think we should always apply over member. In my example above to return shape lon it would be over lat and member

I think I want to stack similar to https://github.com/raybellwaves/xskillscore/blob/46fa8528741a0dcc6caf4e55a6079c3120766711/xskillscore/core/deterministic.py#L82 ... will come back to this

Braining storming: I'm just curious if in my example above (if that is correct)

xs.crps_ensemble(obs3.isel(lat=0), fct3.isel(lat=0))

could be written under the hood similar to other functions

def crps_ensemble(a, b, dim):
# Raise Error if member not in dim...

# reshape data to match user specified dim
a = a.isel(dim=0)
b = b.isel(dim=0)
# continue with existing code...

xs.crps_ensemble(obs3, fct3, dim=['lat', 'member'])

then the new method: xs.crps_ensemble(obs3, fct3, 'member') should be same as old method xs.crps_ensemble(obs3, fct3)

aaronspring commented 4 years ago

I dont understand what you mean by lon shape

aaronspring commented 4 years ago

I dont understand why xs.crps_ensemble(obs3, fct3, dim=['lat', 'member'])

raybellwaves commented 4 years ago

I dont understand what you mean by lon shape

Apply crps over lat and member.

I dont understand why xs.crps_ensemble(obs3, fct3, dim=['lat', 'member'])

Similar to other metrics e.g. for apply pearson_r over lat and lon xs.pearson_r(obs3, fct3.isel(member=0), dim=['lat', 'lon'])

Goes back to the discussion we had of implementing multi-dim for deterministic: why?

Guess you could do crps over Nino3.4 region = xs.crps_ensemble(obs3, fct3, dim=['lat', 'lon', 'member'])?

aaronspring commented 4 years ago

You can average over lon lat after apply to member. I think crps is only well defined over member.

raybellwaves commented 4 years ago

You can average over lon lat after apply to member. I think crps is only well defined over member.

Probably best suggestion. Thanks

aaronspring commented 4 years ago

But probably averaging before and then calculating skill is also an option

raybellwaves commented 4 years ago

But probably averaging before and then calculating skill is also an option

Was just going that ask. You got a preference? I feel average before is probably more interpretable

import xarray as xr
import numpy as np
import pandas as pd
import xskillscore as xs

o = xr.DataArray(
    np.random.rand(3, 4, 5),
    coords=[
        pd.date_range("1/1/2000", "1/3/2000", freq="D"),
        np.arange(4),
        np.arange(5),
    ],
    dims=["time", "lat", "lon"],
)
f = xr.DataArray(
    np.random.rand(3, 4, 5, 6),
    coords=[
        pd.date_range("1/1/2000", "1/3/2000", freq="D"),
        np.arange(4),
        np.arange(5),
        np.arange(6)
    ],
    dims=["time", "lat", "lon", "member"],
)

o_avg = o.mean(dim=['lat', 'lon'])
f_avg = f.mean(dim=['lat', 'lon'])
xs.crps_ensemble(o_avg, f_avg, dim='member')

crps = xs.crps_ensemble(o, f, dim='member')
crps.mean(dim=['lat', 'lon'])

values are on order of magnitude larger in crps.mean(dim=['lat', 'lon'])

>>> xs.crps_ensemble(o_avg, f_avg, dim='member')
<xarray.DataArray (time: 3)>
array([0.00819853, 0.04241865, 0.03740562])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
>>> crps = xs.crps_ensemble(o, f, dim='member')
>>> 
>>> crps.mean(dim=['lat', 'lon'])
<xarray.DataArray (time: 3)>
array([0.16495624, 0.18920637, 0.1971401 ])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
>>>
aaronspring commented 4 years ago

I would say it depends on what you want to check:

aaronspring commented 4 years ago

I wouldnt use the dim args in crps though:

crps = xs.crps_ensemble(o, f, dim='member')

This implies that you can calculate crps_ensemble also over dim other than member which is not defined. I would also suggest to take this out of the README example.

raybellwaves commented 4 years ago

@aaronspring I agree. welcome to to make a PR

dougiesquire commented 4 years ago

The member dimension is a "special" dimension in the calculation of probabilistic metrics. Often "applying over multiple dims" for these metrics simply means "averaging metric over multiple dims" (for example, this is the case for crps as @aaronspring mentions above).

Given this, is it worth distinguishing between a "member_dim" and "dim" input to the probabilistic functions?

aaronspring commented 4 years ago

probably yes. either we require member as the member dim (as currently) or have an explicit keyword for this in every function member_dim='member'.

aaronspring commented 4 years ago

So we could add dim to the args in crps and pass all dim item expect member into a mean function?