xarray-contrib / xeofs

Comprehensive EOF analysis in Python with xarray: A versatile, multidimensional, and scalable tool for advanced climate data analysis
https://xeofs.readthedocs.io/
MIT License
98 stars 18 forks source link

CCA with nan values #177

Open jiwon-j opened 1 month ago

jiwon-j commented 1 month ago

Hello, now i'm trying to conduct CCA with two datasets, SST(ERA5) and precipitation(GPCP). i'm following this example(https://xeofs.readthedocs.io/en/latest/auto_examples/2multi/plot_cca.html), However, I get an error saying that the array should not contain infs or NaNs. I realize that the SST data used in the example code is a different ERSST than the one I used, but the SST data will naturally have a nan value for land, and that's what the result plot in the example shows. Do I need to pre-process the nan values before CCA?

nicrie commented 1 month ago

hey did you verify that you only have full-dimensional NaNs within your data? Grid cells representing land in SST data sets are full-dimensional and not a problem. However, there may be some hidden, individual NaNs within your data.

Can you post the exact error message?

jiwon-j commented 1 month ago

File ~/miniconda3/lib/python3.10/site-packages/xeofs/models/cca.py:312, in CCA._fit_algorithm(self, views) 309 [assert_not_complex(view) for view in views] 311 self.c = _process_parameter("c", self.c, 0, self.nviews) --> 312 eigvals, eigvecs = self._solve_gevp(views) 313 self.eigvals = eigvals 314 self.eigvecs = eigvecs ... --> 630 raise ValueError( 631 "array must not contain infs or NaNs") 632 return a

ValueError: array must not contain infs or NaNs

it show like above. but when i count the number of nan values for each year in dataset, all numbers are same. So I don't think this dataset has any nan values other than the nan value for land area. (I also checked that there was no inf value.)

nicrie commented 1 month ago

Would it be possible for you to share a minimal working example that shows the error? Perhaps you can isolate the error for a small subset of the data that you can share so I can have a look at it?

jiwon-j commented 1 month ago
data_list = [precip_xr, SST_xr]

model = xeofs.models.CCA(
    n_modes=2,
    use_coslat=True,
    pca=True,
    variance_fraction=0.9,
    init_pca_modes=0.30,
)
model.fit(data_list, dim="time")
components = model.components()
scores = model.scores()

two datas have same time steps. I just tried inputting only precipitation data and then only SST data (ex, data_list=[SST_xr, SST_xr] and data_list=[precip_xr, precip_xr]) and no errors appeared. Could it be problem that the two datasets are different and have different resolutions?

nicrie commented 1 month ago

Thank you for sharing the example. In theory, different spatial resolutions do not matter in the context of CCA.

I'm not immediately sure what the underlying issue might be. Could you provide more details about the dataset sizes, specifically the number of samples and grid points for both PRCP and SST? Have you experimented with the variance_fraction parameter? There might be edge cases where the PCA preprocessing selects either only one principal component or too many (exceeding the number of samples), leading to an ill-posed CCA problem and possibly resulting in NaNs.

I'd be happy to help you identify the issue :) But to assist you more effectively, I would need a fully reproducible example, including the data that demonstrates the error.

jiwon-j commented 1 month ago

thank you for your caring

  1. about dataset GPCP precipitation dataset have resolution of 2.5 x 2.5. In the domain I defined, the data has a total of 784 grids, each with 28 steps of latitude and longitude. and ERA5 SST dataset have resolution of 0.25 x 0.25. i have 801 longitude steps and 561 latitude steps, so it have total of 449361 grid points.

  2. at first, i set the variance_fraction to 0.9, but it still return the error 'ValueError: array must not contain infs or NaNs' whenever i set the option to the range of 0.1~0.9.

nicrie commented 1 month ago

Thanks for the info! What are the number of time steps in both data sets ?

jiwon-j commented 1 month ago

it was 20 same for both

nicrie commented 1 month ago

ok in that case it may be that you're indeed running into one of these edge cases mentioned above. Would you mind testing how many modes you need for each data set to explain at least 90 % of your variance? For this you want to fit an EOF(n_modes=20, standardize=False, use_coslat=True) model on both data sets separately and then print the model.explained_variance_ratio() where model is either SST or PRCP. Could you report back with the output for both models?

jiwon-j commented 1 month ago
  1. GPCP precipitation
model = xeofs.models.EOF(n_modes=20,
                        standardize=False,
                        use_coslat=True,)
model.fit(precip_P1_xr, dim='time')
model.explained_variance_ratio()

array([2.15341437e-01, 1.91357740e-01, 1.07027890e-01, 6.50667319e-02,
       6.04127807e-02, 5.37885277e-02, 4.60878220e-02, 3.65153038e-02,
       3.32812623e-02, 2.91941211e-02, 2.61814632e-02, 2.43078690e-02,
       2.32423052e-02, 1.88798822e-02, 1.86685204e-02, 1.50839098e-02,
       1.30147616e-02, 1.25216876e-02, 1.00259842e-02, 2.05406409e-16])

this data need at least 13 modes.

  1. ERA5 SST
    
    model = xeofs.models.EOF(n_modes=20,
                        standardize=False,
                        use_coslat=True,)
    model.fit(SST_P1_xr, dim='time')
    model.explained_variance_ratio()

array([4.18731037e-01, 9.03651828e-02, 7.90773958e-02, 5.90202030e-02, 5.12331694e-02, 4.36145692e-02, 3.98111592e-02, 3.75551818e-02, 2.69943871e-02, 2.56492026e-02, 2.26891458e-02, 1.77368605e-02, 1.70833855e-02, 1.59150489e-02, 1.38574669e-02, 1.27135494e-02, 1.11173236e-02, 8.87997290e-03, 7.95575824e-03, 7.39957211e-16])



SST need at least 12 modes to explain 90% of variance.
nicrie commented 1 month ago

I'm not so sure about this anymore... Setting variance_fraction=0.5 should result in selecting 3 and 2 principal components, which seems fine. I don't see anything unusual in the variance_fraction range of 0.1 to 0.9 that you mentioned earlier. I fear that without a fully reproducible example, including the data, I can't provide further help. Sorry about that.

jiwon-j commented 1 month ago

That's okay, thanks for your help in the meantime. I'll try a few more things and update if the issue is resolved.

nicrie commented 1 week ago

hey @jiwon-j just in case you're still facing the same problem, you may give the new (xeofs version 3) xe.cross.CCA class a try and see if it solves the issue you had? This implementation of CCA is simpler (only allows 2 sets of data), but it's likely more robust than the one you used in version 2.