xarray-contrib / xeofs

Comprehensive EOF analysis in Python with xarray: A versatile, multidimensional, and scalable tool for advanced climate data analysis
https://xeofs.readthedocs.io/
MIT License
98 stars 18 forks

Why is the explained_variance_ratio of CCA so small? #146

Closed singledoggy closed 7 months ago

singledoggy commented 8 months ago

Example

I used my own data and got an extremely low explained_variance_ratio, so I tried the example data:

import xarray as xr
import xeofs as xe

import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import cartopy.crs as ccrs
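
# Load monthly SST and remove the monthly climatology to obtain anomalies
sst = xr.tutorial.load_dataset("ersstv5").sst
sst = sst.groupby("time.month") - sst.groupby("time.month").mean("time")

# Select three ocean basins (the latitude slices run from north to south)
indian = sst.sel(lon=slice(35, 115), lat=slice(30, -30))
pacific = sst.sel(lon=slice(130, 290), lat=slice(30, -30))
atlantic = sst.sel(lon=slice(320, 360), lat=slice(70, 10))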

data_list = [indian, pacific, atlantic]

model = xe.models.CCA(
    n_modes=2,
    use_coslat=True,
    pca=True,
    variance_fraction=0.9,
    init_pca_modes=0.30,
)
model.fit(data_list, dim="time")
components = model.components()
scores = model.scores()

The values returned by model.explained_variance_ratio() are also very small:

[<xarray.DataArray (mode: 2)>
 array([0.00583575, 0.00906892])
 Coordinates:
   * mode     (mode) int64 1 2
 Attributes: (12/16)
     model:          EOF analysis
     software:       xeofs
     version:        2.2.4
     date:           2024-01-22 22:57:11
     n_modes:        187
     center:         True
     ...             ...
     feature_name:   feature
     random_state:   None
     verbose:        False
     compute:        True
     solver:         auto
     solver_kwargs:  {},

Question

If I decrease init_pca_modes=0.30, I get the warning "variance fraction 0.9000 is not reached. Only 0.7529 of variance is explained." I understand this to refer to the variance retained in the PCA preprocessing step.

Does .explained_variance_ratio() here mean the same thing as in EOF analysis? I assumed it should give the variance explained in indian, pacific, and atlantic respectively, but that does not seem to be the case.
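
For reference, the EOF/PCA meaning of the explained variance ratio can be sketched with plain NumPy. This is only an illustration on made-up random data (the matrix `X` and the `amplitudes` are hypothetical), not the xeofs implementation: each mode's variance is proportional to its squared singular value, and the ratios over all modes sum to one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical anomaly matrix: 200 time steps x 5 grid points,
# with a different amplitude at each grid point.
amplitudes = np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = rng.standard_normal((200, 5)) * amplitudes
X = X - X.mean(axis=0)  # remove the time mean at every grid point

# The squared singular values are proportional to the variance of each mode.
singular_values = np.linalg.svd(X, compute_uv=False)
explained_variance = singular_values**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(explained_variance_ratio)
```

In this setting the leading modes carry most of the variance by construction, which is the intuition behind expecting "large" ratios from an EOF analysis.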

nicrie commented 7 months ago

First off, my apologies for the late reply - I had an extremely busy week till now.

So, I cannot claim to be an expert on CCA; I have only used it from time to time. In xeofs, the CCA implementation is based on and tested against the PCACCA implementation of the CCA-Zoo package.

All I can say right now is that your conception of the explained variance (ratio) from PCA should also hold for CCA. The only thing to keep in mind is that CCA maximizes the correlation between different fields, which does not necessarily imply a high amount of explained variance. That being said, I also find the amount of explained variance in the given example very low. I'll try to double-check in the coming week and keep you posted on this @singledoggy
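
To illustrate why a high correlation does not imply a high explained variance, here is a minimal NumPy sketch on synthetic data. It is not the xeofs or CCA-Zoo code: the helper names `first_cca_mode` and `explained_variance_fraction` are made up for this example, and the explained variance is computed under one possible definition (the variance of each field captured by regressing it on its canonical variate). Both fields share a weak common signal `z` but are dominated by an independent, high-variance column.

```python
import numpy as np

def first_cca_mode(X, Y):
    """Return the weights (wx, wy) and correlation of the leading CCA mode."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1)
    Cyy = Y.T @ Y / (n - 1)
    Cxy = X.T @ Y / (n - 1)
    # Whiten both fields with Cholesky factors, then take an SVD of the
    # whitened cross-covariance; its singular values are the correlations.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    K = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(K)
    wx = np.linalg.solve(Lx.T, U[:, 0])
    wy = np.linalg.solve(Ly.T, Vt[0])
    return wx, wy, s[0]

def explained_variance_fraction(X, w):
    """Fraction of the total variance of X captured by the variate X @ w."""
    X = X - X.mean(axis=0)
    C = X.T @ X / (X.shape[0] - 1)
    return np.sum((C @ w) ** 2) / (w @ C @ w) / np.trace(C)

rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal(n)  # weak signal shared by both fields

# Column 0 of each field is strong but independent of the other field.
X = np.column_stack([10 * rng.standard_normal(n),
                     z + 0.3 * rng.standard_normal(n),
                     rng.standard_normal(n)])
Y = np.column_stack([10 * rng.standard_normal(n),
                     z + 0.3 * rng.standard_normal(n)])

wx, wy, rho = first_cca_mode(X, Y)
frac_x = explained_variance_fraction(X, wx)
frac_y = explained_variance_fraction(Y, wy)
print(f"canonical correlation: {rho:.3f}")
print(f"explained variance fraction (X, Y): {frac_x:.4f}, {frac_y:.4f}")
```

In this toy setup the leading mode is strongly correlated across the two fields, yet it captures only a small share of each field's variance, because the shared signal lives in low-variance columns.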

nicrie commented 7 months ago

I just pushed a patch that should fix the incorrect computation of explained variance. Updating to the newest version should resolve the problem for you. Please let me know if it worked.

Also, from my limited experience with CCA, I can say that increasing the regularization (either by reducing the variance_fraction or by increasing the ridge coefficient c) can help increase the explained variance.
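
The effect of the ridge coefficient can be sketched on synthetic data as follows. This is an assumption-laden toy, not the xeofs code path: the helper `ridge_cca_explained_variance` is hypothetical, and it assumes the ridge convention `(1 - c) * C + c * I` for the regularized covariance, which may differ from what xeofs does internally. The data contain a noisy shared signal in a high-variance column and a clean shared signal in a low-variance column.

```python
import numpy as np

def ridge_cca_explained_variance(X, Y, c):
    """Explained variance fraction of X for the leading ridge-CCA mode."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1)
    Cyy = Y.T @ Y / (n - 1)
    Cxy = X.T @ Y / (n - 1)
    # Assumed ridge convention: shrink each covariance towards the identity.
    Rx = (1 - c) * Cxx + c * np.eye(Cxx.shape[0])
    Ry = (1 - c) * Cyy + c * np.eye(Cyy.shape[0])
    Lx = np.linalg.cholesky(Rx)
    Ly = np.linalg.cholesky(Ry)
    K = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, _, _ = np.linalg.svd(K)
    wx = np.linalg.solve(Lx.T, U[:, 0])
    # Variance of X captured by regressing it on the variate X @ wx.
    return np.sum((Cxx @ wx) ** 2) / (wx @ Cxx @ wx) / np.trace(Cxx)

rng = np.random.default_rng(1)
n = 2000
z1 = rng.standard_normal(n)  # shared signal in the high-variance column
z2 = rng.standard_normal(n)  # shared signal in the low-variance column

X = np.column_stack([3 * z1 + 3 * rng.standard_normal(n),
                     0.5 * z2 + 0.05 * rng.standard_normal(n)])
Y = np.column_stack([3 * z1 + 3 * rng.standard_normal(n),
                     0.5 * z2 + 0.05 * rng.standard_normal(n)])

frac_no_ridge = ridge_cca_explained_variance(X, Y, c=0.0)
frac_ridge = ridge_cca_explained_variance(X, Y, c=0.5)
print(f"c=0.0: {frac_no_ridge:.3f}, c=0.5: {frac_ridge:.3f}")
```

Without regularization the leading mode follows the clean but low-variance signal; with `c=0.5` it shifts towards the high-variance shared signal, so the explained variance rises. How strongly this carries over to real data depends on the fields involved.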

singledoggy commented 7 months ago

Thank you for your prompt response. I appreciate the additional information you provided. It is important to note that the use of CCA does not necessarily guarantee a high level of explained variance. In fact, some articles do not even report the explained variance.

The explained variance in the current version is much more reasonable compared to the previous one.

[<xarray.DataArray (mode: 2)>
 array([0.21777254, 0.04709501])
 Coordinates:
   * mode     (mode) int64 1 2
 Attributes:
     long_name:     Monthly Means of Sea Surface Temperature
     units:         degC
     var_desc:      Sea Surface Temperature
     level_desc:    Surface
     statistic:     Mean
     dataset:       NOAA Extended Reconstructed SST V5
     parent_stat:   Individual Values
     actual_range:  [-1.8     42.32636]
     valid_range:   [-1.8 45. ],
 <xarray.DataArray (mode: 2)>
 array([0.10384682, 0.09979495])
 Coordinates:
   * mode     (mode) int64 1 2
 Attributes:
     long_name:     Monthly Means of Sea Surface Temperature
     units:         degC
     var_desc:      Sea Surface Temperature
     level_desc:    Surface
     statistic:     Mean
     dataset:       NOAA Extended Reconstructed SST V5
     parent_stat:   Individual Values
     actual_range:  [-1.8     42.32636]
     valid_range:   [-1.8 45. ],
 <xarray.DataArray (mode: 2)>
 array([0.15104585, 0.06363411])
 Coordinates:
   * mode     (mode) int64 1 2
 Attributes:
     long_name:     Monthly Means of Sea Surface Temperature
     units:         degC
     var_desc:      Sea Surface Temperature
     level_desc:    Surface
     statistic:     Mean
     dataset:       NOAA Extended Reconstructed SST V5
     parent_stat:   Individual Values
     actual_range:  [-1.8     42.32636]
     valid_range:   [-1.8 45. ]]

nicrie commented 7 months ago

Glad it helped - closing this.