xarray-contrib / xeofs

Comprehensive EOF analysis in Python with xarray: A versatile, multidimensional, and scalable tool for advanced climate data analysis
https://xeofs.readthedocs.io/
MIT License

Is the reconstruction of unseen data possible using EOF? #156

Closed agirnow closed 8 months ago

agirnow commented 8 months ago

Hi, and thanks again for this great package!

This may be a naive question, but I can't quite work it out:

As far as I understand, the transform method followed by inverse_transform should be able to recover the initial dataset, regardless of what the EOF components are.

However, with this snippet of code it appears that this is not the case:

import xarray as xr
import pandas as pd
import numpy as np
import xeofs

lat = np.arange(-10, 11, 1) 
lon = np.arange(-20, 21, 1) 
time = pd.date_range("2020-01-01", periods=30, freq="D")
time2 = pd.date_range("2020-02-01", periods=30, freq="D")

# Create random data
data_y = np.random.rand(len(lat), len(lon), len(time))
data_y2 = np.random.rand(len(lat), len(lon), len(time2))

# Create xarray datasets
y = xr.DataArray(data_y, coords=[("latitude", lat), ("longitude", lon), ("time", time)])
y2 = xr.DataArray(data_y2, coords=[("latitude", lat), ("longitude", lon), ("time", time2)])

# Fit EOF on y, then transform and reconstruct y2
eof = xeofs.models.EOF(30)
eof.fit(y - y.mean("time"), dim="time")
scores = eof.transform(y2 - y2.mean("time"), normalized=False)
print(np.abs(eof.inverse_transform(scores, normalized=False) - (y2 - y2.mean("time"))).mean())

The final line computes the mean absolute difference between the initial dataset and the reconstruction, so I would expect the result to be close to 0, especially since I use a large number of modes, but it is not. Can anyone help? Sorry if this comes from a misunderstanding on my side. Thanks a lot, Alexandre

nicrie commented 8 months ago

Hey @agirnow, I must admit I'm a bit unsure about this. At first, I wanted to agree with you that it should be possible, but then I quickly compared it again with sklearn's implementation and noticed that it also can't reconstruct unseen data. In fact, the reconstructions from xeofs and sklearn are pretty much in agreement. So, for the moment, I'll pass but will let you know if I gain more insights.
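The sklearn comparison described above can be sketched in a few lines (the shapes below are hypothetical, chosen to mirror the 30 time steps and 21 x 41 grid of the original example): a PCA fit with far fewer components than features cannot exactly reconstruct unseen random data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.random((30, 861))  # 30 samples, 21 x 41 grid points flattened
X_new = rng.random((30, 861))    # "unseen" data

# Fit on the training sample, then project the unseen data and invert
pca = PCA(n_components=30).fit(X_train)
X_recon = pca.inverse_transform(pca.transform(X_new))

# Only 30 basis vectors cannot span an 861-dimensional feature space,
# so the reconstruction error is clearly non-zero
err = np.abs(X_recon - X_new).mean()
print(err)
```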

slevang commented 8 months ago

I wonder if this is just because you're using purely random data and the SVD solve is poor. Or something to do with the dimensionality of the arrays you've chosen. I have code doing exactly this with real data and it works well. Using one of the xarray tutorial datasets:

import numpy as np
import xarray as xr
import xeofs

da = xr.tutorial.open_dataset("air_temperature").air
y = da.isel(time=slice(None, -10))
y2 = da.isel(time=slice(-10, None))

eof = xeofs.models.EOF(1325)
eof.fit(y, dim="time")
y2_recon = eof.inverse_transform(eof.transform(y2))
np.abs(y2_recon - y2).mean().item()

7.694239375612014e-15
slevang commented 8 months ago

As an aside, it would be nice to have an option like n_modes="all", which figures out the rank of the data early on and sets n_modes = rank. When I want to run experiments like this, I usually first try to fit the model with a million modes, read the rank off the error message, and plug it in.
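A rough sketch of that idea: the rank can be bounded (and checked) up front with NumPy before choosing n_modes. The shapes below mirror the original example (30 time steps, 21 x 41 grid); the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# 30 samples (time steps) by 861 features (21 lat x 41 lon, flattened)
X = rng.random((30, 861))

# The rank is at most min(n_samples, n_features); for random data
# it attains that bound, so this is the most modes one can ask for
rank = np.linalg.matrix_rank(X)
print(rank)  # 30: limited by the smaller dimension
```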

Looks like sklearn.PCA does this as a default: n_components = None

slevang commented 8 months ago

Experimented a little more: it's just the dimensionality of the dataset you're using. You need n_samples >= n_features so that the rank of the SVD is at least n_features, which is what a perfect reconstruction requires. So if you increase the time dimension (and n_modes) to 861 in your example above, you get the expected result.
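The rank argument above can be demonstrated with a NumPy-only sketch (small hypothetical shapes): when n_samples >= n_features, the right singular vectors span the full feature space, so projecting unseen data onto them and back is lossless; with too few samples the basis is rank-deficient and the error is large.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 20
X_new = rng.random((5, n_features))  # unseen data to reconstruct

# Case 1: enough samples -> full-rank basis -> exact reconstruction
X = rng.random((100, n_features))
_, _, Vt = np.linalg.svd(X, full_matrices=False)  # Vt is (20, 20)
err_full = np.abs((X_new @ Vt.T) @ Vt - X_new).mean()
print(err_full)  # machine precision

# Case 2: too few samples -> only 10 modes -> large error
X_small = rng.random((10, n_features))
_, _, Vt10 = np.linalg.svd(X_small, full_matrices=False)  # Vt10 is (10, 20)
err_trunc = np.abs((X_new @ Vt10.T) @ Vt10 - X_new).mean()
print(err_trunc)  # clearly non-zero
```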

agirnow commented 8 months ago

Thanks everyone! Indeed, I had not considered that for a non-square matrix the decomposition is limited by the rank of the matrix, which is not sufficient to capture all the variability (especially for random values). It is much clearer now.