Closed agirnow closed 8 months ago
Hey @agirnow, I must admit I'm a bit unsure about this. At first, I wanted to agree with you that it should be possible, but then I quickly compared it again with sklearn
's implementation and noticed that it also can't reconstruct unseen data. In fact, the reconstructions from xeofs
and sklearn
are pretty much in agreement. So, for the moment, I'll pass but will let you know if I gain more insights.
I wonder if this is just because you're using purely random data and the SVD solve is poor. Or something to do with the dimensionality of the arrays you've chosen. I have code doing exactly this with real data and it works well. Using one of the xarray tutorial datasets:
import xarray as xr
import xeofs
da = xr.tutorial.open_dataset("air_temperature").air
y = da.isel(time=slice(None, -10))
y2 = da.isel(time=slice(-10, None))
eof = xeofs.models.EOF(1325)
eof.fit(y, dim="time")
y2_recon = eof.inverse_transform(eof.transform(y2))
np.abs(y2_recon - y2).mean().item()
7.694239375612014e-15
As an aside it would be nice to have an option like n_modes="all"
, which figures out the rank of the data early on and sets n_modes = rank
. When I want to do experiments like this I usually just first try to fit the model with a million modes and then get the rank from the error message and plug it in.
Looks like sklearn.PCA does this as a default: n_components = None
Experimented a little more, it's just the dimensionality of the dataset you're using. You need n_samples>=n_features
such that the rank of the SVD is at least n_features
to get a perfect reconstruction. So if you increase the time dimension (and n_modes
) to 861 in your example above, you get the expected result.
Thanks everyone ! Indeed, I did not think about the fact that for non-squared matrix, decomposition is limited by the rank of the matrix, which is not sufficient to explain all the variability (especially for random values). It is more clear now.
Hi, and thanks again for this great package !
This may be an idiot question, but I do not manage to consider this properly:
As far as I understand, transform method followed by inverse_transform should be able to retrieve the initial dataset whatever the components of the EOF are.
However, using this snippet of code it apperars that it is not true:
The final line computes the average between the initial dataset and the reconstruction, so I would expect the result to be close to 0, especially because i consider a large number of modes, but it is not. Can anyone help? Sorry if this comes from a misunderstanding from my side. Thanks a lot, Alexandre