xarray-contrib / xeofs

Comprehensive EOF analysis in Python with xarray: A versatile, multidimensional, and scalable tool for advanced climate data analysis
https://xeofs.readthedocs.io/
MIT License

`scores()` don't match `transform()` with dask data #120

Closed · slevang closed this issue 1 year ago

slevang commented 1 year ago

Describe the bug

I'm finding that scores() and transform() don't match very well when passing dask data. I've been able to isolate it to using dask's svd_compressed solver. Things look good if I force dask data through the sklearn randomized_svd pathway.

Reproducible Minimal Working Example

import matplotlib.pyplot as plt
import xarray as xr

from xeofs.models import EOF

X = xr.tutorial.open_dataset("air_temperature").air

# This causes scores to not match transform
X = X.chunk()

model = EOF(n_modes=5, compute=True, check_nans=True, random_state=42)
model.fit(X, dim="time")

scores = model.scores()
transformed = model.transform(X)

xr.concat([scores, transformed, scores - transformed], dim="method").assign_coords(
    method=["scores", "transformed", "delta"]
).plot(col="method")

[Figure: faceted plot of "scores", "transformed", and "delta" showing a clear mismatch between scores and transform with chunked data]

Expected behavior

Without X.chunk(), we route through the sklearn solver and get:

[Figure: the same faceted plot without chunking; scores and transform agree and the delta panel is negligible]

The difference is more pronounced on the higher-order modes in some examples I've looked at; this one is just an easy case.
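For what it's worth, the discrepancy is reproducible outside xeofs with the two underlying solvers at their defaults. A minimal sketch on synthetic data (not the tutorial dataset):

import dask.array as da
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Synthetic data standing in for the real input
rng = np.random.default_rng(42)
A = rng.standard_normal((2000, 500))

# sklearn's randomized solver (n_iter defaults to 'auto', i.e. several power iterations)
u_sk, s_sk, vt_sk = randomized_svd(A, n_components=5, random_state=42)

# dask's compressed solver (n_power_iter defaults to 0)
u_dk, s_dk, vt_dk = da.linalg.svd_compressed(
    da.from_array(A, chunks=(500, 500)), k=5, seed=42
)

# The leading singular values already differ between the two solvers
print(np.abs(s_sk - s_dk.compute()))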

@nicrie any ideas?


nicrie commented 1 year ago

Good catch @slevang! It's not just the scores that are off; the entire decomposition appears to be inaccurate. One thing that matches between Dask's svd_compressed and Scikit-learn's randomized_svd is the total variance. However, the explained variance per mode already differs between them, indicating that it's distributed differently. The same applies to the eigenvectors, which ultimately explains the score differences.
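To see this directly, here's a sketch reusing the MWE above (I'm assuming the explained_variance() accessor here):

# Fit two models as in the MWE, one on eager data and one on chunked data
model_eager = EOF(n_modes=5, random_state=42)
model_eager.fit(X, dim="time")

model_dask = EOF(n_modes=5, random_state=42)
model_dask.fit(X.chunk(), dim="time")

# Total variance agrees, but the per-mode explained variance differs
print(model_eager.explained_variance().values)
print(model_dask.explained_variance().values)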

I noticed that when we increase the n_power_iter parameter (which defaults to 0), the differences disappear. According to the documentation:

n_power_iter: int, default=0

Number of power iterations, useful when the singular values decay slowly. Error decreases exponentially as n_power_iter increases. In practice, set n_power_iter <= 4.

With a value of 4, the differences virtually vanish. But even a lower value like 2 already reduces the differences to the order of $10^{-5}$, which is likely acceptable for many applications.
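Here's a standalone sketch (not xeofs code) showing the error decay on a synthetic matrix with a slowly decaying spectrum:

import dask.array as da
import numpy as np

# Build a low-rank test matrix with slowly decaying singular values
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((2000, 50)))
V0, _ = np.linalg.qr(rng.standard_normal((500, 50)))
s0 = 1.0 / np.arange(1, 51)  # slow, polynomial decay
A_np = (U0 * s0) @ V0.T
A = da.from_array(A_np, chunks=(500, 500))

# Relative Frobenius error of the rank-5 compressed SVD vs. n_power_iter
for q in [0, 1, 2, 4]:
    u, s, vt = da.linalg.svd_compressed(A, k=5, n_power_iter=q, seed=42)
    recon = (u @ da.diag(s) @ vt).compute()
    err = np.linalg.norm(A_np - recon) / np.linalg.norm(A_np)
    print(f"n_power_iter={q}: relative error {err:.2e}")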

In the original paper by Halko et al. (paragraph 1.6), they even suggest using a higher value than 0:

Unfortunately, the simplest version of this scheme is inadequate in many applications because the singular spectrum of the input matrix may decay slowly. To address this difficulty, we incorporate q steps of a power iteration, where q = 1 or q = 2 usually suffices in practice.
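For context, their scheme boils down to roughly this numpy sketch (a production implementation would re-orthonormalize between the power steps for numerical stability):

import numpy as np

def randomized_svd_sketch(A, k, q=2, p=10, rng=None):
    # Minimal randomized SVD with q power iterations, after Halko et al.
    rng = np.random.default_rng(rng)
    m, n = A.shape
    # Random test matrix with p columns of oversampling
    Omega = rng.standard_normal((n, k + p))
    Y = A @ Omega
    # q power-iteration steps sharpen the range estimate when the
    # singular spectrum decays slowly
    for _ in range(q):
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    # Project onto the low-dimensional subspace and take a dense SVD there
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]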

Since it's challenging to estimate the error order before computing the singular spectrum, we might consider using a higher default value for n_power_iter. However, this would come at the cost of increased computation time. I ran a quick experiment using your example to see how increasing the number of power iterations affects computation time (see the figure below). While I'm not sure how generalizable these results are, in this example, 2 power iterations increased the computation time by roughly 60%.

[Figure: computation time as a function of the number of power iterations]

I'll also explore whether equation 1.10 in the paper can provide us with insights that we can relay to users, suggesting an increase in n_power_iter if the potential error appears to be too large. What are your thoughts on this approach?

nicrie commented 1 year ago

Calculating the reconstruction error $\Delta$ using equation (1.10) seems straightforward, but just a couple of considerations to keep in mind:

Number of Modes: To compute $\Delta$ for a rank-$k$ matrix approximation, we need to include the singular value $k+1$. So, for example, if you want to estimate the error using 5 modes, you'll actually need to compute 6 modes. To address this, I think we have two options: either refactor the code to always compute $k+1$ modes instead of $k$, or compute and report the error for $k-1$ modes.

When to Issue a Warning: To obtain a normalized value for $\Delta$ that can be used for any input matrix, independent of the data's scale, I think we can normalize $\Delta$ by the Frobenius norm of the input matrix. This can be calculated as $\sqrt{\kappa (m-1)}$, where $m$ is the number of samples and $\kappa$ is the total variance (which is already computed and stored in self.data["total_variance"]). The key question here is when to issue a warning.
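A rough sketch of that normalization in code, where sigma_next stands for the $(k+1)$-th singular value and I'm omitting the prefactor from eq. (1.10):

import numpy as np

def normalized_error(sigma_next, total_variance, n_samples):
    # Frobenius norm of the input matrix: ||X||_F = sqrt(kappa * (m - 1)),
    # with kappa the total variance and m the number of samples
    frob_norm = np.sqrt(total_variance * (n_samples - 1))
    # Hypothetical estimate: Delta driven by the (k+1)-th singular value
    return sigma_next / frob_norm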

Taking your previous example, we can calculate Frobenius-normalized reconstruction errors for different values of n_power_iter :

n_power_iter       0      1      2      3      4
$\Delta_{norm}$    9.81   0.50   0.27   0.21   0.18

I also conducted similar analyses with different data sets and various normalizations (e.g., standardize=True and use_coslat=True), and the results appear consistent. Generally, it seems that $\Delta_{norm} < 0.5$ provides reasonably accurate results. However, I'd like to establish a less heuristic criterion for issuing warnings.

slevang commented 1 year ago

Very nice, thanks for the quick detective work! I hadn't thought through the math very closely and mistakenly assumed these quantities should be identical regardless of what comes out of the decomposition, but that's obviously wrong. The dask algorithm we're using (svd_compressed) is of course an approximation, so this all makes sense.

From a user perspective, if the compute time scales less than linearly with increasing n_power_iter, then I think a more conservative default would make sense (at least 1 or 2, as suggested in the paper). Climate data will often match the "slowly decaying modes" description. I would happily take 2x compute time to be guaranteed a good answer, and could always iterate downwards on this parameter if desired.

An ability to warn the user about an inaccurate decomposition, along with a suggestion for how to fix it, is even better. No strong opinions on your points above; either computing an extra mode or computing the error on $k-1$ modes seems fine. These types of tolerance definitions are always somewhat heuristic.

One other note: to maintain the fully lazy execution mode, these types of checks would have to go in the new _post_compute() section I added.
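Something like this, presumably called from _post_compute() once results are materialized (the 0.5 threshold is just the heuristic from above, and the names are mine, not the actual implementation):

import warnings

def warn_if_inaccurate(delta_norm, threshold=0.5):
    # delta_norm: Frobenius-normalized reconstruction error (see above)
    if delta_norm > threshold:
        warnings.warn(
            f"Normalized reconstruction error ({delta_norm:.2f}) exceeds "
            f"{threshold}; consider increasing n_power_iter for a more "
            "accurate decomposition."
        )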

nicrie commented 1 year ago

I was curious about the default number of power iterations used by Scikit-learn's randomized_svd, and it turns out they use either 4 or 7. To maintain consistency, I propose setting our default to 4 as well, and leaving it to users to lower the number, trading accuracy for reduced computation time.

From sklearn's documentation:

n_iter int or ‘auto’, default=’auto’

Number of power iterations. It can be used to deal with very noisy problems. When ‘auto’, it is set to 4, unless n_components is small (< .1 * min(X.shape)) in which case n_iter is set to 7. This improves precision with few components. Note that in general users should rather increase n_oversamples before increasing n_iter as the principle of the randomized method is to avoid usage of these more costly power iterations steps. When n_components is equal or greater to the effective matrix rank and the spectrum does not present a slow decay, n_iter=0 or 1 should even work fine in theory (see [1] page 9).
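In code, that 'auto' rule amounts to:

def sklearn_auto_n_iter(n_components, shape):
    # sklearn's heuristic: 7 power iterations when few components are
    # requested relative to the matrix size, otherwise 4
    return 7 if n_components < 0.1 * min(shape) else 4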