wiskott-lab / sklearn-sfa

This project provides Slow Feature Analysis as a scikit-learn-style package.

Time-lagged slow feature analysis #5

Open sbhakat opened 1 year ago

sbhakat commented 1 year ago

I am trying to implement a demo version of time-lagged slow feature analysis:

import numpy as np

def slow_feature_analysis(X, tau, n_components):
    # Compute the lag-tau difference matrix (time runs along axis 1)
    D = X[:, tau:] - X[:, :-tau]

    # Perform SVD on the difference matrix
    U, S, Vt = np.linalg.svd(D, full_matrices=False)

    # Return the first n_components columns of U
    return U[:, :n_components]

# Example usage:
X = np.random.randn(1000, 100)  # random input data (time along axis 1, per the slicing above)
S = slow_feature_analysis(X, tau=10, n_components=3) 

Two questions:

  1. Am I thinking correctly?
  2. Is there any way we can implement this within sklearn-sfa?

Any help will be highly appreciated!

Stewori commented 1 year ago

May I ask how "time-lagged" SFA is specified? Is there a paper or chapter about it? What do you hope to achieve by "time-lagged" SFA? Your term D = X[:, tau:] - X[:, :-tau] looks much like SFA's derivative estimate. There we use (x(t) - x(t-1)) as an estimate for the derivative x'. That is, assuming the signal is sampled at some fixed time resolution, this is the closest one can trivially get to a differential coefficient. Since the time scale is arbitrary, the time delta is in a sense normalized to 1, i.e. x(t-1) really stands for the closest-in-time available data before t. Therefore, I do not see what can be gained by using tau instead of 1. For tau < 1 no data would be available unless one interpolates, but your approach does not look like that. For tau > 1 the derivative approximation would be unnecessarily compromised.

That said, it is interesting that you apply SVD to the data matrix (rather than to the covariance). I haven't seen such an approach to SFA before; however, I am not sure it is beneficial. The SVD cost is dominated by the smaller dimension, which in your case is the data dimension, but the length of the training phase still enters as a linear cost factor AFAIK (typically length >> data dim). That is of course similar to computing the covariance matrix in usual SFA, so it should not differ much in cost. Your variant would probably not permit processing the data in chunks. Also, I do not see how you accomplish sphering of the input signal.
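
For reference, here is a minimal sketch of the covariance-based route with explicit sphering that I am referring to (a generic illustration, not the actual sklearn-sfa code):

import numpy as np

def linear_sfa(X, n_components):
    """Minimal covariance-based linear SFA; X has shape (n_samples, n_features), time along axis 0."""
    # Center and sphere (whiten) the input so its covariance becomes the identity
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Z = Xc @ eigvec / np.sqrt(eigval)      # sphered signal

    # Covariance of the one-step differences, i.e. the derivative estimate x(t) - x(t-1)
    dZ = np.diff(Z, axis=0)
    cov_dot = np.cov(dZ, rowvar=False)

    # Slow features correspond to the smallest eigenvalues (eigh sorts ascending)
    _, V = np.linalg.eigh(cov_dot)
    return Z @ V[:, :n_components]

X = np.random.randn(1000, 10)              # random data, just to show the call
slow = linear_sfa(X, n_components=3)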

sbhakat commented 1 year ago

Thanks @Stewori. I am not an SFA expert, but here is my thinking:

  1. We have time-series data from an MD simulation. A lag time is often used when processing time-series data from MD simulations in order to show that an observable is Markovian. Hence I am interested in having a keyword tau. It should default to 1, but the user should be able to modify it. Something like: https://github.com/msmbuilder/msmbuilder/blob/master/msmbuilder/decomposition/tica.py (see lag time). More on lag time here: http://docs.markovmodel.org/lecture_tica.html

  2. I had no better way to show what I mean by lag time, hence I used SVD. In the best-case scenario I would have used the sklearn SFA implementation.

Stewori commented 1 year ago

C(tau) in the linked article is an autocorrelation matrix rather than the covariance matrix of the derivative, which SFA uses. I.e. <x(t) x(t-1)^T> (TICA) rather than <(x(t)-x(t-1)) (x(t)-x(t-1))^T> (SFA). Yes, there is the relation (x(t)-x(t-1)) (x(t)-x(t-1))^T = x(t)x(t)^T - x(t)x(t-1)^T - x(t-1)x(t)^T + x(t-1)x(t-1)^T, so autocorrelation matrices do in a way participate in SFA. However, I doubt that relation can be exploited for implementing TICA on top of SFA. Even if it could, that would be a cumbersome TICA implementation, probably with hampered efficiency. Doesn't a TICA node for scikit-learn already exist? Otherwise I would recommend leveraging a proper TICA implementation and using a wrapper node if integration with scikit-learn is strictly required. Or you could just check out SFA instead of TICA; it is not so bad for dimensionality reduction as it stands :)
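
For concreteness, a small numpy sketch (toy data, not tied to either library) of the two matrices, together with a check of the expansion quoted above:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 4))
for t in range(1, len(X)):
    X[t] += 0.8 * X[t - 1]                  # add some temporal structure

A, B = X[1:], X[:-1]                        # x(t) and x(t-1), aligned in time
n = len(A)

C_tica = A.T @ B / n                        # <x(t) x(t-1)^T>                 (TICA, lag 1)
C_sfa  = (A - B).T @ (A - B) / n            # <(x(t)-x(t-1)) (x(t)-x(t-1))^T> (SFA)

# The expansion above holds exactly over the matched sample range:
rhs = A.T @ A / n - C_tica - C_tica.T + B.T @ B / n
print(np.allclose(C_sfa, rhs))              # True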

MoritzLange commented 1 year ago

Hey sbhakat, (and hey Stewori - since I already typed most of this answer, I'm gonna send it now anyway. It overlaps with parts of your answer you sent just now :) )

if I see this correctly, tICA is similar to PCA, except that it calculates the covariance matrix between features and their time-lagged versions. So instead of finding decorrelated dimensions with the largest variance (PCA), it finds decorrelated dimensions with the largest covariance between the original and time-lagged versions of the feature combination. In other words, it finds orthogonal feature combinations that have the largest autocorrelation for a specified lag.

This idea is somewhat unrelated to SFA, which simply tries to find feature combinations that vary the most slowly. It does this by "performing PCA" on temporal differences but looking for the smallest rather than the largest variance. SFA in itself has no concept of autocorrelation, but imagine a feature that changes periodically and very slowly, and a feature that changes periodically and quickly. The autocorrelation of the slow feature will be large for big time lags ($\tau$), and the autocorrelation of the quick feature will be large only for smaller $\tau$, right? In this sense you could perhaps think of SFA as a method to find feature combinations whose autocorrelation stays large out to large $\tau$, rather than feature combinations with large autocorrelation at one fixed lag, as tICA does.
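
To illustrate that intuition with numbers (a toy check, using a smoothed and an unsmoothed noise signal instead of sinusoids, so the autocorrelations decay monotonically):

import numpy as np

rng = np.random.default_rng(0)
noise = rng.standard_normal(10_000)
slow_feat = np.convolve(noise, np.ones(200) / 200, mode="same")   # heavily smoothed -> slow
fast_feat = rng.standard_normal(10_000)                           # white noise -> fast

def autocorr(x, lag):
    x = (x - x.mean()) / x.std()
    return float(np.mean(x[lag:] * x[:-lag]))

for lag in (1, 10, 100):
    print(lag, round(autocorr(slow_feat, lag), 3), round(autocorr(fast_feat, lag), 3))

# The slow (smoothed) feature stays highly autocorrelated out to large lags,
# while the fast (white-noise) feature's autocorrelation is near zero already at lag 1.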

Introducing $\tau$ to SFA therefore does not make much sense, I think. It would make patterns in the data with wavelength $\tau$ or whole fractions thereof invisible to SFA (since the difference $x(t) - x(t+\tau)$ will be constant, so there is apparently no change in the data). So in effect SFA would act as if these patterns were not there. What I think would happen is that you effectively remove the autocorrelation at lag $\tau$ from the data and then again search for the feature combinations that change the slowest. These will likely be the same, except for potentially minor changes in the slowness of feature combinations that used to contain patterns which have now become invisible.
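
The "invisibility" point is easy to check numerically. A toy sketch (plain numpy, not part of the package), using a sinusoid whose period equals the lag:

import numpy as np

tau = 50
t = np.arange(2000)
x = np.sin(2 * np.pi * t / tau)        # pattern whose period equals the lag tau

diff_1   = x[1:]   - x[:-1]            # standard SFA difference
diff_tau = x[tau:] - x[:-tau]          # lag-tau difference

print(np.abs(diff_1).max())            # ~0.13: the signal clearly changes step to step
print(np.abs(diff_tau).max())          # ~1e-13: the lag-tau difference is numerically zero,
                                       # so a lag-tau SFA could not see this pattern at all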

This is just my personal understanding however.

Regardless, if you want to try using SFA with a lag, this is the line where the difference is calculated for the standard SFA.fit() function. You can clone the repository and modify the code. Simply use tau instead of 1 there.
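
Not knowing the exact line in question, the change would look roughly like this generic sketch (lagged_difference is a hypothetical helper, not part of sklearn-sfa):

import numpy as np

def lagged_difference(X, tau=1):
    """Generalized derivative estimate x(t) - x(t - tau); tau=1 recovers the standard SFA difference."""
    return X[tau:] - X[:-tau]

X = np.random.randn(1000, 5)                          # time along axis 0
assert np.allclose(lagged_difference(X, 1), np.diff(X, axis=0))
X_diff = lagged_difference(X, tau=10)                 # feed this into the usual covariance/eigen step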

Stewori commented 1 year ago

Having given it some more thought, it occurs to me that SFA (discrete SFA that approximates $\dot{z}(t) \approx z(t)-z(t-1)$) actually is (approximately) equivalent to tICA with tau=1. Let me explain. Set $Z_{ij} = \langle z(t-i) z^T(t-j)\rangle$. Because of sphering, one has in SFA $Z_{00} = I_n$ (the n-dimensional identity matrix). If the training sample is large enough, one has approximately $Z_{11} \approx I_n$ as well. Let $A_m$ denote a partial orthogonal extraction matrix (n x m) for extracting m < n components. $A_m$ consists of m orthogonal columns and extracts the signal $A_m^T z$. Because of sphering one has $A_m^T Z_{00} A_m = I_m$ and as noted above $A_m^T Z_{11} A_m \approx I_m$.

SFA wants to find $A_m$ that preserves the $m$ smallest eigenvalues/eigenspaces of the matrix

$\langle A_m^T (z(t)-z(t-1)) (z(t)-z(t-1))^T A_m \rangle$

$= A_m^T Z_{00} A_m - A_m^T Z_{01} A_m - A_m^T Z_{10} A_m + A_m^T Z_{11} A_m$

$\approx 2 I_m - A_m^T (Z_{01} + Z_{10}) A_m$

Since $2 I_m$ is constant and thus not relevant for optimization, choosing $A_m$ to minimize $\langle A_m^T (z(t)-z(t-1)) (z(t)-z(t-1))^T A_m \rangle$ is approximately the same as choosing $A_m$ to maximize $A_m^T (Z_{01} + Z_{10}) A_m$. Therein "approximately" merely refers to $Z_{11}$ not being exactly the identity given sphering over a finite sample.

If I am not mistaken, the former is the SFA objective and the latter is the tICA objective.

If that's correct, then yes, you should be able to use SFA instead of tICA and vice versa (for tau=1). I am not the maintainer of this node, but I recommend making tau a proper parameter. That is a trivial adjustment and would permit advertising the node additionally as a tICA node.
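
If the derivation above holds, it should also show up numerically: on a sphered signal, the eigenvectors of the SFA difference covariance and those of the symmetrized lag-1 correlation matrix should agree up to sign, with the eigenvalue order reversed. A toy check (plain numpy, not part of the package):

import numpy as np

rng = np.random.default_rng(0)

# Toy signal: four sources with different timescales, linearly mixed
n, d = 20_000, 4
raw = rng.standard_normal((n, d))
for j, w in enumerate((5, 20, 80, 320)):              # wider smoothing -> slower source
    raw[:, j] = np.convolve(raw[:, j], np.ones(w) / w, mode="same")
X = raw @ rng.standard_normal((d, d))                 # arbitrary mixing

# Sphere the mixed signal
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ evecs / np.sqrt(evals)

# SFA: eigenvectors of the difference covariance, slowest = smallest eigenvalue
dZ = np.diff(Z, axis=0)
_, V_sfa = np.linalg.eigh(dZ.T @ dZ / len(dZ))

# tICA (tau=1): eigenvectors of the symmetrized lag-1 correlation, slowest = largest eigenvalue
C1 = Z[1:].T @ Z[:-1] / (len(Z) - 1)
_, V_tica = np.linalg.eigh(0.5 * (C1 + C1.T))

# The two bases should agree up to sign, with the order reversed (eigh sorts ascending)
overlap = np.abs(V_sfa.T @ V_tica[:, ::-1])
print(np.round(overlap, 2))                           # approximately the identity matrix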

sbhakat commented 1 year ago

Hi @Stewori, you are right. According to the paper http://redwood.psych.cornell.edu/discussion/papers/2006_SFA_ICA_blaschke_berkes_wiskott.pdf, with lag=1, TICA=SFA.

That is not true (at least that is what I understood from the paper) if lag > 1. Hence I wanted to use the SFA module to test how the choice of lag affects the SFA variables. That would also make the module compatible with tICA and usable by a bigger audience working with molecular simulations.

[attached figure: tic-sfa]