wrathematics / coop

Fast covariance, correlation, and cosine similarity.
Other
35 stars 6 forks source link

Computing a "band" of a correlation matrix #5

Open PeteHaitch opened 8 years ago

PeteHaitch commented 8 years ago

Great package, Drew. I'm already benefiting from it, thank you.

I have a question/feature request that I hope might be of interest to you. Suppose I have a "wide" matrix, X, with r rows and c columns where c > r. Computing cor(X) or coop::pcor(X) gives me a square matrix, C, with c rows and columns. C is typically large since it contains c/r-times more elements than X. But I'm only interested in a "band" of C around the diagonal, bC, say, the main diagonal and k off-diagonals. My intuition could well be wrong, but it feels like there ought to be shortcuts to computing bC instead of C (and there will certainly be storage savings by only computing/returning bC). Do you think such banded computations are:

Thanks

wrathematics commented 8 years ago

Hi Peter, thanks for the kind words!

The short version is that in some cases it would be faster, particularly for the pairwise-complete NA handling. Some cases would be slower (even assuming k < min(r, c)). But I think it's a great idea, and something I have a bit of interest in myself (dimension reduction with c>r matrices, for which this might be useful).

The computation is fairly straightforward. I'm not aware of any packages on CRAN that offer banded storage though (Matrix seems to use sparse storage, which is not ideal). I can easily write one, but if there's already a standard that I just don't know about, I'd rather "plug in" to it. Are you aware of one?

PeteHaitch commented 8 years ago

Yes, the returned object isn't a true banded matrix (which has zeros outside the band) but rather has "not computed" (NA?) outside the band. So I'm unsure the best way to store/represent this.

My current code is something like this:

C <- pcor(X)
lapply(0:k, function(k) subDiag(C, k))

where subDiag() extracts the k-th subdiagonal (k = 0 corresponding to diag()). So I'm returning a list of length k + 1 rather than a matrix. That works well for my case but may not generalise well to others.

wrathematics commented 8 years ago

I've started work on the storage and utilities needed for this here https://github.com/wrathematics/band There are a few more utilities I need to create before I can start the work on the coop side of things, but I don't think it'll take too much effort.

Unfortunately I probably won't get a chance to look at this much over the next few weeks, as I'm supposed to be studying for phd qualifying exams. But this is high on my priority list for this summer.

PeteHaitch commented 8 years ago

Wow, this looks great. I had no idea there was LAPACK support for these types of matrices (well, I know rather little of LAPACK so perhaps this isn't so surprising).

No worries re timelines for this stuff and best wishes for your quals! I really appreciate your interest and help with this.

wrathematics commented 8 years ago

Thanks!

LAPACK and the BLAS have some limited support for symmetric ("packed") and band storages. But they mostly never bothered finishing, I think because it was so much slower for most applications. But particularly for the banded symmetric case with very narrow bands (i.e., a sparse matrix), I think it may win out on time, and will definitely be advantageous on space. It'll be a fun experiment either way!