willwerscheid / flashier

A faster and angrier package for EBMF.
https://willwerscheid.github.io/flashier/
Other
10 stars 12 forks source link

Allow (sparse) Matrix in low rank input #105

Closed stephens999 closed 7 months ago

stephens999 commented 1 year ago

The use case is where X is a sparse matrix (from Matrix package) and we want to apply flashier to XX'.

I hoped this would work: flash(list(u=X,d=rep(1,ncol(X)),v=X))

but it reports an error

Error in must.be.supported.data.type(data, allow.null = FALSE, allow.lowrank = TRUE) :
Invalid argument to data.

This seems to be because it requires u and v to be a matrix (not Matrix).

Will it work to simply relax the check so it allows u, v to be Matrix?

willwerscheid commented 1 year ago

I am not sure whether it would work to just relax the check. I will look. How much do you really need this? If X is n by K and K is small then I do not think you would lose much by coercing to a matrix (as.matrix(X)). If K is large then I suspect you might be better off forming XX'. What is the specific use case?

stephens999 commented 1 year ago

I'm particularly motivated by trying to reduce the computation needed for the examples in Y Liu, P Carbonetto, J Willwerscheid, S Oakes, K Macleod and M Stephens. Dissecting tumor transcriptional heterogeneity from single-cell RNA-seq data by generalized binary covariance decomposition. bioRxiv doi:10.1101/2023.08.15.553436

More generally, it comes up when you want to apply flashier to the gram matrix XX', where X is an nxp sparse data matrix (could be single cell counts or text word counts for example).

Eg consider n = 100k (or larger), p = 10k, and X is sparse. Even forming XX' will be 100k100k10k work And then you have a dense 100k * 100k matrix to store in memory and run flashier on.

My hope is that if X is sparse it should be possible to run flashier on X without forming XX'

willwerscheid commented 1 year ago

Ok, thanks. I think we can do this by: 1. importing the necessary Matrix functions (colSums, rowSums, t, etc) into the namespace rather than accessing them directly (via Matrix::) (this is probably how I should have done it in the first place anyway); 2. relaxing the check in is.udv()

pcarbo commented 1 year ago

@willwerscheid Let me know if I can test your implementation, or if you are stuck.

YushaLiu commented 1 year ago

I tried flash(list(u=as.matrix(X),d=rep(1/ncol(X), ncol(X)),v=as.matrix(X))) if X is a sparse dgCMatrix, and it worked. But in GBCD we also need to decompose XX' - epsilon*I (where I is identity matrix), does this trick still work here?

willwerscheid commented 1 year ago

ok, version 1.0-11 should work now for "udv" objects where u and v are of class Matrix

willwerscheid commented 1 year ago

@YushaLiu i don't think so since XX' - \epsilon I is typically full rank

YushaLiu commented 1 year ago

@willwerscheid so does that mean that we are not able to use this trick to speed up GBCD, where the diagonal term seems unavoidable?

pcarbo commented 1 year ago

I'm looking through your code @willwerscheid. It seems like it shouldn't be too hard to extend your code to allow for UV^T + D, where D is diagonal, and is by default a matrix of zeros?

@YushaLiu I'm wondering if this could be helpful to address one of the reviewer comments about computational complexity, and specifically about handling large (single-cell) data sets?

stephens999 commented 1 year ago

and in general one should be able to do UV' + S where S is sparse?

willwerscheid commented 1 year ago

@pcarbo I would need to write new "nmode" functions, but you are correct that it would not be too hard (I did design the package to be easily extensible to new data structures...). What kind of timeline are we working with?

willwerscheid commented 1 year ago

@stephens999 yes that is smart. strictly speaking we would not need to enforce sparsity in S (even though it would be pretty pointless to have a dense matrix there). what would we prefer the data structure to look like? a list with "u", "v", and "s"? does a precedent for storing this kind of matrix come to mind that we can borrow from?

willwerscheid commented 1 year ago

i am doing a very quick review of the literature and so far i like "U", "V", and "S" (capital letters)

pcarbo commented 1 year ago

@willwerscheid What functions need to be updated for this besides nmode.prod.vec? I'm wondering if the interface would be simpler and more general if you took as input a function (or a list of functions). Would that work? For an example of this, see the svds function in the Rspectra package.

willwerscheid commented 1 year ago

@pcarbo I believe that all of the functions in ops_nmode_products.R would need to be updated, and I would not rule out that one or two functions elsewhere would need a look (if so they should probably be moved in case we want to do this again). i will think about whether it makes sense to make it general (similar to Rspectra) as I do the updates

pcarbo commented 1 year ago

Okay, in the interest of not overcomplicating then, you may prefer to stick with UV^T + S.

YushaLiu commented 1 year ago

I'm looking through your code @willwerscheid. It seems like it shouldn't be too hard to extend your code to allow for UV^T + D, where D is diagonal, and is by default a matrix of zeros?

@YushaLiu I'm wondering if this could be helpful to address one of the reviewer comments about computational complexity, and specifically about handling large (single-cell) data sets?

@pcarbo @willwerscheid Yes, if we can make the trick work for XX' - epsilon*I, it will greatly speed up GBCD for handling large datasets, where the number of cells is much greater than the number of genes. Karl has a dataset with 160k cells that he wants to apply GBCD to, but our current GBCD implementation cannot scale up to data of this size. And as Matthew said, if we can take advantage of the sparsity of X, that could lead to further speed up.

willwerscheid commented 7 months ago

this has now been implemented and is part of the current GBCD workflow