Closed stephens999 closed 7 months ago
I am not sure whether it would work to just relax the check. I will look. How much do you really need this? If X is n by K and K is small then I do not think you would lose much by coercing to a matrix (as.matrix(X)). If K is large then I suspect you might be better off forming XX'. What is the specific use case?
I'm particularly motivated by trying to reduce the computation needed for the examples in Y Liu, P Carbonetto, J Willwerscheid, S Oakes, K Macleod and M Stephens. Dissecting tumor transcriptional heterogeneity from single-cell RNA-seq data by generalized binary covariance decomposition. bioRxiv doi:10.1101/2023.08.15.553436
More generally, it comes up when you want to apply flashier to the gram matrix XX', where X is an nxp sparse data matrix (could be single cell counts or text word counts for example).
Eg consider n = 100k (or larger), p = 10k, and X is sparse. Even forming XX' will be 100k100k10k work And then you have a dense 100k * 100k matrix to store in memory and run flashier on.
My hope is that if X is sparse it should be possible to run flashier on X without forming XX'
Ok, thanks. I think we can do this by: 1. importing the necessary Matrix functions (colSums, rowSums, t, etc) into the namespace rather than accessing them directly (via Matrix::) (this is probably how I should have done it in the first place anyway); 2. relaxing the check in is.udv()
@willwerscheid Let me know if I can test your implementation, or if you are stuck.
I tried flash(list(u=as.matrix(X),d=rep(1/ncol(X), ncol(X)),v=as.matrix(X)))
if X
is a sparse dgCMatrix, and it worked. But in GBCD we also need to decompose XX' - epsilon*I (where I is identity matrix), does this trick still work here?
ok, version 1.0-11 should work now for "udv" objects where u
and v
are of class Matrix
@YushaLiu i don't think so since XX' - \epsilon I is typically full rank
@willwerscheid so does that mean that we are not able to use this trick to speed up GBCD, where the diagonal term seems unavoidable?
I'm looking through your code @willwerscheid. It seems like it shouldn't be too hard to extend your code to allow for UV^T + D
, where D
is diagonal, and is by default a matrix of zeros?
@YushaLiu I'm wondering if this could be helpful to address one of the reviewer comments about computational complexity, and specifically about handling large (single-cell) data sets?
and in general one should be able to do UV' + S where S is sparse?
@pcarbo I would need to write new "nmode" functions, but you are correct that it would not be too hard (I did design the package to be easily extensible to new data structures...). What kind of timeline are we working with?
@stephens999 yes that is smart. strictly speaking we would not need to enforce sparsity in S (even though it would be pretty pointless to have a dense matrix there). what would we prefer the data structure to look like? a list with "u", "v", and "s"? does a precedent for storing this kind of matrix come to mind that we can borrow from?
i am doing a very quick review of the literature and so far i like "U", "V", and "S" (capital letters)
@willwerscheid What functions need to be updated for this besides nmode.prod.vec
? I'm wondering if the interface would be simpler and more general if you took as input a function (or a list of functions). Would that work? For an example of this, see the svds
function in the Rspectra
package.
@pcarbo I believe that all of the functions in ops_nmode_products.R
would need to be updated, and I would not rule out that one or two functions elsewhere would need a look (if so they should probably be moved in case we want to do this again). i will think about whether it makes sense to make it general (similar to Rspectra
) as I do the updates
Okay, in the interest of not overcomplicating then, you may prefer to stick with UV^T + S
.
I'm looking through your code @willwerscheid. It seems like it shouldn't be too hard to extend your code to allow for
UV^T + D
, whereD
is diagonal, and is by default a matrix of zeros?@YushaLiu I'm wondering if this could be helpful to address one of the reviewer comments about computational complexity, and specifically about handling large (single-cell) data sets?
@pcarbo @willwerscheid Yes, if we can make the trick work for XX' - epsilon*I, it will greatly speed up GBCD for handling large datasets, where the number of cells is much greater than the number of genes. Karl has a dataset with 160k cells that he wants to apply GBCD to, but our current GBCD implementation cannot scale up to data of this size. And as Matthew said, if we can take advantage of the sparsity of X, that could lead to further speed up.
this has now been implemented and is part of the current GBCD workflow
The use case is where X is a sparse matrix (from Matrix package) and we want to apply flashier to XX'.
I hoped this would work:
flash(list(u=X,d=rep(1,ncol(X)),v=X))
but it reports an error
This seems to be because it requires u and v to be a matrix (not Matrix).
Will it work to simply relax the check so it allows u, v to be Matrix?