mooreann closed this issue 6 years ago
Hi,
Thanks for pointing this out! I've updated SWNE to automatically convert data frames or dense matrices to dgCMatrix format. Unfortunately, some of the C++ functions I'm using require dgCMatrix input, so the data currently has to be sparse.
Let me know if this helps! Yan
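For anyone on an older SWNE build without the automatic conversion, a manual coercion along these lines should also work (a minimal sketch assuming the Matrix package; the `counts.df` data frame here is just a stand-in for your own raw counts, and newer Matrix versions may prefer coercing to `"CsparseMatrix"` instead):

```r
library(Matrix)

# Small example data frame standing in for a raw counts table.
counts.df <- data.frame(cell1 = c(0, 2, 0), cell2 = c(1, 0, 3),
                        row.names = c("GeneA", "GeneB", "GeneC"))

# Coerce to a dense matrix first, then to a sparse dgCMatrix for SWNE.
counts.sparse <- as(as.matrix(counts.df), "dgCMatrix")
```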
Thanks for the response. I've gotten past the formatting issue and am now running the NMF decomposition. My object is quite large, with dimensions of about 50,000 by 16,000, and I've yet to wait long enough for the NMF decomposition to complete since it seems to take quite a while. Have you tested your functions with data sets this large, or do you have an estimate of how long some of the steps may take? Thanks in advance.
Ah yes, this is definitely a flaw with SWNE right now, and we're currently working on decreasing the runtime.
The main bottlenecks are the ICA initialization and then the NMF itself, both of which scale well with the number of columns but can be slow with large numbers of features (rows). I'd imagine 50,000 features could take quite a while. I'd suggest using random initialization instead of ICA, which could speed up the runtime quite a bit. You could also try decreasing the max.iter parameter in the RunNMF function.
If there's any way to do feature selection for your dataset, I'd recommend trying that as well.
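Putting those suggestions together, the call might look something like this (a sketch only; the `init` and `max.iter` argument names are assumed from the walkthrough, and `norm.counts`, `var.genes`, and `k` come from the earlier workflow steps):

```r
# Hypothetical sketch: random initialization plus a reduced iteration cap
# to cut NMF runtime on a large (e.g. 50,000 x 16,000) matrix.
nmf.res <- RunNMF(norm.counts[var.genes,], k = k,
                  init = "random",  # skip the slow ICA initialization
                  max.iter = 100,   # assumed: lower cap trades accuracy for speed
                  n.cores = 16, loss = "mse")
```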
Sorry to keep the questions coming, but regarding preprocessing with Seurat and importing the Seurat object into the SWNE package: is it possible to use the pre-clustered object, and is SWNE able to do clustering itself? Or is that step essential in Seurat, with SWNE primarily for visualizing the already-created clusters? Also, in terms of finding the k with the lowest error, does the range to check only go up to 10? I keep getting errors when I test with higher numbers, though in my Seurat analysis I used 20 PCs.
No worries! I'm always happy to answer questions. SWNE doesn't do any sort of clustering by itself; it's primarily for visualization. You can pull the clusters out of the Seurat object using something like:
clusters <- seurat@ident
names(clusters) <- seurat@cell.names
Hmm, the range of k shouldn't stop at 10. Can you share a screenshot of the error you're getting? It may have something to do with memory constraints, since it seems like your input matrix is very large. Also, k isn't constrained by the number of PCs you've used; they're two separate parameters.
I do understand that PCs and k are different, but in your Seurat/SWNE walkthrough you mentioned that a hack is to set k equal to the number of PCs if one wants to avoid running FindNumFactors, so I'm assuming the best k would be somewhere around that number. In my case I used 20 PCs in my Seurat clustering, but when I change the range of ks to run through to find the best value:
k.res <- FindNumFactors(norm.counts[var.genes,], k.range = seq(10,20,10), n.cores = 16, do.plot = T, loss = "mse")
I get the following error message:
Error in data.frame(y = err.del.combined, x = factor(names(err.del.combined), : arguments imply differing number of rows: 2, 0
I've tried several ranges of numbers and find that any time the max number is over 10 I seem to get this error.
Ahh, I think I figured it out: there's a bug when you're only trying two values of k, and in this case you're only trying 10 and 20. Would it be possible to use something like k.range = seq(10, 20, 5) for now?
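For reference, here's the difference between the two ranges in base R:

```r
# seq(from, to, by) with by = 10 yields only two k values, which trips the bug.
k.range.bad <- seq(10, 20, 10)  # 10, 20
# by = 5 yields three values and avoids it.
k.range.ok <- seq(10, 20, 5)    # 10, 15, 20
```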
Ok, last quick question regarding picking a k. When I get the graph of the amount of error for different values of k, I see two lines, one labeled "original" and another labeled "randomized". Should I favor one of these over the other when picking a k? What exactly are they referring to?
Ah, can you update SWNE? We've made it so that there's only one line, and you're looking for the k where that line first crosses zero. We've also changed it back so that the output of FindNumFactors returns a list again, and one of the list elements is the optimal k.
The line refers to the error reduction above noise. Every time we increase the value of k, we look at the decrease in reconstruction error for your input matrix, and then the decrease in reconstruction error for a randomized input matrix. Once the decrease in reconstruction error for the input matrix falls below the decrease for the randomized matrix, there's no more benefit to increasing k, so we keep that value of k. Sorry, I know this can be a little confusing.
Let me know if you have any more questions!
Also, I'm currently having an issue embedding genes onto the plot. It's currently using 18 factors based on the "original" line from the graph output of FindNumFactors, but when I run EmbedFeatures I'm getting this error:
Error in feature.assoc[features.embed, ] : subscript out of bounds
Currently this is my code:
nmf.resMouse$W <- ProjectFeatures(norm.countsMouse, nmf.scoresMouse, loss = "mse", n.cores = 16)
swne.embeddingMouse <- EmbedFeatures(swne.embeddingMouse, nmf.resMouse$W, genes.embed, n_pull = 4)
I'm just not sure, with my number of factors being so high, whether there's another variable value I need to increase to compensate. Thank you so much for all your help at this point, by the way.
No worries! Can you check if all the genes.embed are in your gene expression matrix? Something like:
all(genes.embed %in% rownames(nmf.resMouse$W))
A quick fix could be:
genes.embed <- genes.embed[genes.embed %in% rownames(nmf.resMouse$W)]
Ok, yeah, after checking my gene names I realized the problem. But now, after updating my installation, when I run FindNumFactors on a couple of data sets I'm getting this error, which I've never seen before:
no non-missing arguments to min; returning Inf
Error in k.range[[min.idx]] : subscript out of bounds
with code:
k.resMouse <- FindNumFactors(norm.countsMouse[var.genesMouse,], k.range = seq(10,20,2), n.cores = 16, do.plot = T, loss = "mse")
Is there a lower limit to the numerical value of k to check?
Hi @yanwu2014, I am also now running into this same issue with the updated SWNE. Any information as to what changed in the functionality?
Best, Dylan
Should be fixed in the latest commit.
Sorry about that: FindNumFactors was looking for the first value of k where the error reduction above noise falls below zero, and if that never happened it would throw an error. Now FindNumFactors falls back to the minimum error reduction above noise if it never crosses zero.
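In plain base R, that selection rule can be sketched like this (a hypothetical illustration of the logic, not SWNE's actual source; `pick_k` and `err.above.noise` are names invented for this example):

```r
# Pick k: the first k where the error reduction above noise drops below zero,
# falling back to the k with the minimum error reduction if none do.
pick_k <- function(k.range, err.above.noise) {
  below <- which(err.above.noise < 0)
  if (length(below) > 0) {
    k.range[[min(below)]]                   # first k crossing below zero
  } else {
    k.range[[which.min(err.above.noise)]]   # previously an error; now a fallback
  }
}

# Example: the error reduction crosses zero at the third k (k = 14).
pick_k(seq(10, 20, 2), c(0.5, 0.2, -0.1, -0.3, 0.1, 0.0))  # 14
```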
Closing this thread if there are no more issues!
I'm trying to filter the data in my Seurat matrix and extract the norm counts based on the posted walkthrough. First, I was wondering how necessary filtering is, given that my imported object was already filtered in Seurat. My bigger question, though, is that both the filtering and the norm-count extraction seem to require the raw data of the object to be an S4 object of class dgCMatrix. The object I'm trying to use has the data slot in this format, but the raw data is still an S3 data frame. Is there any way to get around this, or is there a way to reformat the raw data within the object?