yanwu2014 / swne

Similarity Weighted Nonnegative Embedding (SWNE), a method for visualizing high dimensional datasets
BSD 3-Clause "New" or "Revised" License

primary steps uploading seurat object #6

Closed mooreann closed 6 years ago

mooreann commented 6 years ago

I'm trying to filter the data of my Seurat matrix and extract the norm counts based on the posted walkthrough. First, I was wondering how necessary it is to filter, given that my imported object was already filtered in Seurat. My bigger question, though, is that both the filtering and the extraction of norm counts seem to require the raw data of the object to be an S4 object of dgCMatrix class. The current object I'm trying to use has the data in this format, but the raw data is still an S3 data frame. Is there any way to get around this, or is there a way to reformat the raw data within the object?

yanwu2014 commented 6 years ago

Hi,

Thanks for pointing this out! So I've updated SWNE to automatically convert data frames or dense matrices to dgCMatrices. Unfortunately some of the C++ functions I'm using require them to be in dgCMatrix format so it currently has to be sparse.

Let me know if this helps! Yan
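For anyone on an older SWNE version, the conversion can also be done by hand before calling the filtering functions. The snippet below is a minimal sketch, not SWNE's internal code: it assumes the raw counts are a numeric data frame with genes as rows and cells as columns, and `raw.df` and `raw.sparse` are hypothetical names.

```r
library(Matrix)

# Toy stand-in for a raw count table stored as a data frame (genes x cells).
raw.df <- data.frame(
  cell1 = c(0, 3, 0),
  cell2 = c(1, 0, 0),
  cell3 = c(0, 0, 5),
  cell4 = c(2, 0, 0),
  row.names = c("geneA", "geneB", "geneC")
)

# Go through a dense matrix first, then to a column-compressed sparse matrix.
raw.sparse <- Matrix(as.matrix(raw.df), sparse = TRUE)

class(raw.sparse)  # check the resulting sparse class
dim(raw.sparse)    # gene and cell counts are unchanged
```

For a square, symmetric input `Matrix()` may choose a symmetric sparse class instead of dgCMatrix, so it is worth checking `class()` on the result.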

mooreann commented 6 years ago

Thanks for the response. I've gotten past the formatting issue and am now running the NMF decomposition. My object is quite large, with dimensions of about 50,000 by 16,000, and I've yet to wait long enough for the NMF decomposition to complete since it seems to take quite a while. Have you tested some of your functions with data sets this large, or do you have an estimate of how long some of the steps may take? Thanks in advance.

yanwu2014 commented 6 years ago

Ah yes this is definitely a flaw with SWNE right now and we're currently working on decreasing the runtime.

So the main bottlenecks are the ICA initialization and then NMF, both of which scale very well with the number of columns but can be slow with a large number of features (rows). I'd imagine 50,000 could take quite a while. I'd suggest using random initialization instead of ICA, which could speed up the runtime by quite a bit. You could also try decreasing the max.iter in the RunNMF function.

If there's any way to do feature selection for your dataset, I'd recommend trying that as well.
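One simple way to cut the number of rows is to keep only the most variable genes. The following is a rough sketch using plain row variances on a toy dense matrix. It is not the variable-gene selection that Seurat or SWNE perform, and `toy.counts`, `n.keep`, `top.genes`, and `toy.subset` are hypothetical names.

```r
set.seed(1)

# Toy normalized expression matrix: 200 genes (rows) x 30 cells (columns).
toy.counts <- matrix(
  rexp(200 * 30), nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("cell", 1:30))
)

# Rank genes by variance across cells and keep the top n.keep.
n.keep <- 50
gene.var <- apply(toy.counts, 1, var)
top.genes <- names(sort(gene.var, decreasing = TRUE))[seq_len(n.keep)]

# Subset rows; this smaller matrix is what would go into the NMF step.
toy.subset <- toy.counts[top.genes, , drop = FALSE]
dim(toy.subset)
```

Since the runtime is said to depend mainly on the number of rows, dropping low-variance genes before the decomposition is the cheapest thing to try first.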

mooreann commented 6 years ago

Sorry to keep the questions coming, but when preprocessing with Seurat and importing the Seurat object into the SWNE package, is it possible to use the preclustered object? Is SWNE able to do clustering itself, or is that step essential in Seurat and SWNE is primarily just for visualizing those already created clusters? And in terms of finding the k with the lowest error, does the range to check only go up to 10? I keep getting errors when I test with higher numbers, though in my Seurat analysis I used 20 PCs.

yanwu2014 commented 6 years ago

No worries! I'm always happy to answer questions. So SWNE doesn't do any sort of clustering by itself, it's primarily for visualization. You can pull the clusters out of the Seurat object using something like:

clusters <- seurat@ident
names(clusters) <- seurat@cell.names

Hmm the range of k shouldn't stop at 10. Can you share a screenshot of the error you're getting? It may have something to do with memory constraints since it seems like your input matrix is very large. Also k isn't constrained by the number of PCs you've used, they're two separate parameters.

mooreann commented 6 years ago

I do understand that PCs and k are different, but in your Seurat/SWNE walkthrough you mentioned that a hack was to set k equal to the number of PCs if one wanted to avoid running FindNumFactors, so I'm assuming the number of ks would be somewhere around that number. In my case I used 20 PCs in my Seurat clustering, but when I change the range of ks to run through to find the best value with this code:

k.res <- FindNumFactors(norm.counts[var.genes,], k.range = seq(10,20,10), n.cores = 16, do.plot = T, loss = "mse")

I get the following error message:

Error in data.frame(y = err.del.combined, x = factor(names(err.del.combined), : arguments imply differing number of rows: 2, 0

I've tried several ranges of numbers and find that any time the max number is over 10 I seem to get this error.

yanwu2014 commented 6 years ago

Ahh I think I figured it out, there's a bug when you're only trying two values of k. In this case you're only trying 10 and 20. Would it be possible to do something like k.range = seq(10,20,5) for now?
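The original call tried only two values because the third positional argument of `seq` is the step size, not the number of values. A quick check in base R:

```r
# seq(from, to, by): the third positional argument is the step size.
k.two   <- seq(10, 20, 10)  # 10 20
k.three <- seq(10, 20, 5)   # 10 15 20
k.six   <- seq(10, 20, 2)   # 10 12 14 16 18 20

length(k.two)    # 2, the case that triggered the bug described above
length(k.three)  # 3
```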

mooreann commented 6 years ago

Ok, last quick question regarding picking a k. When I get the graph of the amount of error for different values of k, I'm getting two lines, one labeled as original and another labeled as randomized. Should I consider one of these over the other when picking a k, or rather, what are they referring to?

yanwu2014 commented 6 years ago

Ah can you update SWNE? We've made it so that there's only one line, and you're looking for the k when that line first crosses zero. We've also changed it back so that the output of FindNumFactors returns a list again, and one of the list elements is the optimal k

The line refers to the error reduction above noise. So every time we increase the value of k, we look at the decrease in reconstruction error for your input matrix, and then the decrease in reconstruction error for a randomized input matrix. Once the decrease in reconstruction error for the input matrix falls below the decrease in reconstruction error for the randomized matrix, there's no more benefit to increasing k so we keep that value of k. Sorry I know this can be a little confusing.
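The arithmetic behind that curve can be sketched in a few lines of base R. This is only an illustration of the idea described above, not the FindNumFactors implementation: the error values are made up for the example, all variable names are hypothetical, and labeling each step by the k it arrives at is an assumption.

```r
# Illustrative reconstruction errors only, not output from a real run.
k.range  <- c(4, 6, 8, 10, 12)
err.data <- c(100, 80, 70, 66, 65)      # error on the input matrix
err.rand <- c(120, 115, 110, 105, 100)  # error on a randomized matrix

# Decrease in error for each step up in k (positive = error went down).
drop.data <- -diff(err.data)  # 20 10 4 1
drop.rand <- -diff(err.rand)  # 5 5 5 5

# Error reduction above noise, labeled by the k each step arrives at
# (this labeling is an assumption made for the sketch).
above.noise <- drop.data - drop.rand
names(above.noise) <- k.range[-1]
above.noise  # 15 5 -1 -4

# First k where the curve falls below zero.
k.cross <- k.range[-1][which(above.noise < 0)[1]]
k.cross  # 10
```

In this toy case the real data stops beating the randomized matrix at the step that reaches k = 10, so larger values of k would mostly be fitting noise.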

Let me know if you have any more questions!

mooreann commented 6 years ago

Also, I'm currently having an issue embedding genes onto the plot. It's currently using 18 factors, based on the line labeled as original in the graph output of FindNumFactors, but when I run EmbedFeatures I'm getting this error:

Error in feature.assoc[features.embed, ] : subscript out of bounds

Currently this is my code:

nmf.resMouse$W <- ProjectFeatures(norm.countsMouse, nmf.scoresMouse, loss = "mse", n.cores = 16)

# remake plot w key genes
swne.embeddingMouse <- EmbedFeatures(swne.embeddingMouse, nmf.resMouse$W, genes.embed, n_pull = 4)

With my number of factors being so high, I'm just not sure whether there is another variable I need to increase as well to compensate. Thank you so much for all your help up to this point, by the way.

yanwu2014 commented 6 years ago

No worries! Can you check if all the genes.embed are in your gene expression matrix? Something like all(genes.embed %in% rownames(nmf.resMouse$W))

A quick fix could be genes.embed <- genes.embed[genes.embed %in% rownames(nmf.resMouse$W)]
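The check and the quick fix can be tried on a toy matrix first. This sketch uses only base R; `W` stands in for `nmf.resMouse$W`, and the gene names are arbitrary placeholders.

```r
# Toy gene loadings matrix: rows are genes, columns are factors.
W <- matrix(
  runif(12), nrow = 4,
  dimnames = list(c("Actb", "Gapdh", "Sox2", "Pax6"), paste0("factor", 1:3))
)

genes.embed <- c("Sox2", "Pax6", "Nes")  # "Nes" is deliberately absent from W

# Indexing a matrix by a row name it does not have raises an error.
res <- tryCatch(W[genes.embed, ], error = function(e) conditionMessage(e))
res  # "subscript out of bounds"

# Report what is missing, then keep only the genes that are present.
missing.genes <- setdiff(genes.embed, rownames(W))
genes.embed <- genes.embed[genes.embed %in% rownames(W)]
```

Printing `missing.genes` before filtering makes it easier to spot naming mismatches, such as capitalization differences between gene lists, rather than silently dropping genes.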

mooreann commented 6 years ago

Ok yeah, after checking my gene names I realized that was the problem. But now, when I run FindNumFactors after updating the installation, for a couple of data sets I'm getting this, which I've never seen before:

no non-missing arguments to min; returning Inf
Error in k.range[[min.idx]] : subscript out of bounds

with code:

k.resMouse <- FindNumFactors(norm.countsMouse[var.genesMouse,], k.range = seq(10,20,2), n.cores = 16, do.plot = T, loss = "mse")

Is there a lower limit to the numerical value of k to check?

dylanmr commented 6 years ago

Hi @yanwu2014 I am also now running into this same issue with the updated swne. Any information as to the change in functionality?

Best, Dylan

yanwu2014 commented 6 years ago

Should be fixed in the latest commit.

Sorry about that, FindNumFactors was looking for the first value of k where the error reduction above noise falls below zero. If that didn't happen, then it would throw an error. Now FindNumFactors looks for the minimum error reduction above noise if it never crosses zero.
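The selection rule described here can be sketched as a small helper. This is a hypothetical function written for illustration, not the SWNE source; `pick.k` and its arguments are made-up names. In base R, calling `min()` on an empty index vector warns "no non-missing arguments to min; returning Inf", which is consistent with the message reported earlier in the thread, though the actual SWNE code may have differed.

```r
# Hypothetical helper, not the SWNE source: pick k from the
# "error reduction above noise" curve, one value per candidate k.
pick.k <- function(err.reduction, k.range) {
  stopifnot(length(err.reduction) == length(k.range))
  below.zero <- which(err.reduction < 0)
  if (length(below.zero) > 0) {
    # Normal case: first k where the curve falls below zero.
    k.range[[below.zero[1]]]
  } else {
    # Fallback: the curve never crosses zero, so take its minimum.
    k.range[[which.min(err.reduction)]]
  }
}

k.range <- seq(10, 20, 2)
pick.k(c(5, 3, -1, -2, -2, -3), k.range)  # 14
pick.k(c(9, 7, 5, 4, 3.5, 3.2), k.range)  # 20
```

The second call shows the case that used to fail: no value is negative, so the helper falls back to the k with the smallest error reduction above noise.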

yanwu2014 commented 6 years ago

Closing this thread if there are no more issues!