zdebruine / RcppML

Rcpp Machine Learning: Fast robust NMF, divisive clustering, and more
GNU General Public License v2.0

Is the number of iterations for convergence significant? #42

Closed wudustan closed 1 year ago

wudustan commented 1 year ago

I'm running NMF on several 10x samples with a tol of 1e-08. For some samples, ~100 iterations are enough to converge; for others, it takes closer to 1000.

Does this indicate anything specific about the underlying data / chosen k? I ran crossValidate and selected a k that looked suitable across all samples rather than tuning k per sample - could this be the reason?

What would you suggest as the best approach here?

Cheers

zdebruine commented 1 year ago

tol = 1e-8 is much tighter than you probably need; I usually use no more than tol = 1e-5. 100 iterations are likely more than enough for almost any factorization.
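A minimal sketch of what that looks like in practice (here `A` stands in for your normalized counts matrix, and `k = 10` is an arbitrary placeholder rank; `tol`, `maxit`, and `seed` are arguments of `RcppML::nmf`):

```r
library(RcppML)
library(Matrix)

# A: sparse gene x cell matrix of normalized counts (dgCMatrix)
# tol = 1e-5 is usually sufficient; maxit caps the number of iterations
model <- nmf(A, k = 10, tol = 1e-5, maxit = 100, seed = 123)
```

If the fit routinely hits `maxit` before reaching `tol`, that is the signal that the tolerance is tighter than the data supports.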

I'm not sure what you mean by running crossValidate on all samples... usually you would combine all samples together and then run NMF? Do you mean batches? If so, it may be helpful to combine all samples from all batches and then learn a joint model. You can later look for factors that capture batch effects.

A bit more explanation or figures of your cross-validation results would be helpful!
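To illustrate the joint-model suggestion, a sketch of cross-validating rank on a single combined matrix (`A_combined` is a hypothetical name for all samples column-bound together; check `?crossValidate` for the exact argument names in your RcppML version):

```r
library(RcppML)

# cross-validate over a range of ranks on the combined matrix
cv <- crossValidate(A_combined, k = 2:30, reps = 3)
plot(cv)  # look for the rank at which test-set error bottoms out
```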

wudustan commented 1 year ago

> I'm not sure what you mean by running crossValidate on all samples

I'm trying to replicate the methodology in this paper. The author ran snmf/r from the NMF package on a counts matrix from each tumour separately, derived the metagenes, then integrated the common metagenes into 'metaprogrammes' of expression.

To that end, they ran NMF on each sample individually with the same hypervariable-gene subset and a fixed rank.
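A sketch of that per-sample workflow, substituting `RcppML::nmf` for snmf/r (`sample_list` and `hvg` are hypothetical names for a list of per-tumour count matrices and the shared hypervariable-gene vector; `k = 6` mirrors the paper's fixed rank):

```r
library(RcppML)

# sample_list: named list of gene x cell count matrices, one per tumour (hypothetical)
# hvg: shared character vector of hypervariable gene names (hypothetical)
models <- lapply(sample_list, function(counts) {
  nmf(counts[hvg, ], k = 6, tol = 1e-5)
})

# collect the gene x factor (W) matrix from each per-sample model as its metagenes
# (access may be m$w or m@w depending on the RcppML version)
metagenes <- lapply(models, function(m) m$w)
```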

Having run crossValidate on raw counts of my samples for a max rank of 30, here is my CV plot:

[figure: per-sample cross-validation plot (plot_zoom_png-11)]

If I run the full set of samples together as a single matrix:

[figure: cross-validation plot for the combined matrix (plot_zoom_png-12)]

zdebruine commented 1 year ago

Nice, thanks!

Three possibilities to consider:

  1. The best rank is actually one. By "best", I mean the rank that will generalize best to any randomly withheld or unseen test set. I highly doubt this is the case, at least between samples.
  2. The data is not properly normalized. A variance-stabilizing transformation (VST), or a pseudo-VST such as standard log-normalization, works well for single-cell data. Avoid TF-IDF (common for scATAC): it moves data from a negative binomial distribution to a frequency distribution, which is less suited for NMF because it is not a linearly additive measure of feature contribution to cellular state.
  3. Lack of statistical power. Complex and noisy signals can require large sample sizes to yield robust recovery of a good solution.
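On point 2, a minimal sketch of standard log-normalization (library-size scaling followed by log1p; the 10,000 scale factor is a common convention, not a requirement):

```r
library(Matrix)

# counts: sparse gene x cell matrix of raw UMI counts
lib_size <- Matrix::colSums(counts)
norm <- t(t(counts) / lib_size) * 1e4  # counts per 10k
lognorm <- log1p(norm)                 # pseudo-VST, suitable input for NMF
```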

The original authors' concept of extracting metagenes is very attractive in theory, and you can do as you wish as you seek to replicate their results in practice. But if you cannot learn generalizable factors from individual samples, you are likely not sufficiently powered and need more data. Alternative modeling frameworks could also be explored, such as NNLS.

The primary danger here is that even if results appear to provide true-positive gene associations within metagenes, to the point that the biology looks beautiful, an overfit NMF model will nevertheless have a higher false-positive-to-true-positive ratio than a simple rank-1 model, and thus be more misleading than helpful.

I didn't read the entire paper, but I did read the methods section on NMF, and frankly it stinks of a lack of statistical/scientific rigor (arbitrary choice of six factors for each sample, hard-threshold selection of the top k genes, etc.). I am very surprised these issues were not raised during peer review. This is a wonderful application of the method, and I'd love to see it work, so let me know if you find any solutions!