zdebruine / RcppML

Rcpp Machine Learning: Fast robust NMF, divisive clustering, and more
GNU General Public License v2.0

Interpretation of NMF results #54

Closed: KoichiHashikawa closed this issue 1 month ago

KoichiHashikawa commented 1 month ago

Hello,

We have been using RcppML to identify disease-related modules and have had a great experience. I really appreciate that you have developed this fast and robust NMF package; it is an impactful contribution to our community.

I have a few questions and wanted to hear your insights.

  1. Feature selection and feature size. In addition to using variable genes from scRNA-seq data, I find RcppML::nmf works well with disease-related genes (up-regulated DEGs in disease), decomposing ~1000-3000 DEGs into 20-50 co-regulated modules. I wonder if there is a limit on the number of features we can feed into NMF. In some projects we have only ~200-400 DEGs derived from bulk-seq data. I thought NMF might be useful for decomposing these genes into modules using the corresponding scRNA-seq data, but I am not sure whether 200-400 genes are enough to run NMF. I tried it, and in this particular case RcppML::crossValidate gave an optimal number of components of 4-5 (which is not too bad given the small feature size).

  2. NMF scores for selecting top genes in each component. It seems that each component of the gene x components matrix sums to 1. What is a desirable way to determine the top contributing genes in each component? I tried: (a) z-scoring genes within each component and picking genes with Z > 1.96, and (b) z-scoring genes across all components (since the components are scaled to sum to 1) and picking genes with Z > 1.96 in each component; both are sketched below. In Pelka et al., 2021, Cell (https://pubmed.ncbi.nlm.nih.gov/34450029/), the authors applied a scaling to order genes within each NMF component and then picked the top 100-150 genes per component. That is more inclusive than the z-score approach, but I wonder how to determine the right number of top genes (e.g., ~100) from NMF.
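Concretely, the two approaches I tried look like the sketch below (hypothetical code, not from the package; `A` stands for the input genes x cells matrix and `model` for an earlier `RcppML::nmf(A, k)` fit):

```r
# Hypothetical sketch of the two z-score selections described above
w <- model$w                  # genes x factors basis (dev versions use model@w)
rownames(w) <- rownames(A)    # restore gene names if nmf() dropped dimnames

# (a) z-score within each component, keep Z > 1.96
z_within <- scale(w)          # scale() centers and scales each column
top_a <- lapply(seq_len(ncol(w)),
                function(j) rownames(w)[z_within[, j] > 1.96])

# (b) z-score across all components at once (columns are comparable
# because each sums to 1), keep Z > 1.96 per component
z_all <- (w - mean(w)) / sd(w)
top_b <- lapply(seq_len(ncol(w)),
                function(j) rownames(w)[z_all[, j] > 1.96])
```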

Thanks so much!! Koichi

zdebruine commented 1 month ago

Hi Koichi,

Thanks for your kind words, and happy to help!

  1. Unlike PCA, which can become less useful when non-highly-variable genes are included in the analysis, NMF can handle a very large feature space regardless of whether features are highly variable. That is the nature of the additive decomposition.

> I wonder if there is a limit in the size of features that we can feed into NMF.

No, you just want to make sure all features have at least somewhat comparable signal, so a normalization may be needed if some features are several orders of magnitude stronger in signal than others.
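For example (a minimal sketch, not something RcppML itself prescribes; `counts` is a hypothetical genes x cells sparse count matrix):

```r
library(Matrix)
library(RcppML)

# log1p compresses features that are orders of magnitude stronger, while
# keeping the matrix sparse (log1p(0) == 0) and non-negative
A <- log1p(counts)
model <- nmf(A, k = 10, seed = 42)
```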

> but am not sure if 200-400 genes are enough to run NMF.

They can be, as long as you can verify the decomposition is meaningful and consistent with domain knowledge.

> RcppML::crossValidate gave optimal size of components as 4-5

You may want to look at these graphs yourself and see whether you can get by with a larger rank without dramatically increasing test-set reconstruction error.
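Something along these lines (a sketch; crossValidate() ships with the GitHub version of RcppML, and its arguments and plot method may differ between versions):

```r
# Compute test-set reconstruction error across a range of ranks and
# inspect the curve rather than taking the single best value
library(RcppML)

cv <- crossValidate(A, k = 2:15, reps = 3)   # A: your genes x cells matrix
plot(cv)   # look for where the curve flattens, not just the minimum
```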

  2. NMF scores for selecting top genes in each component. You want to be very cautious doing this; it is, and will always be, a subjective process. Each factor has a different "shape" or "distribution" of loadings. In single-cell transcriptomics models, the factor explaining the most signal (NMF_1) will generally have a large number of moderately enriched features, while the factor explaining the least signal (NMF_k) will have just a few strongly enriched features and all others very low. However, different datasets with different decomposition characteristics will have different properties. I have no general solution for this; it is an exercise for us as much as for you, and it depends on what you are factorizing and how you are using the model to inform downstream analysis.
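A quick way to see those shapes (plain base R, nothing package-specific; `w` is the genes x factors basis matrix):

```r
# Histogram each factor's loadings to compare their distributions
op <- par(mfrow = c(2, 3))
for (j in seq_len(min(6, ncol(w)))) {
  hist(w[, j], breaks = 50, main = paste0("NMF_", j), xlab = "gene loading")
}
par(op)   # restore previous plotting parameters
```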

-Zach

KoichiHashikawa commented 1 month ago

Thanks so much, Zach, for the quick turnaround and for sharing your deep insights; they are very helpful! I will be careful when selecting representative features in each component.