KoichiHashikawa closed 1 month ago
Hi Koichi,
Thanks for your kind words, and happy to help!
> I wonder if there is a limit in the size of features that we can feed into NMF.
No. You just want to make sure all features have at least somewhat comparable signal, so a normalization may be needed if some features are several orders of magnitude stronger in signal than others.
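As a rough illustration of the normalization point above, here is a minimal sketch in plain Python. It assumes a features x samples matrix stored as a list of rows and uses max-scaling per feature; that choice is an assumption made for illustration (a log transform or variance scaling are other options), and `scale_features_to_unit_max` and the toy values are hypothetical, not part of RcppML.

```python
def scale_features_to_unit_max(matrix):
    """Divide each feature (row) by its own maximum so every row peaks at 1.0.

    All-zero rows are returned unchanged. Values stay non-negative,
    which NMF requires.
    """
    scaled = []
    for row in matrix:
        peak = max(row)
        scaled.append([v / peak for v in row] if peak > 0 else list(row))
    return scaled


# Toy features x samples matrix (made-up values): the second feature is
# about three orders of magnitude stronger than the first.
toy = [
    [1.0, 2.0, 4.0],
    [1000.0, 4000.0, 2000.0],
    [0.0, 0.0, 0.0],
]
scaled = scale_features_to_unit_max(toy)
```

After scaling, the first two features span the same 0 to 1 range, so neither dominates the reconstruction error purely because of its magnitude.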
> but am not sure if 200-400 genes are enough to run NMF.
As long as you can verify the decomposition is meaningful and consistent with domain knowledge...
> RcppML::crossValidate gave optimal size of components as 4-5
You may want to look at these graphs yourself and see if you can't get by with a larger rank without dramatically increasing test set reconstruction error.
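One way to make the "larger rank without dramatically increasing test set error" idea concrete is sketched below in plain Python. It assumes you have already collected a test-set reconstruction error for each candidate rank; the error values, the 5% tolerance, and the helper name `largest_rank_within_tolerance` are all hypothetical choices for illustration, not output from or part of RcppML.

```python
def largest_rank_within_tolerance(test_error_by_rank, tolerance=0.05):
    """Return the largest rank whose test error is within `tolerance`
    (as a relative fraction) of the minimum test error."""
    best = min(test_error_by_rank.values())
    threshold = best * (1.0 + tolerance)
    acceptable = [k for k, err in test_error_by_rank.items() if err <= threshold]
    return max(acceptable)


# Made-up cross-validation results: rank -> test-set reconstruction error.
toy_errors = {2: 0.90, 3: 0.70, 4: 0.60, 5: 0.59, 8: 0.61, 12: 0.62, 20: 0.80}

# The strict minimum is at rank 5, but rank 8 is within 5% of it.
chosen = largest_rank_within_tolerance(toy_errors, tolerance=0.05)
```

With these toy numbers the strict minimum sits at rank 5, while a 5% tolerance admits rank 8. How much extra error counts as "dramatic" is a judgment call that should be checked against the plots and domain knowledge.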
-Zach
Thanks so much, Zach, for the quick turnaround and for sharing your deep insights, which are very helpful!! I will be careful when selecting representative features in each component.
Hello,
We have been using RcppML to identify disease-related modules and have had great experiences. I really appreciate that you have developed this fast and robust NMF package, which is an impactful contribution to our community.
I have a few questions and wanted to hear your insights.
Feature selection, feature size
In addition to using variable genes of scRNAseq data, I feel RcppML::NMF is working well with disease-related genes (up-regulated DEGs in disease) to decompose ~1000-3000 DEGs into 20-50 co-regulated modules. I wonder if there is a limit in the size of features that we can feed into NMF. In some programs, we have ~200-400 DEGs derived from bulk-seq data. I thought NMF might be useful for decomposing these genes into groups of modules using the corresponding scRNAseq data, but am not sure if 200-400 genes are enough to run NMF. I tried, and in this particular case, RcppML::crossValidate gave optimal size of components as 4-5 (which is not too bad given that the feature size is small).
NMF score for selecting top genes in each component
It seems that each component (gene x NMF components) sums to 1. I wonder what is a desirable way to determine the top contributing genes in each component. I tried:
a. z-score genes within each component and pick genes with Z>1.96
b. z-score genes across all components (since components are scaled to 1) and pick genes with Z>1.96 in each component.
In Pelka et al., 2021, Cell (https://pubmed.ncbi.nlm.nih.gov/34450029/), the authors did some scaling to order genes in each NMF component and then picked the top 100-150 genes in each component (more inclusive than the Z-score approach), but I wonder how to determine the right number of top genes (e.g. ~100 genes) from NMF.
Thanks so much!! Koichi
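For readers comparing options (a) and (b) from the question, the following is a minimal sketch in plain Python. It assumes the gene loadings are available as a genes x components matrix whose columns each sum to 1, as described in the question; the toy matrix, gene names, and helper functions are hypothetical and are not RcppML output or API. Population standard deviation is used for the z-scores, which is also an assumption.

```python
from statistics import mean, pstdev

Z_CUTOFF = 1.96


def select_per_component(w, genes, cutoff=Z_CUTOFF):
    """Option (a): z-score loadings within each component (column)."""
    selected = []
    for j in range(len(w[0])):
        column = [row[j] for row in w]
        mu, sigma = mean(column), pstdev(column)
        selected.append(
            [g for g, v in zip(genes, column) if sigma > 0 and (v - mu) / sigma > cutoff]
        )
    return selected


def select_global(w, genes, cutoff=Z_CUTOFF):
    """Option (b): z-score against all loadings pooled across components."""
    flat = [v for row in w for v in row]
    mu, sigma = mean(flat), pstdev(flat)
    selected = []
    for j in range(len(w[0])):
        column = [row[j] for row in w]
        selected.append(
            [g for g, v in zip(genes, column) if sigma > 0 and (v - mu) / sigma > cutoff]
        )
    return selected


# Toy loadings (made-up): component 0 is dominated by one gene,
# component 1 is spread over two genes. Each column sums to 1.
genes = [f"g{i}" for i in range(10)]
w = [[0.91, 0.05]] + [[0.01, 0.05] for _ in range(7)] + [[0.01, 0.30], [0.01, 0.30]]

per_component = select_per_component(w, genes)
pooled = select_global(w, genes)
```

In this toy case option (a) picks `g0` for the first component and `g8`, `g9` for the second, while option (b) picks only `g0`, because the pooled distribution is stretched by the one very large loading. That suggests pooled z-scoring may under-select genes in more diffuse components, though whether this matters depends on the actual loadings.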