zdebruine / RcppML

Rcpp Machine Learning: Fast robust NMF, divisive clustering, and more
GNU General Public License v2.0

how to set seed? #46

Closed by KoichiHashikawa 10 months ago

KoichiHashikawa commented 10 months ago

Hello!

I really appreciate that you have shared this fantastic package with the community. I have been using RcppML::nmf for a while and find it very useful for identifying cellular programs in scRNA-seq data.

I have a few questions about nmf:

1) What is the optimal seed value? I found that "seed" affects the results slightly (the results are largely similar irrespective of the seed value). What happens if I choose a large value like 1000 or a small value like 0? What difference does it make, and why?

2) I had a chance to run scikit-learn's NMF on the same data. Although it gave largely similar results, I see a few differences. I wonder which parts of the code could cause differences in the final outputs. Is your nmf independent of scikit-learn?

3) For single-cell RNA-seq data, can we use raw counts or log-normalized data as input to nmf?

Thank you so much!

Koichi

zdebruine commented 10 months ago

Hi Koichi,

Glad you find this useful!

  1. There is no optimal seed value. The seed is just an arbitrary number that initializes a pseudorandom number generator, and a large value is no better or worse than a small one. Pick a value in the range 1 to INT_MAX and avoid 0, which some generators handle poorly. The seed is useful only for reproducibility, not much else. Remember that NMF is an approximate method: it starts from a random initialization and does not find an exact solution, so different seeds give slightly different results. Use set.seed(123) to set the seed.

  2. Yes, this is independent of scikit-learn. RcppML re-implements NMF from scratch in C++ using basic arithmetic. It is the same in theory, but it uses faster algorithms throughout, so it is much faster.

  3. Use log-normalized data, not raw counts. Other normalizations, such as VST (variance-stabilizing transformation), work well too.
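
Points 1 and 3 can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the package's reference workflow: the count matrix is made-up toy data, the scale factor of 1e4 is a common convention rather than something specified in this thread, and the nmf call assumes the rank can be passed as the second argument and that the fitted model exposes its feature factor as `$w`.

```r
# Toy genes x cells count matrix (made-up data, for illustration only).
set.seed(1)
counts <- matrix(rpois(200 * 50, lambda = 2), nrow = 200, ncol = 50)

# Point 3: log-normalize instead of passing raw counts.
# Scale each cell (column) to a common total, then apply log(1 + x).
# The scale factor 1e4 is an assumed, commonly used convention.
lib_size <- colSums(counts)
lognorm <- log1p(sweep(counts, 2, lib_size, "/") * 1e4)

# Point 1: the seed only fixes the stream of random numbers.
# The same seed reproduces the same draws; a different seed gives other draws.
set.seed(123); init_a <- runif(5)
set.seed(123); init_b <- runif(5)
set.seed(456); init_c <- runif(5)
same_seed_matches <- identical(init_a, init_b)
diff_seed_differs <- !identical(init_a, init_c)

# Optional: fit NMF twice with the same seed and compare the factors.
# Runs only if RcppML is installed; argument and field names are assumptions.
if (requireNamespace("RcppML", quietly = TRUE)) {
  set.seed(123)
  m1 <- RcppML::nmf(lognorm, 5)
  set.seed(123)
  m2 <- RcppML::nmf(lognorm, 5)
  print(all.equal(m1$w, m2$w))
}
```

Repeating the last step with two different seeds is a quick way to see how much the factors move between runs on your own data.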

Let me know if you have further questions!