zdebruine / RcppML

Rcpp Machine Learning: Fast robust NMF, divisive clustering, and more
GNU General Public License v2.0
89 stars 15 forks source link

How to emulate functionality from NMF package for estimating the factorization rank #39

Closed wudustan closed 1 year ago

wudustan commented 1 year ago

I am interested in using this package to run an NMF analysis on some sparse data. The NMF R package from CRAN, provides the NMF rank survey functionality when using a range of values for k. I was wondering whether it's possible to obtain the sparseness/cophenetic/silhouette metrics per rank from RcppML::nmf()?

zdebruine commented 1 year ago

@wudustan try using the dev version from GitHub and check out the crossValidate function. This function applies Wold's Monte Carlo cross-validation to determine the optimal rank -- it is much better than any of cophenetic/silhouette metrics for determining rank. Let me know if you have questions on how to use that function!

wudustan commented 1 year ago

I'm getting the following error:

Error in Rcpp_nmf_dense(data, mask_matrix, tol, maxit, getOption("RcppML.verbose"),  : 
  c++ exception (unknown reason)

Could you suggest a reason?

My data:

> dim(b169.dmso.frozen.nmf.mat)
[1] 3000 5907

> b169.dmso.frozen.nmf.mat[1:5, 1:5]
         B169.DMSO.FROZEN_1 B169.DMSO.FROZEN_3 B169.DMSO.FROZEN_5 B169.DMSO.FROZEN_7 B169.DMSO.FROZEN_8
VIM                4.346295           7.397843           8.200908           5.429212           8.478549
HIST1H4C           9.237470           6.009916           8.941542           5.770367           6.146523
UBE2C              7.773624           2.964877           7.697491          10.865464          10.649683
CENPF              6.568869           5.974733           7.942087           9.413728           9.651910
HIST1H1E           6.746004           6.719570           7.642481           1.937257           5.363575

> class(b169.dmso.frozen.nmf.mat)
[1] "matrix" "array" 

The function call: b169.dmso.frozen.ranks <- crossValidate(b169.dmso.frozen.nmf.mat, k = 2:8, verbose = TRUE)

GCC:

more ~/.R/Makevars
VER=-12
CC=gcc$(VER)
CXX=g++$(VER)
CXX11=g++$(VER)
CXX14=g++$(VER)
CXX17=g++$(VER)
CFLAGS=-mtune=native -g -O2 -Wall -pedantic -Wconversion
CXXFLAGS=-mtune=native -g -O2 -Wall -pedantic -Wconversion
FLIBS=-L/usr/local/Cellar/gcc/12.2.0/lib/gcc/12

Session info:

R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:

[1] RcppML_0.5.6                pagoda2_1.0.10              igraph_1.3.5                Matrix_1.5-3               
 [5] reshape2_1.4.4              EnsDb.Hsapiens.v75_2.99.0   ensembldb_2.20.2            AnnotationFilter_1.20.0    
 [9] GenomicFeatures_1.48.4      AnnotationDbi_1.58.0        kneedle_1.0.0               readxl_1.4.1               
[13] UCell_2.0.1                 hues_0.2.0.9000             patchwork_1.1.2             schex_1.10.0               
[17] shiny_1.7.4                 cowplot_1.1.1               scater_1.24.0               scuttle_1.6.3              
[21] SingleCellExperiment_1.18.1 harmony_0.1.1               Rcpp_1.0.9                  SeuratWrappers_0.3.1       
[25] Seurat_4.3.0                SeuratObject_4.1.3          sp_1.5-1                    dplyr_1.0.10               
[29] ggplot2_3.4.0               stringr_1.5.0               ShortRead_1.54.0            GenomicAlignments_1.32.1   
[33] SummarizedExperiment_1.26.1 Biobase_2.56.0              MatrixGenerics_1.8.1        matrixStats_0.63.0         
[37] Rsamtools_2.12.0            GenomicRanges_1.48.0        Biostrings_2.64.1           GenomeInfoDb_1.32.4        
[41] XVector_0.36.0              IRanges_2.30.1              S4Vectors_0.34.0            BiocParallel_1.30.4        
[45] BiocGenerics_0.42.0        
zdebruine commented 1 year ago

I am not able to replicate using random dense matrices. I'm wondering if there is something specific to your matrix that is causing this?

set.seed(123)
A <- matrix(runif(300 * 500), 300, 500)
model <- RcppML::crossValidate(A, 2:8, verbose = TRUE)

I am not sure if you can provide a reproducible example somehow?

wudustan commented 1 year ago

This is very weird. At the office it wasn't running. I got home and saw your message, gave it another try and now it's running fine. The only change is I opened a new R session but restarting R was the first thing i tried when I got the c++ error. Thank you for taking the time to try and replicate. I'm really not sure what the issue was.

The speed of this is very impressive. Could you give me some info on how to interpret the output of the crossValidate() I ran it for 2:20 ranks.

ffcdfdbb-c986-4b40-8d80-0ad0058d6f90

zdebruine commented 1 year ago

Sure, glad it works for you. The optimal rank appears to be somewhere around k = 15. I'd do + scale_y_continuous(limits = c(1, 1.1) and that might help you get a better idea. Those last outliers are screwing with the visualization a bit.

wudustan commented 1 year ago

aede28df-c5c7-49f1-86d4-71008d8a53eb Here it is constrained. Do you think it's worth going beyond 20 ranks?

zdebruine commented 1 year ago

No, you want k = 12.

wudustan commented 1 year ago

Thanks for your help.